This is an implementation of the inside-outside algorithm for estimating the probabilities of PCFG productions. While there are several places where better indexing might improve performance, I have paid attention to ensuring that the algorithm runs in n^3 time and in time linear in grammar size. I have used this program with grammars with tens of thousands of productions, trained using tens of thousands of sentences (it may take a week).
This program may be freely used for research purposes; however, I do request acknowledgement in any publications containing results that were produced with this program or code derived from this program.
io [-d debuglevel] [-R nruns] [-a alpha] [-g grammar] [-s stoptol] [-p prune] [-l maxlen] [-m minits] [-n maxits] [-S randseed] [-j jitter] [-J per-iteration-jitter] [-b annealstart] [-B annealstop] [-N nanneal] [-T tracefile] yieldfile
where:
The grammar should consist of rules, one per line, of the format
where weight is the initial rule probability used to start EM, and bias is a pseudo-count added to this rule at each iteration
Setting optimization flags makes a big difference, as the program spends a lot of its time in tight loops. I find I get a big speed up by turning on as much optimization as possible.
The top Makefile target (testrun) runs the program on small test examples.