Open-source software written by
Mark Johnson
This software is open-source, but I do request acknowledgement whenever
results produced using this software are used or incorporated into other software.
This is research software, and while I have tried to write it as well
as I can, it may still contain bugs, so users beware!
If you find any bugs, please let me know.
I believe that the programs compute what I claim they compute, but I do not guarantee this.
The programs may be poorly and inconsistently documented and may contain undocumented components, features
or modifications. I make no guarantee that these programs will be suitable for any application.
This software is provided AS IS with NO SUPPORT.
These programs have no warranty, guarantee, express or implied representation of any kind whatsoever. All other warranties including, but not limited to, merchantability and fitness
for purpose, whether express, implied, or arising by operation of law, course of dealing, or trade usage are hereby disclaimed.
- The Pitman-Yor adaptor grammar sampler
from our 2006 NIPS paper.
- Gibbs and Hastings samplers for PCFGs
(these are MCMC algorithms for computing the Bayesian version of what the
Inside-Outside algorithm computes). The
Inside-Outside PCFG estimator available below
has been updated to optionally use Variational Bayes, so it provides an
alternative way of computing the same thing as these samplers.
As far as I can tell, on small data sets the samplers are more accurate,
but on large data sets (say, more than 1 million tokens) the Variational
Bayes estimators converge much faster.
- A gzipped tar archive
containing the reranking parser
(version of August 2006) primarily written by Eugene Charniak and me
(with the assistance of many people, e.g., Matt Lease), as
described in
Eugene Charniak's and my ACL 2005 paper
and my 2005 CoNLL talk.
With some feature tweaking
it's now getting a 91.4% f-score on section 23 of the Penn WSJ treebank!
This archive contains code for completely retraining the reranker
from scratch, including:
- constructing the 20 folds of 50-best parses,
- extracting features from these 50-best parses, and
- estimating the reranker feature weights using MaxEnt, Averaged Perceptron,
etc.
You will need your own copy of the Penn Treebank and a machine with
4-8GB RAM to retrain the reranker (see the README and Makefiles).
Once trained, the parser+reranker should run in about 1/2 GB RAM
(and the tar file above includes a fully trained model which you
should be able to run out of the box).
The code is stored in a gzipped archive file. Download the archive file,
decompress it with gunzip and unpack it with tar.
If you are using GNU tar, you can decompress and unpack in
one step using tar -zxvf.
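If you would rather unpack the archive from a script, the same two steps can be sketched in Python with the standard tarfile module. The function name and the archive path are placeholders, not names used in the distribution.

```python
import tarfile

def unpack_gzipped_tar(archive_path, dest_dir="."):
    """Decompress and unpack a gzipped tar archive in one step.

    archive_path is a placeholder for wherever you saved the download.
    Returns the list of member names found in the archive.
    """
    # Mode "r:gz" gunzips while reading, much like tar's -z flag.
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        # Only unpack archives you trust: members can carry arbitrary paths.
        archive.extractall(dest_dir)
    return names
```

Calling unpack_gzipped_tar on the downloaded file with a destination directory of your choice should leave the unpacked tree in that directory.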
For fun, try the reranking parser on the Brown corpus!
Even though the reranking parser is trained only on WSJ,
it actually does surprisingly
well on Brown (better than any other parser, as far as I know,
including parsers trained on Brown). This suggests that we aren't
overtraining on our training data.
The "nonfinal" features and data used in the ACL 2007 paper with
Jianfeng Gao, Galen Andrew and Kristina Toutanova (of Microsoft
Research) can be reconstructed by setting "VERSION=nonfinal" in
the top-level Makefile, or just downloaded from
here.
If you're interested in writing new features for the reranker,
the following talk slides
may be helpful.
- The empty node restorer program
(C++ code, in a bzip2'd tar file)
from my ACL 2002 paper
A simple pattern-matching algorithm for recovering empty nodes and their antecedents.
You can also read the README
file.
- A C implementation of the
Inside-Outside algorithm for
estimating PCFGs from terminal strings.
This has an option to use Variational Bayes estimation (the -V flag) in place
of the Maximum Likelihood estimation used in the Expectation-Maximization
algorithm, which makes it comparable to the Gibbs PCFG estimators above.
This program
assumes that all terminals are introduced by unary rules.
(last updated 2nd September 2007).
You can also download
C code for the Digamma function used
in this program.
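To make the difference between the two estimators concrete, here is a minimal Python sketch of how the probabilities of the rules sharing one left-hand side might be re-estimated from expected rule counts. It assumes the standard mean-field Variational Bayes update with a symmetric Dirichlet prior, in which the Digamma function takes the place of the plain relative-frequency estimate. The page does not spell out the update used by the C program, so the formula, the function names and the prior value alpha are all assumptions.

```python
import math

def digamma(x):
    """Approximate the Digamma function psi(x) for x > 0."""
    result = 0.0
    # Use psi(x) = psi(x + 1) - 1/x to shift x into the asymptotic range.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # Asymptotic series: ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += (math.log(x) - 0.5 * inv
               - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252)))
    return result

def ml_estimate(counts):
    """Maximum Likelihood (EM M-step): relative frequency of each rule."""
    total = sum(counts.values())
    return {rule: c / total for rule, c in counts.items()}

def vb_estimate(counts, alpha=0.1):
    """Assumed Variational Bayes update: exp(psi(c + alpha) - psi(total))."""
    total = sum(c + alpha for c in counts.values())
    norm = digamma(total)
    return {rule: math.exp(digamma(c + alpha) - norm)
            for rule, c in counts.items()}
```

With expected counts of 3 and 1 for two rules expanding the same nonterminal, ml_estimate returns probabilities that sum to one, while vb_estimate returns weights that sum to less than one and discount the rarer rule more heavily.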
-
cky.tbz contains a very fast C implementation
of a CKY PCFG parser, together
with programs for extracting PCFGs from treebanks, etc.
This was used in my 1999 CL article.
(last updated 6th March, 2006)
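For readers who have not met CKY before, the following Python sketch shows the textbook Viterbi version of the algorithm for a PCFG in Chomsky normal form, with terminals introduced by unary lexical rules. It is not the C implementation in cky.tbz and makes no attempt at speed; the function name and the dictionary-based grammar representation are assumptions made for illustration.

```python
from collections import defaultdict

def cky_viterbi(words, lexical, binary, start="S"):
    """Return the probability of the best parse of words, or 0.0 if none.

    lexical maps a word to a list of (parent, prob) pairs.
    binary maps a (left, right) pair of children to a list of (parent, prob).
    """
    n = len(words)
    # chart[(i, j)] maps each nonterminal to its best score over words[i:j].
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        cell = chart[(i, i + 1)]
        for parent, prob in lexical.get(word, []):
            if prob > cell.get(parent, 0.0):
                cell[parent] = prob
    # Fill in longer spans from shorter ones.
    for width in range(2, n + 1):
        for i in range(0, n - width + 1):
            j = i + width
            cell = chart[(i, j)]
            for k in range(i + 1, j):  # split point between the two children
                for left, p_left in chart[(i, k)].items():
                    for right, p_right in chart[(k, j)].items():
                        for parent, prob in binary.get((left, right), []):
                            score = prob * p_left * p_right
                            if score > cell.get(parent, 0.0):
                                cell[parent] = score
    return chart[(0, n)].get(start, 0.0)
```

A real parser would also keep back-pointers in each cell so that the best tree, and not just its probability, can be read off the chart.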
- A minimum edit distance alignment
program in C++, and its readme file.
The useful part of this is actually the header file med.h, which
contains a generic dynamic programming aligner.
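As a rough illustration of what a dynamic programming aligner does, here is a minimal Python sketch of minimum edit distance alignment with unit costs. It is a textbook Levenshtein aligner and not the interface of med.h; the function name, argument names and default costs are assumptions.

```python
def align(source, target, sub_cost=1, indel_cost=1):
    """Return (distance, pairs) for a minimum-cost alignment.

    pairs is a list of (source_item, target_item) tuples, where None marks
    an insertion or a deletion.
    """
    n, m = len(source), len(target)
    # cost[i][j] is the cheapest way to align source[:i] with target[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * indel_cost
    for j in range(1, m + 1):
        cost[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if source[i - 1] == target[j - 1] else sub_cost
            cost[i][j] = min(cost[i - 1][j - 1] + match,
                             cost[i - 1][j] + indel_cost,
                             cost[i][j - 1] + indel_cost)
    # Trace back from the final cell to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        match = (0 if i > 0 and j > 0 and source[i - 1] == target[j - 1]
                 else sub_cost)
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + match:
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + indel_cost:
            pairs.append((source[i - 1], None))  # deletion from source
            i -= 1
        else:
            pairs.append((None, target[j - 1]))  # insertion from target
            j -= 1
    pairs.reverse()
    return cost[n][m], pairs
```

Because the function only indexes and compares items, it works on lists of words as well as on strings, which is the sense in which such an aligner can be generic.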
- The tuple-finding software
used to find collocations in:
Don Blaheta and Mark Johnson (2001)
``Unsupervised learning of multi-word verbs.''
This is C++ code, and the way the g++ compiler handles
namespaces has changed dramatically
over the last few months.
I'm ashamed to say I can't remember what version of g++ this
compiles under, but I think that all you'll need to get this
to compile under the latest g++ are a few using namespace std
and using namespace gnu_ext declarations.
-
This collocation-finding paper references an
unpublished draft paper on finding
``surprising'' pairs; you can get the
associated code for
the exact binomial and the
odds ratio interval estimators as well.
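For readers unfamiliar with interval estimates of the odds ratio, the following Python sketch computes the odds ratio of a 2x2 table of pair counts together with a Wald confidence interval on the log scale. This is a standard textbook interval chosen for illustration; the draft paper and the associated code may well use a different estimator, and the function name, argument names and default z value are assumptions.

```python
import math

def odds_ratio_interval(n11, n12, n21, n22, z=1.96):
    """Return (lower, estimate, upper) for the odds ratio of a 2x2 table.

    n11 counts the two words occurring together, n12 and n21 count each
    word occurring without the other, and n22 counts neither occurring.
    z = 1.96 gives an approximate 95% interval.
    """
    if min(n11, n12, n21, n22) <= 0:
        raise ValueError("all four cell counts must be positive")
    log_or = math.log(n11 * n22) - math.log(n12 * n21)
    # Standard error of the log odds ratio under the Wald approximation.
    se = math.sqrt(1.0 / n11 + 1.0 / n12 + 1.0 / n21 + 1.0 / n22)
    return (math.exp(log_or - z * se),
            math.exp(log_or),
            math.exp(log_or + z * se))
```

One plausible use is to rank pairs by the lower end of the interval rather than by the point estimate, so that pairs seen only a handful of times do not dominate the list.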
-
While the code is probably more than a decade old,
I still get requests for the LALR parser generator code I wrote in
CommonLisp (lalrparser.lisp) and in
Scheme (lalr.ss). I still get email from
people thanking me for this stuff, so bitrot doesn't seem to
have affected it yet!
-
This isn't really software, but the list of incorrect
parses produced by the discriminative parser (sorted so that the worst sentences come first)
may be of interest to some of you.