For the second practical assignment, you study the likelihoods assigned to sentences from the WSJ corpus (up to length 40) by three different probabilistic models: a bigram model over words, a bigram model over POS tags, and a treebank PCFG. Because n-grams can be implemented in the PCFG formalism, we can in fact treat all three models as treebank grammars extracted from three different treebanks. The treebanks you need are these three:

  wsj01-21-right-branching-m40.txt
  wsj01-21-right-branching-w-postags-m40.txt
  wsj01-21-without-tags-traces-punctuation-m40.txt

You need to give short descriptions of these treebanks, and explain which treebank grammars they define. You then calculate the data likelihood under each of these grammars. Report the total likelihood of the corpus (i.e., the sum of the log-likelihoods of the individual sentences), and the 6 x 5 sentences for which the likelihood ratios under the three grammars are largest, and try to explain why this is the case. (That is, compute the likelihood ratios L1/L2, L2/L1, L1/L3, L3/L1, L2/L3, L3/L2 for each sentence, and pick the top 5 for each ratio.)

Hand in two pages of A4 (on Blackboard) with all descriptions and tables on Thursday, February 21st, before class. (Optional, but more interesting, is to evaluate the likelihood of unseen sentences; ask me for help on how to deal with unknown words.)

To do all this, you need a working parser, and you need to know how to extract the treebank PCFG from a corpus. Given that this is an advanced MSc AI course, the following instructions are very minimal, but ask for help if you need it!

First, download and install BitPar:

  workdir=/home/yourname/ull14
  mkdir $workdir
  cd $workdir
  wget http://www.cis.uni-muenchen.de/~schmid/tools/BitPar/data/BitPar.tar.gz
  gunzip BitPar.tar.gz
  tar xvf BitPar.tar
  cd BitPar/src
  make
  cd $workdir
  ln -s BitPar/src/bitpar bitpar

(Note that the symlink target is resolved relative to $workdir, where the link is created, so it must be BitPar/src/bitpar.)

Download treebanks.zip to the same folder and unpack it with the password from the email or on the whiteboard.
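Once you have the per-sentence log-likelihoods under the three grammars, the ratio rankings can be produced with standard shell tools. Here is a minimal sketch, assuming a hypothetical tab-separated file loglik.tsv with columns sentence-id, log L1, log L2, log L3 (the file name and layout are my own choice, not part of the assignment):

```shell
# Rank sentences by log(L1/L2) = log L1 - log L2, largest first, keep the top 5.
# loglik.tsv: <id> <logL1> <logL2> <logL3>, tab-separated, one sentence per line.
awk -F'\t' '{print $1 "\t" ($2 - $3)}' loglik.tsv | sort -k2,2gr | head -5
```

Repeating this with $3 - $2, $2 - $4, $4 - $2, $3 - $4 and $4 - $3 gives all six rankings. Note that working with differences of log-likelihoods avoids the underflow you would get from dividing raw probabilities.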
Then follow my example below to run BitPar on your very first grammar. Make sure you understand each of the steps and intermediate products, so that you can make changes to it yourself. You can visualize the treebanks by running

  wget http://homepages.inf.ed.ac.uk/fsangati/Viewers/ConstTreeViewer_13_05_10.jar
  java -jar ConstTreeViewer_13_05_10.jar

and loading the treebank you want to see.

*** Example PCFG extraction

I start with a small treebank sample-bigram-w-postag.txt that has 1000 sentences annotated with right-branching trees, and includes the original WSJ POS tags. I first remove the double quotes and extract the treebank PCFG:

  cat sample-bigram-w-postag.txt | sed 's/"//g' > sample-bigram-w-postag2.txt
  java -jar PCFG_extractor.jar sample-bigram-w-postag2.txt sample-bigram-w-postag3.txt

Next, I reorder the information into the grammar file format that BitPar requires, moving the rule frequency (the last field) to the front:

  less sample-bigram-w-postag3.txt | sed 's/%/%%/g' | awk '{printf($NF " "); for (i=1;i<NF;i++) printf($i " "); printf("\n")}' > sample-bigram-w-postag4.txt

(The s/%/%%/g is necessary because % is a reserved symbol in awk.)

Then I extract a dummy 'lexicon' file from it, in the format that BitPar requires:

  less sample-bigram-w-postag4.txt | grep -E '"[^"]+"' | sed 's/\"//g' | awk '{print($3"\t"$2"\t"$1)}' > lexicon

(This might take a while for large grammars because of the sorting.)

Finally, I create the 'grammar' file by removing the remaining double quotes:

  less sample-bigram-w-postag4.txt | sed 's/"//g' > grammar

Now we're ready to parse. First I create the corpus to parse:

  head -3 sample-bigram-w-postag.txt | tail -1 | sed 's/TOP/\n\n/;s/"\([^"]\+\)"/\n\1\n/g' | grep -v '[()]' > corpus

(The TOP/\n\n business is to make sure there is white space at the end of the sentence. To parse multiple sentences you'll need to replace the head...tail construct with a simple less, and remove the first empty lines.)
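To check that the reordering step does what you expect, you can try the awk one-liner on a single made-up rule line (the rule NP -> DT NN and the count 42 are invented purely for illustration):

```shell
# A PCFG_extractor-style line ends with the rule frequency; the awk below
# prints that last field first, then the remaining fields, as BitPar expects.
echo 'NP DT NN 42' | awk '{printf($NF " "); for (i=1;i<NF;i++) printf($i " "); printf("\n")}'
# prints: 42 NP DT NN
```

The same idea works for every line of the extracted grammar, whatever the rule length, because the loop runs over all fields up to (but excluding) the last one.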
And then we call BitPar to find the likelihood of the single sentence in the corpus:

  ./bitpar -vp -s TOP -o grammar lexicon corpus

To get the likelihoods, look at the so-called Viterbi probabilities (or the Viterbi probability of the TOP node in the parse forest). I.e.:

  ./bitpar -b 1 -vp -s TOP -o grammar lexicon corpus results
  less results | grep -E '[0-9]\.[0-9]*e-[0-9]*' -o

You can save the result in a file (add: > filename.csv) and open it, for instance, in Excel or OpenOffice, or process it further with awk.
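For the total log-likelihood of the corpus, you can sum the logs of the extracted Viterbi probabilities directly in awk. A sketch, assuming a hypothetical file probs.txt containing one probability per line, as produced by the grep above (the file name is my assumption):

```shell
# Sum the natural logs of the per-sentence Viterbi probabilities.
# probs.txt: one probability in scientific notation per line, e.g. 1.234e-12
# (awk converts such strings to numbers automatically).
awk '{sum += log($1)} END {printf("total log-likelihood: %f\n", sum)}' probs.txt
```

Summing logs rather than multiplying the raw probabilities is essential here: the product over thousands of WSJ sentences would underflow to zero.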