For the second practical assignment, you study the likelihoods assigned to sentences from the WSJ corpus (up to length 40) by three different probabilistic models: a bigram model over words, a bigram model over POS tags, and a treebank PCFG. Because n-grams can be implemented in the PCFG formalism, we can in fact treat all three models as treebank grammars extracted from three different treebanks. The treebanks you need are these three:

  wsj01-21-right-branching-m40.txt
  wsj01-21-right-branching-w-postags-m40.txt
  wsj01-21-without-tags-traces-punctuation-m40.txt

You need to give short descriptions of these treebanks, and explain which treebank grammars they define. You then calculate the data likelihood under each of these grammars. Report the total likelihood of the corpus (i.e., the sum of the log-likelihoods of the individual sentences), and the 6 x 5 sentences for which the likelihood ratios under the three grammars are largest, and try to explain why this is the case. (That is, compute the likelihood ratios L1/L2, L2/L1, L1/L3, L3/L1, L2/L3, L3/L2 for each sentence, and pick the top 5 for each ratio.)

Hand in two pages of A4 (on Blackboard) with all descriptions and tables on Thursday, February 21st, before class. (Optional, but more interesting, is to evaluate the likelihood of unseen sentences; ask me for help on how to deal with unknown words.)

To do all this, you need a working parser, and you need to know how to extract the treebank PCFG from a corpus. Given that this is an advanced MSc AI course, the following instructions are very minimal, but ask for help if you need it!

First, download and install BitPar:

  workdir=/home/yourname/ull14
  mkdir $workdir
  cd $workdir
  wget http://www.cis.uni-muenchen.de/~schmid/tools/BitPar/data/BitPar.tar.gz
  gunzip BitPar.tar.gz
  tar xvf BitPar.tar
  cd BitPar/src
  make
  cd $workdir
  ln -s BitPar/src/bitpar bitpar

(Note that the symlink target is resolved relative to $workdir, where the link is created, so it must be BitPar/src/bitpar.)

Download treebanks.zip to the same folder and unpack it with the password from the email or on the whiteboard.
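Once you have the per-sentence log-likelihoods under the three grammars, the ratio rankings can be produced with standard shell tools. Here is a minimal sketch, assuming a hypothetical tab-separated file loglik.tsv with columns sentence-id, log L1, log L2, log L3 (the file name and layout are my own choice, not part of the assignment):

```shell
# Rank sentences by log(L1/L2) = log L1 - log L2, largest first, keep the top 5.
# loglik.tsv: <id> <logL1> <logL2> <logL3>, tab-separated, one sentence per line.
awk -F'\t' '{print $1 "\t" ($2 - $3)}' loglik.tsv | sort -k2,2gr | head -5
```

Repeating this with $3 - $2, $2 - $4, $4 - $2, $3 - $4 and $4 - $3 gives all six rankings. Note that working with differences of log-likelihoods avoids the underflow you would get from dividing raw probabilities.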
Then follow my example below to run BitPar on your very first grammar. Make sure you understand each of the steps and intermediate products, so that you can make changes to it yourself. You can visualize the treebanks by running

  wget http://homepages.inf.ed.ac.uk/fsangati/Viewers/ConstTreeViewer_13_05_10.jar
  java -jar ConstTreeViewer_13_05_10.jar

and loading the treebank you want to see.

*** Example PCFG extraction

I start with a small treebank sample-bigram-w-postag.txt that has 1000 sentences annotated with right-branching trees, and includes the original WSJ POS tags. I first remove the double quotes and extract the treebank PCFG:

  cat sample-bigram-w-postag.txt | sed 's/"//g' > sample-bigram-w-postag2.txt
  java -jar PCFG_extractor.jar sample-bigram-w-postag2.txt sample-bigram-w-postag3.txt

Next, I reorder the information into the grammar file format that BitPar requires, moving the rule frequency (the last field) to the front:

  less sample-bigram-w-postag3.txt | sed 's/%/%%/g' | awk '{printf($NF " "); for (i=1;i<NF;i++) printf($i " "); printf("\n")}' > sample-bigram-w-postag4.txt

(The s/%/%%/g is necessary because % is a reserved symbol in awk.)

Then I extract a dummy 'lexicon' file from it, in the format that BitPar requires:

  less sample-bigram-w-postag4.txt | grep -E '"[^"]+"' | sed 's/\"//g' | awk '{print($3"\t"$2"\t"$1)}' > lexicon

(This might take a while for large grammars because of the sorting.)

Finally, I create the 'grammar' file by removing the remaining double quotes:

  less sample-bigram-w-postag4.txt | sed 's/"//g' > grammar

Now we're ready to parse. First I create the corpus to parse:

  head -3 sample-bigram-w-postag.txt | tail -1 | sed 's/TOP/\n\n/;s/"\([^"]\+\)"/\n\1\n/g' | grep -v '[()]' > corpus

(The TOP/\n\n business is to make sure there is white space at the end of the sentence. To parse multiple sentences you'll need to replace the head...tail construct with a simple less, and remove the first empty lines.)
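To check that the reordering step does what you expect, you can try the awk one-liner on a single made-up rule line (the rule NP -> DT NN and the count 42 are invented purely for illustration):

```shell
# A PCFG_extractor-style line ends with the rule frequency; the awk below
# prints that last field first, then the remaining fields, as BitPar expects.
echo 'NP DT NN 42' | awk '{printf($NF " "); for (i=1;i<NF;i++) printf($i " "); printf("\n")}'
# prints: 42 NP DT NN
```

The same idea works for every line of the extracted grammar, whatever the rule length, because the loop runs over all fields up to (but excluding) the last one.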
And then we call BitPar to find the likelihood of the single sentence in the corpus:

  ./bitpar -vp -s TOP -o grammar lexicon corpus

To get the likelihoods, look at the so-called Viterbi probabilities (or the Viterbi probability of the TOP node in the parse forest). I.e.:

  ./bitpar -b 1 -vp -s TOP -o grammar lexicon corpus results
  less results | grep -E '[0-9]\.[0-9]*e-[0-9]*' -o

You can save the result in a file (add: > filename.csv) and open it, for instance, in Excel or OpenOffice, or process it further with awk.
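For the total log-likelihood of the corpus, you can sum the logs of the extracted Viterbi probabilities directly in awk. A sketch, assuming a hypothetical file probs.txt containing one probability per line, as produced by the grep above (the file name is my assumption):

```shell
# Sum the natural logs of the per-sentence Viterbi probabilities.
# probs.txt: one probability in scientific notation per line, e.g. 1.234e-12
# (awk converts such strings to numbers automatically).
awk '{sum += log($1)} END {printf("total log-likelihood: %f\n", sum)}' probs.txt
```

Summing logs rather than multiplying the raw probabilities is essential here: the product over thousands of WSJ sentences would underflow to zero.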