This assignment is due Wednesday, December 7 at 11:59PM.

Goals

Through this assignment you will:

Background

Please review the class slides and the readings in Jurafsky and Martin, 3rd ed., Chapter 22, on shallow discourse parsing and the Penn Discourse Treebank.

For additional information on the CoNLL16 JSON file format for shallow discourse parsing data, see the CoNLL16 data format tutorial.
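
As a rough illustration of working with this format, the sketch below parses one relation per line of JSON. The field names ("Arg1", "Arg2", "RawText", "Sense") and the choice to keep only the first listed sense are assumptions about the provided data, so check them against the tutorial and the actual files. The helper names parse_relation and read_relations and the sample relation are hypothetical.

```python
import json


def parse_relation(line):
    """Parse one JSON-encoded relation into (arg1_text, arg2_text, sense)."""
    relation = json.loads(line)
    # Assumed layout: each argument is an object that carries its raw text.
    arg1_text = relation["Arg1"]["RawText"]
    arg2_text = relation["Arg2"]["RawText"]
    # Assumption: "Sense" is a list; this sketch keeps only the first entry.
    sense = relation["Sense"][0]
    return arg1_text, arg2_text, sense


def read_relations(path):
    """Read a file with one JSON relation per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [parse_relation(line) for line in handle if line.strip()]


# A made-up relation, used only to show the expected shape of the input.
sample = json.dumps({
    "Arg1": {"RawText": "The market fell"},
    "Arg2": {"RawText": "investors stayed calm"},
    "Sense": ["Comparison.Contrast"],
})
print(parse_relation(sample))
```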

Implementing Coherence Relation Sense Classification for Shallow Discourse Parsing

Based on the examples in the text, class slides, and other resources, implement a program to perform coherence relation sense classification, one of the steps in shallow discourse parsing. Specifically, your program should:

  1. Read in GloVe embedding vectors from the provided file.
  2. Load training and test coherence relation classification data from a provided subset of CoNLL resource files
  3. Create training and test classification vectors from the data. For each shallow discourse parsing instance:
    1. For Arg1 and Arg2:
      1. Tokenize the raw text, ideally using nltk.word_tokenize()
      2. Using the corresponding GloVe embeddings of the tokens, create an averaged vector representation of the Arg
    2. Concatenate the Arg1 and Arg2 representations to make the classification vector
  4. Write the training and test instances to respective files in comma separated value format, with the sense of the instance as the last element in each line
  5. Train a classifier on the training instances. You can use whatever method for classification you'd like, including any of the classifiers in scikit-learn
  6. Test on the test instances, writing to the output file:
    1. The overall per-class F-measure
    2. For each test instance: true_label\tpredicted_label
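
Steps 1, 3, and 4 above can be sketched as follows. This is a minimal, dependency-free illustration rather than a reference solution: it uses plain Python lists where NumPy arrays would be the usual choice, and whitespace splitting as a stand-in tokenizer (nltk.word_tokenize can be passed as the tokenize argument instead). Skipping out-of-vocabulary tokens, returning a zero vector when no token is known, and applying no lowercasing are all assumptions. The function names and the toy two-dimensional embeddings are hypothetical.

```python
def load_glove(lines):
    """Map each word to its embedding; `lines` is any iterable of text lines."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


def average_embedding(tokens, embeddings, dim):
    """Average the embeddings of known tokens; all zeros if none are known."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue  # assumption: out-of-vocabulary tokens are skipped
        known += 1
        for i, value in enumerate(vector):
            total[i] += value
    if known == 0:
        return total
    return [value / known for value in total]


def build_vector(arg1_text, arg2_text, embeddings, dim, tokenize=str.split):
    """Concatenate the averaged Arg1 and Arg2 representations."""
    arg1 = average_embedding(tokenize(arg1_text), embeddings, dim)
    arg2 = average_embedding(tokenize(arg2_text), embeddings, dim)
    return arg1 + arg2


def to_csv_line(vector, sense):
    """Format one instance as comma-separated values, sense last."""
    return ",".join([str(value) for value in vector] + [sense])


# Toy embeddings in the usual "word v1 v2 ..." text layout.
toy_glove = load_glove(["the 1.0 0.0", "cat 0.0 1.0", "dog 1.0 1.0"])
toy_vector = build_vector("the cat", "dog barked", toy_glove, dim=2)
print(to_csv_line(toy_vector, "Comparison"))
```

With real data, dim would be the dimensionality of the provided embeddings, and the resulting vectors have length 2 * dim.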

Programming

Create a program hw9_coherence.sh that implements the coherence relation sense classification as specified above, invoked as:

hw9_coherence.sh <glove_embedding_file> <relation_training_data_file> <relation_testing_data_file> <training_vector_file> <testing_vector_file> <output_result_file>
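
If the shell script simply forwards its arguments to a Python program, the six positional arguments could be read as in the sketch below. This assumes a Python entry point behind hw9_coherence.sh, which the assignment does not require. The parse_args helper and the example file names are placeholders.

```python
import argparse

# Argument names mirror the invocation above, in the same order.
ARGUMENT_NAMES = (
    "glove_embedding_file",
    "relation_training_data_file",
    "relation_testing_data_file",
    "training_vector_file",
    "testing_vector_file",
    "output_result_file",
)


def parse_args(argv=None):
    """Parse the six positional arguments; argv=None reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Coherence relation sense classification"
    )
    for name in ARGUMENT_NAMES:
        parser.add_argument(name)
    return parser.parse_args(argv)


# Placeholder file names, used only to demonstrate the parsing.
args = parse_args(
    ["glove.txt", "train.json", "test.json", "train.csv", "test.csv", "out.txt"]
)
print(args.glove_embedding_file, args.output_result_file)
```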

Files

All files are found in /dropbox/22-23/au571/hw9/ on patas:

Test, Gold Standard, and Example

Submission Files
