COMP 791A

Statistical Language Processing

Fall 2006

 

 

*** N E W S ***

 

Dec. 22: Here are your final marks. 

-         If there is any discrepancy with your records, let me know ASAP.

-         If you can’t find your mark for the project, check your e-mail. 

-         The letter grade will NOT be posted here.  I am not allowed to post it.

 

 

Dec. 12: More on the project.  Now that the exam is over, let’s concentrate on the project.

  • The deadline is next Monday (Dec. 18, midnight)
  • Please hand in a paper copy of the report (under my door) and an electronic version of everything else
  • There will be no official presentation
    • I will read your reports first.  If I have any questions, I will e-mail you to come and answer them in person.
    • If I don’t send you an e-mail, then the report was clear, and you don’t need to present anything.

 

Nov. 30: Topics covered in the exam:

  • ALL my slides from n-gram models to machine translation.
  • In the book, this corresponds to (but my slides contain more information):

chap 6: all sections except sections 6.2.5 and 6.2.6

chap 7: all sections except section 7.4

chap 10: all sections except sections 10.3, 10.4.3, 10.4.4 and 10.5

chap 13: all sections except pages 481-483 (Chen and Haruno & Yamazaki)

chap 15: only sections 15.1 and 15.2 (but my slides are more complete)

 

  • I’ll be in my office Monday afternoon starting at 1pm.  If you have any questions, feel free to drop by.

 

Nov. 30:  More info about the final exam.

-         Monday, Dec. 11 from 18:00-21:00 in FG-355 (Faubourg Ste-Catherine)

-         It is not cumulative

-         It will cover the following topics:

o       a little bit of n-gram modeling

o       WSD (Word Sense Disambiguation)

o       POS tagging (Part of Speech tagging)

o       Machine Translation

o       IR (Information Retrieval)

-         Although you will be given 3 hours, you won’t need the full period.

Nov. 28:  More results of presentations

-         presentation aspects

-         technical aspects

 

Nov. 22:  If you forgot when you are meeting me to discuss your project, I have posted the schedule on my office door.

 

Nov. 20: According to your project progress reports, most of you won’t finish your project by the December 4th deadline. So the deadline for the project has been extended by 2 weeks.  It will be due Monday, Dec. 18. No further extension will be given, however.

 

Nov. 7: Now that enough students have done the presentation, here are your marks compared to the others.

 

Note that I changed the weights of the qualitative assessments.

 

 

Assessment     Old mapping     New mapping
very good      3/3 (100%)      100%
good           2/3 (66%)       80%
OK             1/3 (33%)       60%
bad            0/3 (0%)        40%

 

I changed them because, in real life, “OK” means “passing”, yet the old mapping put “OK” below 50%. The change does not affect the ranking, but the new mapping corresponds better to the expected meaning of “good” and “OK”.

 

Anyway, here is the new data.

-         presentation aspects

-         technical aspects

 

Nov. 7: I remind you that next Monday (November 13):

-         is the due date for assignment 1

-         is the due date for the project progress report.

Please slip the paper version under my office door (EV 3.117)

 

 

Nov. 1: The final exam has been scheduled.  It will be: Monday, Dec. 11 from 18:00-21:00 in FG-355

 

Oct. 27: 

 

Oct. 18: Note on the presentation feedback.  Some students are very generous with comments (great!), while others give minimal feedback…

 

Every time you fill in an evaluation for a classmate’s presentation,

 

1.      I type up your comments, to give feedback to your classmate

…so if you want lots of feedback on your own presentation, it’s only fair that you give lots of feedback to your classmates too!

 

2.      I take note of whether you actively “participate” or not

…and this goes into your participation grade

 

Oct. 18: Exam 1 will cover the following material:

·         ALL slides up to and including n-gram models

·         in the book, this corresponds to:

chap 1: all sections

chap 3: all sections

chap 4: all sections

chap 5: all sections except section 5.3.4

chap 6: all sections except sections 6.2.5 and 6.2.6

 

 Oct. 12: Exam 1

    • When? Monday October 23, from 5:45pm to 7pm (1h15min)
    • Where? In class
    • What? It will cover the material seen in class up to and including language models (i.e. the material in grey in the table below)
    • Material allowed?
      1. you can bring a calculator
      2. no notes or books will be allowed.
    • There will be 2 types of questions:
      1. “explanation-based questions” e.g. what are the advantages of this? If we change this, what would happen?
      2. “fact-based questions” e.g. given some data, compute this or that.
    • Note: You don’t need to learn formulas by heart.  If you need a formula, it will be given to you.

 

Oct. 12: Someone was interested in a Perl tutorial for NLP.  If you need it too, here it is.

 

Oct. 5: Project:

-          Your project description is due Oct. 16.

-          If you need ideas or pointers, or want to discuss it beforehand, do not hesitate to contact me!

 

Oct. 2: Assignment 1:

1. The CMU-Cambridge Statistical Language Modeling toolkit

2. The SRI Language Modeling Toolkit

3. The Lemur Toolkit

 

 

Oct. 1: A clearer set of slides on the t-test (see slides 39-43): 791-04-Collocations-new.ppt

 

Sept. 21: more details about the project

 

Sept. 14: more details about the paper presentation + sample peer evaluation sheet

 

General Information:

  • What every graduate student should know about writing a research paper:

Avoiding plagiarism at Concordia

 

Required Book

 

[Manning and Schütze, 1999] Christopher D. Manning and Hinrich Schütze (1999), Foundations of Statistical Natural Language Processing. The MIT Press.

List of errors in the book

 

Another very good book

 

[Jurafsky and Martin, 2000] Daniel Jurafsky and James H. Martin (2000), Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall.

List of errors in the book

 

 

Online References:

 

 

The course, week by week...

 

Day / Topic

References

Slides for the class

Suggested Readings

Online Resources

Monday September 11: Introduction to NLP

-   [M&S]: Chap. 1

-   [J&M]: Chap. 1

 

791-01-Intro.ppt

Zipf's Law and Random Texts

-  A good place to start

-  Wikipedia on Zipf's law

 

Monday September 18: Linguistic Essentials + Corpus-Based work

- [M&S]: Chap. 3 & 4

- [J&M]: Sections 8.1, 9.1-9.8

 

791-02-Ling.ppt

791-03-Corpus.ppt

Who wrote the 15th Book of Oz?

 

-   Wikipedia on what's a word?

Monday September 25: Collocations

 

-   [M&S]: Chap. 5 (please note the errors in the book)

791-04-Collocations.ppt

 

A clearer set of slides on the t-test (see slides 39-43): 791-04-Collocations-new.ppt

Automatic Identification of Non-Compositional Phrases

 

Language is never, ever, ever random

-   Collocation extraction software

Monday October 2: 

n-grams

-   [M&S]: Chap. 6

-   [J&M]: Chap. 6

 

Nasim: Automatically Categorizing Written Texts by Author Gender

791-05-ngrams.ppt

Testing the Efficacy of Part-of-Speech Information in Word Completion

 

Using the Web to Obtain Frequencies for Unseen Bigrams

-   A random Text Generator

-   N-gram Models from Google

Monday October 9: Thanksgiving - No Class

 

 

 

 

Monday October 16:

n-grams (cont’d)

 

Word-Sense Disambiguation

 

 

-  [M&S]: Chap. 7

-  [J&M]: Sections 17.1 & 17.2

 

George: N-Gram and N-Class Models for On line Handwriting Recognition

 

Francis: Viewing Sentence Boundary Detection as Collocation Identification

 791-06-WSD.ppt

 

-  Wikipedia on WSD

-  SENSEVAL home page

Monday October 23:

 

Exam 

 

Word-Sense Disambiguation (cont’d)

 

 

  

 

791-06-WSD.ppt

 

 

Monday October 30:  

 

Word-Sense Disambiguation (cont’d)

 

- [M&S]: Chap. 10

- [J&M]: Chap. 8

Christos: An Automatic Method for Generating Sense Tagged Corpora

 

Safeya: Representing Discourse Relations: A Corpus-Based Study

 

Jehad: Word Sense Disambiguation and Information Retrieval

791-06-WSD.ppt

 

 

 

Monday November 6:

 

Part-of-Speech Tagging

 

 

Jiewen: Event Extraction from Biomedical Papers using a Full Parser

 

Ishrar: Learning Subjective Language

 791-07-pos-short.ppt

The GRACE French Part-of-Speech Tagging Evaluation Task

 

Online part-of-speech taggers

 

Monday November 13:

 

No class.

I’ll be at the TREC conference.  I’ll bring you goodies…

 

 

 

 

 

 

 

Monday November 20:

 

Text Alignment and Machine Translation

 

 

[M&S]: Chap. 13

[J&M]: Chap. 21

Shamima: Word-Sense Disambiguation for Machine Translation

 

Amit: Building HyperText Links from Semantic Similarity

 791-10-mt.ppt

 

-         Try Systran

-         Try Google Translate

-         Aligned Hansards

-         D.J. Arnold, Lorna Balkan, Siety Meijer, R. Lee Humphreys and Louisa Sadler (1994), Machine Translation: An Introductory Guide, Blackwells-NCC, London, ISBN: 1855542-17x.

-         Workshop on Statistical Machine Translation at  Johns Hopkins University

Monday November 27:

Information Retrieval

 

-     [M&S]: Sections 15.1 & 15.2

-    [J&M]: Section 17.3

Majid: A little known fact is… answering Other questions using interest-markers

 

Reda: Mining and Summarizing Customer Reviews

 791-11-ir.ppt

 

 

-          Free online book: C. J. van Rijsbergen (1979), Information Retrieval, Second Edition, Butterworth & Co., Ltd., London (ISBN: 0-408-70929-4)

-         Online IR resources 

-         ACM Special Interest Group on IR (SIGIR) 

-         TREC

-         The Porter Stemmer Web page

 

Monday December 4:

Probabilistic Grammars and Parsing

OR

Topic of your choice

 

-   [M&S]: Chap. 11 & 12

 

Shahin: Tagging gene and protein names in biomedical texts

 

Ravi: The Anatomy of a Large-Scale Hypertextual Web Search Engine

 

Julien: Using Cognates to Align Sentences in Bilingual Corpora

 

Aiguo Zhu: Prepositional Phrase Attachment through a Backed-Off Model

 

 

 Prepositional Phrase Attachment through a Backed-Off Model

 

Disambiguation of English PP Attachment using Multilingual Aligned Data

 

Parse Visualization Tools

 

RASP parser

 

- Resources of Text Categorization

- Standard test collections