Project-Related Reference Material for CS 175

CS 175, Fall 2022
Below are links to suggested reading, organized by topic. If you are doing a project on any of these topics, or are considering one, these online resources should be helpful.

A very useful general source of information is the website paperswithcode.com, which provides an organized list of potential project topics with many links to relevant research papers and datasets.


Text Classification
Chapters on logistic regression for text, and on neural network classifiers and language models, from Jurafsky and Martin, 3rd ed., 2022
Chapters on text classification and naive Bayes, and on vector-based classification, from Manning et al, 2009
Comprehensive survey paper on text classification algorithms by Aggarwal and Zhai (2012)
Neural Network Methods for Natural Language Processing, Yoav Goldberg, 2017. Covers multiple aspects of neural networks for text analysis.
Overview of general principles in machine learning from Goodfellow et al (2016)
Tutorial paper on multi-label classification methods by de Carvalho and Freitas

Sentiment Analysis
Chapters on naive Bayes and sentiment classification, and on lexicon-based methods for sentiment analysis, from Jurafsky and Martin
Extensive tutorial materials on sentiment analysis by Christopher Potts, including detailed instructions on using word lexicons.
Survey paper on sentiment analysis by Pang and Lee (2008)
Text on sentiment analysis and opinion mining by Liu (2012)

Language Models
Chapters on n-gram language models from Jurafsky and Martin, 3rd ed., 2022

Sequential Models and Recurrent Neural Networks
Chapters on recurrent neural networks and encoder-decoder models from Jurafsky and Martin, 3rd ed., 2022
Chapter on recurrent and recursive neural networks from Goodfellow et al (2016)
Interesting blog post on recurrent neural networks by Andrej Karpathy (2015)

Chatbots
Chapter on dialog systems and chatbots from Jurafsky and Martin
Chatbot tutorial in PyTorch
Overview of the Microsoft Cortana dialog management system, Sarikaya et al, 2016

Vector Embeddings and Topic Models
Chapter on dense vector representations and embeddings for words from Jurafsky and Martin
Short text on topic models by Boyd-Graber, Hu, and Mimno (2017). Chapter 1 provides a brief introduction to topic modeling.
Overview paper on topic modeling by Dave Blei (2012), and his webpage on topic modeling
Chapter on latent semantic indexing from Manning et al

Automatic Speech Recognition (ASR)
Chapter on automatic speech recognition from Jurafsky and Martin
Python speech recognition library
Blog posts on using the Kaldi ASR system and on speech recognition in Python more generally

Text Summarization
Review paper from 2020 on recent approaches for automatic text summarization.
Another recent (2020) review paper on automatic text summarization.
Older but comprehensive survey of text summarization techniques by Nenkova and McKeown (2012)
Research paper on specialized techniques for summarization of short texts (such as reviews) from authors at Microsoft Research and collaborators (2016)

Natural Language Generation, Text Synthesis
Recent detailed survey (2020) covering many different approaches to text generation.
Survey of recent research in natural language generation methods, Gatt and Krahmer (2018)

Question Answering
Chapter on techniques for automated question-answering systems from Jurafsky and Martin
Wide variety of datasets and papers on question-answering systems
Paper on toy tasks for developing question-answering systems by Weston et al (2016)

Information Extraction
Chapter on information extraction from Jurafsky and Martin
Research paper on extracting information about different aspects of products from reviews by Zha et al (2014)
Research paper on extracting information from scientific articles

Document Clustering
Chapters on flat clustering algorithms and hierarchical clustering algorithms for text documents, from Manning et al
A technical report describing a systematic comparison of text document clustering techniques.