Examples of Data Sets for Text Analysis and NLP Projects

CS 175, Fall 2022

The links below point to a few of the many data sets for text analysis available on the Web, and should help you find data sets to work on for your projects. These are only examples of publicly available text datasets; please feel free to use other datasets that you find (or create) beyond those listed below.

Text Classification and Sentiment Analysis
Multiple text classification datasets from NLP-progress
Multiple sentiment analysis datasets from NLP-progress
Yelp Data Set Challenge (8 million reviews of businesses from over 1 million users across 10 cities)
Kaggle Data Sets with text content (Kaggle is a company that hosts machine learning competitions)
Labeled Twitter data sets from (1) the SemEval 2018 Competition and (2) the Sentiment140 project
Amazon Product Review Data from UCSD. This is a very large and rich data set with review text, ratings, votes, product metadata, etc. The full dataset is extremely large, so some of the smaller subsets provided may be better for class projects.
IMDB Movie Review Data with 50,000 movie reviews and binary sentiment labels
Well-known Movie review data for sentiment analysis, from Pang and Lee, Cornell
Product review data from Johns Hopkins University (goal is to predict ratings on a scale of 1 to 5)

Dialog/Conversation/Chatbots
A repository of large datasets for models of conversational response
A survey paper on data sets available for building data-driven dialogue systems
Amazon Topical Chat Dataset with accompanying research paper and blog post from Amazon.
ConvAI2 Competition Dataset
Multiple labeled dialog/chatbot datasets from NLP-progress
Cornell Movie-Dialogs Corpus
Transcripts from the TV series "The Office" (formatted for the R language)

Language Models and Auto-complete Algorithms
Language modeling datasets from NLP-progress
Ngram data from Peter Norvig (Google), with an accompanying tutorial book chapter
Google ngrams, and Google syntactic ngrams over time, from Google books
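
For readers new to n-gram data, the counts in resources like those above are tallies of contiguous token sequences. Below is a minimal sketch of computing such counts yourself, assuming simple whitespace tokenization; the function name `ngram_counts` and the sample sentence are hypothetical and are not taken from any of the linked resources.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens in a list of tokens."""
    # zip over n shifted copies of the token list to form the n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Toy input (an assumption for illustration); real projects would use a proper tokenizer
text = "the cat sat on the mat and the cat slept"
tokens = text.lower().split()

bigrams = ngram_counts(tokens, 2)
print(bigrams.most_common(3))
```

Counts like these are the raw material for a simple auto-complete model: given the previous word, suggest the next word with the highest bigram count.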

Question-Answering Datasets
Multiple question-answering datasets from NLP-progress
WikiQA, a data set for "open-domain" question answering, from Microsoft Research
Question-Answering Data Sets from TREC (funded by the National Institute of Standards and Technology, NIST)
Question Answering Corpus from DeepMind
The Allen AI Science Challenge on Kaggle (competition ended in 2016)

Summarization
Multiple summarization datasets from NLP-progress

Other Interesting Text Data Sets (could be used for multiple different types of projects)
Enron email data set, from CMU (note that there are other "cleaner" versions available on the Web if you search...)
CMU Movie Summary Corpus
Book Summaries Corpus
Full text of US patents from 1980 to 2015, from the USPTO (US Patent and Trademark Office), hosted by Google (could be used, for example, to detect trends and changes in patent language and concepts over time)
Very large data set of all Reddit submissions between 2006 and 2015
Data Sets on "learning to rank" (for Web search, from Microsoft Research)
All of Wikipedia (can be used to build classifiers using category labels or to provide additional information for other models such as n-gram statistics)
Various text and Web-related data sets from Yahoo! Labs (these data sets could be used for different tasks).
The DBpedia Data Set (an example of a large-scale ontology/knowledge-base)