Examples of Data Sets for Text Analysis

CS 175, Winter 2017

The links below point to just a few of the many data sets for text analysis that you can find on the Web, and should help you find data sets to work on for your projects.

Data Sets with Classification Labels or Ratings
Yelp Data Set Challenge (2.2M reviews of businesses from over 500k users in 10 cities)
    (and here's a pointer to work from our own group at UCI that recently won the Round 5 Challenge)
Kaggle Data Sets. Contains multiple data sets with text content. Kaggle is a company that hosts data mining/prediction competitions.
Movie review data for sentiment analysis, from Pang and Lee, Cornell
Product review data from Johns Hopkins University (goal is to predict ratings on a scale of 1 to 5)
A variety of different text data sets from the UCI Machine Learning Repository (many already in the "bag of words" format)
Data Sets on "learning to rank" (for Web search)
All of Wikipedia (can be used to build classifiers using category labels or to provide additional information for other models such as n-gram statistics)
Various text and Web-related data sets from Yahoo! Labs (note that these data sets can also be used for unsupervised learning, such as clustering or topic modeling, by ignoring the class labels during training).
Document classification data sets (a large collection of different data sets used in text classification research)
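
Several of the entries above mention the "bag of words" format. As a rough illustration of what that format means, here is a minimal sketch using only the Python standard library. The function name `bag_of_words`, the simple regular-expression tokenizer, and the two toy documents are assumptions made for this example; the data sets linked above may use different tokenization and file layouts.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep runs of letters/apostrophes, and count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Two toy documents (hypothetical, for illustration only).
docs = ["The movie was great, great fun.", "The plot was dull."]

# One Counter of word counts per document.
bags = [bag_of_words(d) for d in docs]

# Shared vocabulary: every distinct word across all documents, sorted.
vocab = sorted(set().union(*bags))

# Document-term matrix: one row per document, one column per vocabulary word.
# Counter returns 0 for words a document does not contain.
matrix = [[bag[w] for w in vocab] for bag in bags]

for row in matrix:
    print(row)
```

Each row of `matrix` can be fed to a classifier together with the class label, or used without the label for clustering or topic modeling.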

Other Interesting Text Data Sets (often used for Clustering and other Exploratory Methods)
Enron email data set, from CMU (note that there are other "cleaner" versions available on the Web if you search...)
Python code for downloading IMDB (Internet Movie Database), with 425k titles and 1.7 million filmographies of cast and crew
A survey of data sets available for building data-driven dialogue systems
Book Summaries Corpus
Full text of US patents from 1980 to 2015, from the USPTO (US Patent and Trademark Office), hosted by Google
Very large data set of all Reddit submissions between 2006 and 2015

Data Sets used to build Language Models and Auto-complete Algorithms
Ngram data from Peter Norvig (Google), with an accompanying tutorial book chapter
Google ngrams, and Google syntactic ngrams over time, from Google Books
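
To give a feel for how n-gram counts like these can drive auto-complete, here is a minimal sketch of a bigram next-word suggester. The helper names `train_bigrams` and `suggest` and the tiny toy corpus are assumptions for illustration; a real system would be trained on counts from a large corpus such as the ones listed above and would normally add smoothing.

```python
from collections import Counter, defaultdict

def train_bigrams(tokens):
    """Count, for each word, how often each other word follows it."""
    follow = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follow[prev][nxt] += 1
    return follow

def suggest(follow, word, k=3):
    """Return up to k of the most frequent words seen after `word`."""
    counts = follow.get(word, Counter())
    return [w for w, _ in counts.most_common(k)]

# Toy corpus (hypothetical); real n-gram data would be far larger.
tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigrams(tokens)

print(suggest(model, "the", k=1))
print(suggest(model, "unseen"))
```

Words that never appeared in the training text return an empty list of suggestions, which is one reason practical language models back off to shorter n-grams or apply smoothing.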

Question-Answering Data Sets
WikiQA, a data set for "open-domain" question answering, from Microsoft Research
Question-Answering Data Sets from TREC (funded by the National Institute of Standards and Technology, NIST)
Question Answering Corpus from DeepMind (part of Google)
The Allen AI Science Challenge on Kaggle (competition ended in 2016)
The BioASQ data sets and challenge competitions on question answering for the biomedical domain

Ontologies/Structured Data (useful for Information Extraction/Annotation)
The DBpedia Data Set