Yelp Dataset Challenge

The Team
Hitesh Sajnani, Vaibhav Saini, Kusum Kumar, Eugenia Gabrielova, Pramit Choudary, Cristina Lopes

Classifying Yelp reviews into relevant categories

Yelp users rate and review businesses and services on Yelp. These reviews and ratings help other Yelp users evaluate a business or service and make a choice. While ratings are useful for conveying the overall experience, they do not convey the context that led a reviewer to that experience. For example, consider a Yelp review of a restaurant that has 4 stars:
"They have the best happy hours, the food is good, and service is even better. When it is winter we become regulars".

If we look at only the rating, it is difficult to guess why the user rated the restaurant as 4 stars. However, after reading the review, it is not difficult to identify that the review talks about good "food", "service" and "deals/discounts" (happy hours).

A quick inspection of a few hundred reviews helped us decide which categories occur frequently in the reviews. We found 5 categories: “Food”, “Service”, “Ambience”, “Deals/Discounts”, and “Worthiness”. The "Food" and "Service" categories are easy to interpret. The "Ambience" category relates to the décor and the look and feel of the place. The "Deals/Discounts" category corresponds to offers during happy hours or specials run by the venue. The “Worthiness” category can be summarized as value for money: users often express whether the overall experience was worth the money. It is important to note that the "Worthiness" category is different from the “Price” attribute already provided by Yelp. "Price" captures whether the venue is “inexpensive”, “expensive” or “very expensive”.

This high-level categorization of reviews into relevant categories can help users understand why the reviewer rated the restaurant as “high” or “low”. This information can help other Yelpers make a personalized choice, especially when they do not have much time to spend reading reviews. Moreover, such categorization can also be used to rank restaurants according to these categories.

We formulated the task of classifying a review into relevant categories as a learning problem. However, since a review can be associated with multiple categories at the same time, it is neither a simple binary classification nor a multi-class classification problem. It is rather a multi-label classification problem.
Here is a short video describing how (and why?) Yelp can build some cool features using this categorization:


The Yelp dataset released for the academic challenge contains information for 11,537 businesses. This dataset has 8,282 check-in sets, 43,873 users, and 229,907 reviews for these businesses. Since we are only interested in restaurant data, we considered only those businesses that are categorized as food or restaurants. This reduced the number of businesses to around 5,000.

We selected all the reviews for these restaurants that had at least one useful vote. From this pool of useful reviews, we randomly chose 10,000 reviews. A labeling codebook describing what categories to include was developed through an initial open coding of a random sample of 400 reviews. The codebook was validated and refined based on a second random sample of 200 reviews. This exercise helped us fix 5 categories: food, ambience, service, deals, and worthiness. Once we identified these 5 categories, the 10,000 reviews were divided into 5 bins, with some reviews repeated across bins. Six graduate student researchers from our group then read each of these reviews and annotated it with the identified categories. It took us approximately 225 person-hours to annotate all the reviews. We then identified conflicts in the annotations among different annotators and removed from the analysis all reviews on which the annotators disagreed. This left us with 9,019 reviews. We split these annotated reviews into 80% train and 20% test data.

The review annotation process was very challenging and time consuming. We believe that it is one of the major contributions of this work. We plan to release the annotated data for researchers to extend the work.
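The conflict removal and 80/20 split described above can be sketched roughly as follows. This is a minimal illustration, not the code we used: the `annotations` layout (review id mapped to one label set per annotator), the function names, and the fixed random seed are all assumptions made for the example.

```python
import random

def drop_conflicts(annotations):
    """Keep only reviews on which every annotator chose the same label set.

    annotations: {review_id: [set_of_categories_from_each_annotator, ...]}
    (hypothetical layout, assumed for this sketch).
    """
    return {rid: sets[0] for rid, sets in annotations.items()
            if all(s == sets[0] for s in sets)}

def train_test_split(items, train_fraction=0.8, seed=0):
    """Shuffle deterministically (seed is an assumption) and cut at train_fraction."""
    items = sorted(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

annotations = {
    "r1": [{"food"}, {"food"}],           # annotators agree: kept
    "r2": [{"food"}, {"service"}],        # annotators disagree: dropped
    "r3": [{"food", "deals"}],            # single annotator: kept
}
agreed = drop_conflicts(annotations)
train_ids, test_ids = train_test_split(agreed.keys())
```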

Feature Extraction and Normalization

We extracted two types of features: (i) star ratings and (ii) textual features consisting of unigrams, bigrams and trigrams.
For star ratings we created three binary features representing rating 1-2 stars, 3 stars, and 4-5 stars respectively.
To extract textual features, we first normalized the review text by converting it to lower case and removing special characters. We did not remove stop words, as they play an important role in understanding user sentiment. The cleaned text is then tokenized to collect unigrams (individual words) and calculate their frequencies across the entire corpus. This results in 54,121 unique unigrams. We condense this feature set by only considering unigrams with a frequency greater than 300, which results in 375 unigram features. Similarly, we extract 208 bigrams and 120 trigrams.
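The feature extraction described above could look roughly like the sketch below. It is an illustration under assumptions rather than our exact pipeline: the regular expression used for "special characters", the whitespace tokenizer, the use of per-review counts as feature values, and all function names are choices made for the example.

```python
import re
from collections import Counter

def normalize(text):
    """Lower-case the text and replace non-alphanumeric characters with spaces."""
    return re.sub(r"[^a-z0-9\s]", " ", text.lower())

def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(reviews, n, min_count):
    """Keep n-grams whose frequency across the corpus is greater than min_count."""
    counts = Counter()
    for text in reviews:
        counts.update(ngrams(normalize(text).split(), n))
    return sorted(g for g, c in counts.items() if c > min_count)

def star_features(stars):
    """Three binary features: 1-2 stars, 3 stars, 4-5 stars."""
    return [int(stars <= 2), int(stars == 3), int(stars >= 4)]

def featurize(text, stars, vocabulary, n):
    """Star-rating bins followed by the count of each vocabulary n-gram."""
    counts = Counter(ngrams(normalize(text).split(), n))
    return star_features(stars) + [counts[g] for g in vocabulary]

reviews = ["The food is good!", "Good food, good service."]
vocab = build_vocabulary(reviews, n=1, min_count=1)   # threshold was 300 on the real corpus
row = featurize(reviews[1], stars=4, vocabulary=vocab, n=1)
```

Calling `build_vocabulary` with `n=2` or `n=3` gives the bigram and trigram vocabularies in the same way.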

The ARFF files containing the features extracted for the train and test data can be downloaded from here: Train.arff and Test.arff

Here is a short video describing the corpus and the feature engineering


In this section, we describe the various approaches we took to build a classifier. We reason about our choices based on the advantages and disadvantages of each approach. You can also have a look at the video presentation to get a quick overall idea.

The problem of classifying a review into multiple categories is not a simple binary classification problem. Since a reviewer can talk about various things in his or her review, each review can be classified into multiple categories.

One of the most popular and perhaps the simplest ways to deal with multiple categories is to create a binary classifier for each category. So in our case, we create 5 binary classifiers, one each for the food, service, ambience, deals, and worthiness categories. In order to do this, we need to transform the dataset into 5 different datasets, where each dataset has information about only one category.

To understand this, consider a scaled-down version of our dataset which has only 4 categories (food, ambience, service, and deals). As shown in Figure 1, we create four different datasets from this original dataset such that each dataset is associated with only one specific category. For example, the new dataset created for the "food" category will have only one label. The label will be '1' for all the datapoints which had '1' for "food" in the original dataset and '0' otherwise. Similarly, the dataset created for "service" will have the label '1' for all the datapoints that had "service" as '1' in the original dataset. This is done for all the categories present in the dataset. Given a new review, the binary classifier for each category predicts whether the review belongs to that category. The final prediction is the union of the categories predicted by all the binary classifiers.

Figure 1.
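The transformation into one binary dataset per category, and the union of the per-category predictions, can be sketched as follows. The tiny 1-nearest-neighbour class is only a stand-in so that the example runs on its own; it is an assumption for illustration, and any binary learner with the same `fit`/`predict_one` shape (names chosen for this sketch) could be plugged in.

```python
CATEGORIES = ["food", "service", "ambience", "deals"]

def to_binary_datasets(X, Y, categories):
    """Return {category: (X, y)} where y[i] is 1 if the category is in Y[i], else 0."""
    return {c: (X, [int(c in labels) for labels in Y]) for c in categories}

def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

class NearestNeighbour:
    """Toy 1-NN binary classifier used as a placeholder learner."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict_one(self, x):
        best = min(range(len(self.X)), key=lambda i: squared_distance(self.X[i], x))
        return self.y[best]

def train_binary_relevance(X, Y, categories, make_classifier=NearestNeighbour):
    """Train one independent binary classifier per category."""
    datasets = to_binary_datasets(X, Y, categories)
    return {c: make_classifier().fit(Xc, yc) for c, (Xc, yc) in datasets.items()}

def predict_binary_relevance(models, x):
    """Union of all categories whose binary classifier predicts 1."""
    return {c for c, m in models.items() if m.predict_one(x) == 1}

X = [[1, 0], [0, 1], [1, 1]]                       # made-up feature vectors
Y = [{"food"}, {"service"}, {"food", "service"}]   # made-up label sets
models = train_binary_relevance(X, Y, CATEGORIES)
prediction = predict_binary_relevance(models, [0.9, 0.1])
```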

Once the dataset is transformed into 4 different datasets, any binary classifier, such as nearest neighbour, SVM, or decision trees, can be used with this approach. This approach is simple and treats each category independently. As a consequence, however, it ignores correlation among categories. This independence assumption may not hold, especially if the categories share some aspects with each other. For example, in our case, when people get deals they mostly feel the restaurant is worth visiting. This means that the "deals" and "worthiness" categories are correlated. Similarly, we saw correlation between the "service" and "ambience" categories. The correlation may be high in some cases and very low in others. In our case, we thought it was worth accounting for.

In order to account for correlation among categories, we considered each distinct subset of L as a single category. Here L is the set of all the categories, a.k.a. targets, so L = {Food, Service, Ambience, Deals}. For example, as shown in Figure 2, we transform the target for Review 1, {Food, Deals}, into a single target with value "1001". Similarly, the target for Review 2, {Ambience, Deals}, is transformed into "0011". Here the pattern is formed by creating a vector which has a fixed index for each category. We then learn a multi-class classifier h : X --> P(L), where X is the review and P(L) is the powerset of L, containing all possible category subsets. This approach takes into account correlations between categories, but it suffers from the large number of category subsets. E.g., if we have 5 categories, this approach will generate 2^5 = 32 possible targets to predict, most of which will have only a few datapoints to learn from. This approach might work well if there is a large training dataset which covers all (or at least most of) the possible targets to predict.

Figure 2.
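The encoding of a category subset as a single multi-class target can be sketched as below. The function names are chosen for this example, and the category order (Food, Service, Ambience, Deals) is the one implied by the "1001" and "0011" patterns above.

```python
from collections import Counter

CATEGORIES = ["Food", "Service", "Ambience", "Deals"]

def encode(labels, categories=CATEGORIES):
    """Turn a set of categories into a bit string with one fixed index per category."""
    return "".join("1" if c in labels else "0" for c in categories)

def decode(target, categories=CATEGORIES):
    """Turn a bit-string target back into the set of categories it stands for."""
    return {c for c, bit in zip(categories, target) if bit == "1"}

def num_possible_targets(num_categories):
    """Size of the powerset: every category is either present or absent."""
    return 2 ** num_categories

review_labels = [{"Food", "Deals"}, {"Ambience", "Deals"}, {"Food", "Deals"}]
targets = [encode(labels) for labels in review_labels]

# Counting how many datapoints each target has shows how sparse the classes can get.
target_counts = Counter(targets)
```

A multi-class classifier is then trained on `targets`, and its predicted bit string is mapped back to categories with `decode`.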

We wanted to get the best of both worlds, i.e., to consider correlation among categories and at the same time not be hurt by the large number of subsets generated by the previous approach. Hence, we decided to use an ensemble of classifiers where each classifier is trained using a different small subset (of size k) of the categories. For example, let's say there are 4 categories {Food, Service, Ambience, Deals} and we choose subset size k = 2. We then build a total of 4C2 = 6 classifiers for the following combinations of categories: {(Food, Service), (Food, Ambience), (Food, Deals), (Service, Ambience), (Service, Deals), (Ambience, Deals)}. See the first part of Figure 3. For prediction, as shown in the second part of Figure 3, we consider the predictions of all six classifiers and then take a majority vote.
This approach considers correlations among categories and at the same time does not generate a very large number of targets, because each classifier considers only a small subset of categories.

Figure 3.
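The subset enumeration and the voting step can be sketched as follows. Each per-subset classifier is assumed to be trained with the powerset approach over its own k categories; here its output is simply given as the set of categories it predicts. The rule that a category needs strictly more than half of the votes from the classifiers that cover it is an assumption of this sketch, as are the function names.

```python
from itertools import combinations

def label_subsets(categories, k):
    """All size-k combinations of categories; one classifier is trained per subset."""
    return list(combinations(categories, k))

def majority_vote(subset_predictions, categories):
    """Combine per-subset predictions into one final set of categories.

    subset_predictions: {subset_tuple: set of categories that subset's classifier
    predicted as present}. A category is kept if more than half of the classifiers
    covering it voted for it (tie handling is an assumption).
    """
    predicted = set()
    for c in categories:
        votes = [c in pred for subset, pred in subset_predictions.items() if c in subset]
        if votes and sum(votes) > len(votes) / 2:
            predicted.add(c)
    return predicted

categories = ["Food", "Service", "Ambience", "Deals"]
subsets = label_subsets(categories, k=2)   # 4C2 = 6 subsets

# Made-up outputs of the six classifiers for one review.
subset_predictions = {
    ("Food", "Service"): {"Food"},
    ("Food", "Ambience"): {"Food"},
    ("Food", "Deals"): {"Deals"},
    ("Service", "Ambience"): set(),
    ("Service", "Deals"): {"Deals"},
    ("Ambience", "Deals"): {"Deals"},
}
final = majority_vote(subset_predictions, categories)
```

In this made-up example "Food" receives two of its three votes and "Deals" all three, so both are predicted, while "Service" and "Ambience" receive none.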


Evaluation metrics

We use Precision and Recall to measure the performance of a classifier.
To understand what precision and recall mean in our context, consider (x, Y) to be a datapoint where x is the review text and Y is the set of true categories. Y ⊆ L, where L = {Food, Service, Ambience, Deals, Worthiness}.
Let h be a classifier.
Let Z = h(x) be the set of categories predicted by h for the datapoint (x, Y). Then,
Precision = |Y ∩ Z|/|Z| (Out of the categories predicted, how many of them are true categories)
Recall = |Y ∩ Z|/|Y| (Out of the total true categories, how many of them were predicted)
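The two definitions above translate directly into set operations. In this sketch, returning 0.0 when Z (or Y) is empty and averaging the per-review scores over the test set are assumptions; the definitions above do not specify either.

```python
def precision(Y, Z):
    """|Y ∩ Z| / |Z|: share of predicted categories that are true categories."""
    return len(Y & Z) / len(Z) if Z else 0.0

def recall(Y, Z):
    """|Y ∩ Z| / |Y|: share of true categories that were predicted."""
    return len(Y & Z) / len(Y) if Y else 0.0

def mean_scores(pairs):
    """Average per-review precision and recall over (Y, Z) pairs (assumed aggregation)."""
    pairs = list(pairs)
    p = sum(precision(Y, Z) for Y, Z in pairs) / len(pairs)
    r = sum(recall(Y, Z) for Y, Z in pairs) / len(pairs)
    return p, r

Y = {"Food", "Service", "Deals"}   # true categories of one review
Z = {"Food", "Ambience"}           # categories predicted by h
p, r = precision(Y, Z), recall(Y, Z)
```

Here one of the two predicted categories is correct and one of the three true categories is recovered.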


We experimented with all three approaches discussed above, using Precision and Recall as our evaluation metrics. The comprehensive set of experiment configurations (different approaches, classifiers, feature sets, and parameter settings) can be found in this result sheet.

In the first approach of using L binary classifiers, where L is the total number of categories, we used Naive Bayes, k-Nearest Neighbour (k-NN), Support Vector Machines (SMO implementation), decision trees, and Neural Networks. In the figure below we report results only for Naive Bayes and k-NN for this approach, as only they were competitive.
In the second approach, where we consider label correlations and predict the powerset of labels, decision trees performed the best. We also experimented with the ensemble-of-classifiers approach using decision trees, which gave us the best results overall.


Yelp reviews and ratings are an important source of information for making informed decisions about a venue. We conjecture that further classification of Yelp reviews into relevant categories can help users make an informed decision based on their personal preferences for categories. Moreover, this is especially useful when users do not have time to read many reviews to infer the popularity of venues across these categories. In this paper, we demonstrated how reviews for restaurants can be automatically classified into five relevant categories with precision and recall of 0.72 and 0.71 respectively. We found that an ensemble of two multi-label classification techniques (Binary Relevance and Label Powerset) performed better than either technique individually. Moreover, there is no significant difference in performance when using a combination of unigrams, bigrams and trigrams instead of only unigrams. We also showed how the results of this study can be incorporated into Yelp’s existing website.

Technical Report

Multilabel Classification of reviews in Yelp data

Download Presentation

Multilabel Classification of reviews in Yelp data
