CS175: Project in AI

Spring Quarter, 2010

Potential Project Ideas:

Here are some potential projects -- we encourage students to select from this list:

Text analysis
Probabilistic modeling of music
Automated analysis of computer code
Yahoo's "Learning To Rank" challenge
Recommender systems (collaborative filtering)
Social network analysis
Time series analysis and anomaly detection
EEG data analysis
Plagiarism and vandalism detection
Sports statistics and prediction
KDD Cup

Note that these projects are speculative and open-ended in nature -- it is up to each student to carve out an interesting task and to discover appropriate machine learning algorithms for solving that task. Students may also propose machine learning projects that are not on this list. For examples of machine learning projects carried out by undergrads at Stanford University, see here.

Text analysis:

The automated semantic analysis of text data, such as news articles, research papers, blog posts, and web pages, is of great interest to academia and industry (e.g. Google, Microsoft, Yahoo). One way to analyze text corpora in an unsupervised way is to perform "topic modeling" in order to discover meaningful semantic topics. See a simple demo here and a simple browser based on research papers here.

Topic modeling on text documents is relatively well established -- see Prof. Smyth's page for resources on topic modeling. One possible project is to try to improve the quality of topics learned on text documents. Currently, topic modeling algorithms only make use of document boundaries in order to learn topics. One straightforward way to improve the quality of topics is to use section and paragraph boundaries in the text. Another interesting possibility is to distinguish the words in each document based on part-of-speech (verb, noun, etc.) and to learn part-of-speech topics, such as "verb topics" (e.g. see Boyd-Graber and Blei, 2009). When coupled with the other idea above, this could potentially provide more semantic meaning within the documents. For instance, if a verb topic has high probability words "fight struggle run threaten strike" and if this topic is associated with a particular section in a document, we can infer that perhaps a fight was described in that section of the document.
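
As a minimal sketch of the paragraph-boundary idea, one could split each document into paragraph-level bags of words before handing them to a topic model. The function name `paragraph_bags`, the blank-line paragraph rule, and the sample text are assumptions made for illustration only; real corpora may mark sections and paragraphs differently.

```python
import re
from collections import Counter

def paragraph_bags(doc_id, text):
    """Split text on blank lines and return one bag of words per paragraph.

    Each bag is keyed by (doc_id, paragraph_index) so that a topic found in
    a paragraph can be traced back to its location in the document.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        ((doc_id, i), Counter(re.findall(r"[a-z']+", p.lower())))
        for i, p in enumerate(paragraphs)
    ]

# Hypothetical two-paragraph document, made up for illustration.
text = "The rebels fight and struggle.\n\nTalks resumed and leaders signed a treaty."
bags = paragraph_bags("doc1", text)
for key, bag in bags:
    print(key, dict(bag))
```

Each bag can then be treated as its own small "document" by whatever topic modeling code the team chooses, which is one simple way to let the model see boundaries finer than the whole document.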

An even more ambitious project would be to use these part-of-speech topics in order to discover snippets of information, or more formally, logical assertions in first-order logic. For instance, perhaps we can use the "fight" topic above to retrieve all pairs of people who have been involved in a fight (as described in the text corpus), and we can also point to the exact document (and exact section) for each fight. Based on the text corpus, other interesting snippets of information can perhaps be discovered as well (e.g. who is friends with whom?).

For other ideas relating to topic modeling, see the recent NIPS workshop on topic modeling.

Probabilistic modeling of music:

It would be interesting to try to model music in a probabilistic way. Perhaps one can apply topic modeling (see project above) to chord transitions or the musical notes themselves. See work by Matt Hoffman and Diane Hu for the current state-of-the-art in applying topic modeling to music. Data sets such as MIDI files for various songs should be freely available online. Using topic modeling, one can perhaps cluster together songs which are "musically" similar. Also, it would be great if the musical "topics" themselves were interpretable -- maybe one will be able to discover "melancholy" topics (perhaps minor chords), "nostalgic" topics (perhaps major7 chords), and "upbeat" topics (perhaps standard major chords).

A probabilistic generative model is also useful in that it may be able to automatically generate novel chord progressions or melodies. Perhaps Markov models can be used to model and generate music. See Curtis Roads' textbook, Ch 19 for more information (we have a copy of this text should any team need it). Also see a Microsoft Research project that learns chords to accompany a melody. Finally, the International Society for Music Information Retrieval organizes conferences which feature research papers on the subject.
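
To make the Markov model idea concrete, here is a minimal sketch of a first-order Markov chain over chord symbols: it counts chord-to-chord transitions, normalizes them into probabilities, and samples a new progression. The function names and the three toy progressions are hypothetical and chosen only for illustration; a real project would extract chord sequences from MIDI or similar data.

```python
import random
from collections import defaultdict

def fit_markov(sequences):
    """Count chord-to-chord transitions and normalize them into probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, nxts in counts.items():
        total = sum(nxts.values())
        model[prev] = {chord: n / total for chord, n in nxts.items()}
    return model

def generate(model, start, length, rng):
    """Sample up to `length` chords, stopping early at a chord with no known successor."""
    chords = [start]
    while len(chords) < length and chords[-1] in model:
        options = model[chords[-1]]
        nxt = rng.choices(list(options), weights=list(options.values()))[0]
        chords.append(nxt)
    return chords

# Hypothetical toy progressions, made up for illustration.
songs = [
    ["C", "G", "Am", "F", "C"],
    ["C", "Am", "F", "G", "C"],
    ["F", "G", "C", "Am", "F"],
]
model = fit_markov(songs)
progression = generate(model, "C", 8, random.Random(0))
print(progression)
```

A first-order chain only remembers the previous chord, so longer-range structure (phrases, repeats) would need a higher-order or richer model.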

Automated analysis of computer code:

It would be interesting to automatically classify and cluster code files generated during software development. Topic modeling can also be used to automatically analyze code (see Erik Linstead's work). Perhaps other machine learning techniques can be used as well.

Yahoo's "Learning To Rank" challenge:

This is an information retrieval ranking competition sponsored by Yahoo, with $15,000 in prize money. In document retrieval (e.g. web search), a user specifies a "query", and the system (e.g. Yahoo) returns search results. It is of great importance to Yahoo and other companies to return results which are of high relevance to the user.

This competition provides a "training set" consisting of query-URL pairs, along with manually procured relevance scores (with 0 being irrelevant and 4 being perfectly relevant). Each of these pairs also has up to 700 potentially informative "features". Note that the queries, URLs, and features themselves are anonymized. Given a "test set" which gives query-URL pairs but not relevance scores, the goal is to predict what the relevance score will be (from 0 to 4) for each test pair.

Note that there are many different ranking algorithms available in the machine learning literature. The purpose of this project is to try to implement several different algorithms and see which algorithm (or combination of algorithms) does best. See the official web site for more details about this competition.
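
As a starting point before trying more sophisticated ranking algorithms, the sketch below shows one very simple "pointwise" baseline: predict each test pair's relevance as the average label of its nearest training pairs, then sort the URLs of each query by predicted score. This is not the competition's reference method; the function names, the two-dimensional features, and the toy data are all assumptions for illustration (the real data has up to 700 features per pair).

```python
from collections import defaultdict

def knn_predict(train, features, k=3):
    """Predict relevance as the mean label of the k nearest training rows."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda row: sq_dist(row[0], features))[:k]
    return sum(label for _, label in nearest) / len(nearest)

def rank_by_query(train, test, k=3):
    """Return {query_id: [url_id, ...]} ordered by predicted relevance, best first."""
    scored = defaultdict(list)
    for query_id, url_id, features in test:
        scored[query_id].append((knn_predict(train, features, k), url_id))
    return {
        q: [url for _, url in sorted(items, key=lambda item: -item[0])]
        for q, items in scored.items()
    }

# Hypothetical training rows: (feature vector, relevance label from 0 to 4).
train = [
    ([0.9, 0.8], 4), ([0.8, 0.9], 3), ([0.5, 0.5], 2),
    ([0.2, 0.1], 1), ([0.1, 0.2], 0), ([0.0, 0.1], 0),
]
# Hypothetical test rows: (query id, URL id, feature vector), labels unknown.
test = [
    ("q1", "u1", [0.1, 0.1]), ("q1", "u2", [0.85, 0.85]),
    ("q2", "u3", [0.5, 0.4]), ("q2", "u4", [0.15, 0.2]),
]
rankings = rank_by_query(train, test)
print(rankings)
```

Any regression or classification model could be swapped in for `knn_predict`; comparing such models against each other is exactly the kind of experiment this project calls for.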

Recommender systems (collaborative filtering):

Collaborative filtering is the task of finding patterns in the data in order to make reasonable predictions about user preferences. The Netflix Prize challenge is one good example of a collaborative filtering task. In that competition, the data set consists of a matrix of 480,000 users by 17,000 movies, and each entry [u,v] in this matrix contains user u's rating (from 1 to 5) for movie v. This matrix is very sparse (i.e. most entries in the matrix are missing), and the goal of the competition is to try to predict the missing ratings. If one can accurately predict how much a user would like a particular movie, then one can produce an improved recommender system (e.g. for recommending movies). This collaborative filtering setup applies to many different domains (e.g. product recommendations from Amazon, Walmart, etc.).
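
To illustrate the prediction task on a sparse ratings matrix, here is a minimal baseline sketch that predicts a missing entry [u,v] as the global mean rating plus an offset for user u and an offset for movie v. It is a common simple baseline, not the method used by any particular Netflix Prize team; the function names and the six toy ratings are hypothetical.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """Fit a global mean plus per-user and per-movie average offsets.

    `ratings` is a list of (user, movie, rating) triples, i.e. only the
    observed entries of the sparse matrix.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    user_res = defaultdict(list)
    for u, _, r in ratings:
        user_res[u].append(r - mu)
    user_bias = {u: sum(d) / len(d) for u, d in user_res.items()}
    item_res = defaultdict(list)
    for u, v, r in ratings:
        item_res[v].append(r - mu - user_bias[u])
    item_bias = {v: sum(d) / len(d) for v, d in item_res.items()}
    return mu, user_bias, item_bias

def predict(model, u, v):
    """Predict a rating, clipped to the 1-5 scale; unseen users or movies get no offset."""
    mu, user_bias, item_bias = model
    raw = mu + user_bias.get(u, 0.0) + item_bias.get(v, 0.0)
    return min(5.0, max(1.0, raw))

# Hypothetical observed ratings, made up for illustration.
ratings = [
    ("ann", "m1", 5), ("ann", "m2", 3),
    ("bob", "m1", 4), ("bob", "m3", 2),
    ("cat", "m2", 1), ("cat", "m3", 3),
]
model = fit_baseline(ratings)
print(predict(model, "ann", "m3"))  # a missing entry for a known user and movie
```

More capable methods (neighborhood models, matrix factorization) are usually evaluated by how much they improve on a baseline of this kind on held-out ratings.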

While the Netflix competition is already closed and the data is no longer available, a similar movie data set is available (MovieLens data), and one can perform a collaborative filtering project based on this data.

One can also perform a project on book ratings, using the Book-Crossing data set.

Social network analysis:

As online social networks grow in popularity, social network analysis is becoming increasingly important. One important task is link prediction -- e.g. can we predict which people will become friends? For this project, one can use collegiate Facebook data from five different universities. Or one can try to find other social network data online.

An important task is determining which features would be most informative when doing link prediction. For instance, perhaps the probability of a friendship link is 99% if two people are in the same dorm and in the same major. The most informative features may be complicated functions of the attributes. The task is to find these features. One can perform exploratory data analysis and use Matlab to quickly create plots in order to discover correlations within the data. One can also try to find these features in an automated way. Also, one can use techniques from collaborative filtering in order to perform link prediction.
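
The sketch below shows what a few candidate link-prediction features might look like for each pair of people who are not yet friends: the number of common friends, the Jaccard overlap of their friend sets, and whether they share a dorm or a major. The attribute names `dorm` and `major`, the function names, and the toy network are assumptions for illustration; the actual Facebook data may expose different fields.

```python
from itertools import combinations

def pair_features(u, v, friends, attributes):
    """Compute simple structural and attribute features for the pair (u, v)."""
    common = friends[u] & friends[v]
    union = friends[u] | friends[v]
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "same_dorm": attributes[u]["dorm"] == attributes[v]["dorm"],
        "same_major": attributes[u]["major"] == attributes[v]["major"],
    }

def candidate_pairs(friends, attributes):
    """List all unlinked pairs with their features, most common friends first."""
    rows = []
    for u, v in combinations(sorted(friends), 2):
        if v not in friends[u]:
            rows.append(((u, v), pair_features(u, v, friends, attributes)))
    return sorted(rows, key=lambda row: -row[1]["common_neighbors"])

# Hypothetical toy network, made up for illustration.
friends = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "cy", "dee"},
    "cy": {"ana", "ben", "dee"},
    "dee": {"ben", "cy"},
    "eli": set(),
}
attributes = {
    "ana": {"dorm": "A", "major": "CS"},
    "ben": {"dorm": "A", "major": "Math"},
    "cy": {"dorm": "B", "major": "CS"},
    "dee": {"dorm": "A", "major": "Math"},
    "eli": {"dorm": "B", "major": "CS"},
}
candidates = candidate_pairs(friends, attributes)
print(candidates[0])
```

Feature vectors like these could be fed to any classifier trained on links that formed later, which is one way to test which features are actually informative.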

Time series analysis and anomaly detection:

Many data sets, such as financial data and other event data, have a temporal component to them. In particular, traffic data for the freeway ramps in Orange County is available on this web page -- that page also suggests several potential project ideas.
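
As one simple illustration of anomaly detection on a time series, the sketch below flags any value that lies far from the mean of the values immediately preceding it, measured in standard deviations. The function name, the window size, the threshold, and the toy counts are assumptions for illustration and are not taken from the traffic data page.

```python
from statistics import mean, stdev

def rolling_zscore_anomalies(series, window=5, threshold=3.0):
    """Return indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` values."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Hypothetical vehicle counts per time slot, with one injected spike.
counts = [20, 22, 21, 19, 20, 21, 95, 20, 22, 21, 19, 20]
anomalies = rolling_zscore_anomalies(counts)
print(anomalies)
```

Note that once a spike enters the history window it inflates the standard deviation, so later unusual values may be missed; robust statistics or a probabilistic model of normal behavior are natural next steps.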

EEG data analysis:

Electroencephalograph (EEG) data can be captured by brain-computer interfaces (BCIs), which monitor electrical signals produced by neuron firings in one's brain. There is increasing interest in using machine learning techniques to classify EEG data. (Note that fMRI data is a similar type of data.) For instance, EEG data can be used to help those with neuromuscular impairments to control prosthetic devices. Furthermore, EEG can potentially be used to translate mental signals into words and thus perform synthetic telepathy.

A potential project is to apply machine learning algorithms to EEG data already available online (e.g. see this competition which seeks to classify EEG data into two different classes). Furthermore, there may be a slight possibility that we will be able to obtain a functional BCI headset to perform even more advanced experiments.

Plagiarism and vandalism detection:

An interesting potential project is to use machine learning algorithms to perform plagiarism or vandalism detection. See the plagiarism competition and the vandalism competition for more details.

Sports statistics and prediction:

The domain of professional sports (e.g. NFL, NBA, NHL, NCAA, etc.) is an interesting area in which to apply machine learning techniques. Specifically, classification and regression techniques can be used to predict many different quantities. For example, we might be interested in answering the following questions:

How many points will LeBron James score in the upcoming game against the Lakers, given that Ron Artest is healthy and will guard him?
Given data of previous games in the season and the fact that the Patriots will have home field advantage, will the Raiders beat the Patriots, and by what margin?
Based on historical data and regular season data, how accurately can we predict the NCAA basketball tournament brackets?
Based on college statistics, can we predict whether a draft pick will be a star player, a role player, or a bust?

In this project, teams will be expected to find sports statistics online -- perhaps there will be a need to write automated scripts (e.g. in Perl) to scrape statistics from various online databases. Given real-world data, relevant prediction tasks can be proposed.

As an example, here is one research paper that uses statistics and machine learning to predict winners of March Madness. Also, the MIT Sloan Sports Analytics Conference may be of interest.

Also note an Australian Football League competition being run by Monash University in Australia.

KDD Cup:

Each year, the ACM Special Interest Group on Knowledge Discovery and Data Mining sponsors a competition known as the KDD Cup. Students are welcome to base their projects around one of these past competitions.

Update: The 2010 KDD Cup has just been announced!