Guidelines for Class Projects, ICS 278, Data Mining

Spring 2006

 

General Guidelines

Each student will do an individual project. There are typically a few different types of projects that students do:

1.      Data-Focused: Analyze a specific data set in detail, using whatever data mining tools and software are available to you (e.g., there is a lot of free software on the Web for research in data mining and data analysis). The emphasis here is on the data, and you need not implement your own algorithms. Some possible data sets are listed at the bottom of this Web page. An example of this type of project would be to analyze the Enron email data set.

2.      Algorithm Development: Develop a new data mining algorithm tailored to one of the project data sets. This requires that you understand the data in fair detail, though you need not explore every aspect of it. Typically, you see an opportunity to look for some interesting information in the data, but you know of no existing algorithm that does this, so you implement your own. You will also need to compare against at least one baseline technique to validate your approach. Examples of such projects are the change detection problem using vegetation image data and the freeway traffic prediction problem, both described below.

3.      Empirical Study: Take some well-known published data mining algorithms and evaluate their performance systematically and in detail on a number of different data sets. An example below is a systematic comparison of well-known state-of-the-art classifiers on multiple data sets.


Reporting Requirements

Your grade for the project will be based on interim reports during the quarter that describe your progress so far, a comprehensive final project report, and a short class presentation of your work. More details on the specific requirements for each report will be made available as the quarter progresses. The deadlines are:

Data Sets, Algorithms, and Task Selection: due Thursday April 20th. Required: clear description of the task, assessment of what will be required to complete the task (outline of algorithms needed, whether you will implement them yourself or use existing software), initial brief literature review, brief plan of how you will evaluate the results. 

Midterm progress report 1: due Thursday May 4th. Required: results of exploratory data analysis, results of data checking for anomalies and obvious errors, results of in-depth literature review, revision (if needed) of any aspects of the plan at this point.

Midterm progress report 2: due Thursday May 18th. Required: preliminary results and assessment.

Student project presentations: June 6th and 8th in class

Final project report: due Monday June 12th.

   

Possible ideas for Projects

Below are some possible ideas for project topics. These are only suggestions. Please feel free to come up with your own ideas for projects.

Empirical comparison of "state-of-the-art" classifiers: relatively few published studies systematically compare current competing "state-of-the-art" classification methods (such as logistic regression, boosting, SVMs, and decision trees). Your project would be to develop (or obtain) implementations of the algorithms in code, decide on a set of benchmark data sets to use, and then conduct a series of experiments that compare the different classifiers across the different data sets. The "null hypothesis" would be that they all perform similarly, and you would test for systematic differences in performance. You would need to become knowledgeable in the details of how each classifier operates.
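One possible way to test the "null hypothesis" for a pair of classifiers is a sign-flip permutation test on their per-data-set accuracies. The sketch below is only an illustration, not a required method: the function name, the accuracy values, and the choice of test are all assumptions made for the example.

```python
import random

def paired_permutation_test(acc_a, acc_b, n_rounds=10000, seed=0):
    """Two-sided sign-flip permutation test on paired per-data-set accuracies.

    Returns an estimated p-value for the null hypothesis that the two
    classifiers perform the same on average.
    """
    if len(acc_a) != len(acc_b):
        raise ValueError("need one accuracy per data set for each classifier")
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_rounds):
        # Under the null, the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        # Small tolerance guards against floating-point round-off in ties.
        if abs(sum(flipped) / len(flipped)) >= observed - 1e-12:
            extreme += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_rounds + 1)

# Hypothetical accuracies of two classifiers on six benchmark data sets.
svm_acc = [0.91, 0.84, 0.77, 0.95, 0.88, 0.69]
tree_acc = [0.89, 0.80, 0.75, 0.93, 0.85, 0.66]
p_value = paired_permutation_test(svm_acc, tree_acc)
print(f"estimated p-value: {p_value:.4f}")
```

With more than two classifiers you would need to account for multiple comparisons; a test designed for several methods at once may be a better fit in that case.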

Develop an algorithm for summarizing and visualizing how nodes are related in large networks: using the CiteSeer coauthor network as an example, or any other large network (e.g., order of 50,000 or more nodes), your goal is to develop an algorithm that allows a user to select any 2 (or more) nodes and in "real-time" (or close to it) compute and display information (e.g., in the form of a subgraph) that summarizes how the selected nodes are related. Obvious ideas such as shortest paths are relevant, but you will also need to explore more informative ways of characterizing how nodes are related, for example: see the papers by Mark Newman on "betweenness" calculations, or the paper on relative importance in networks (White and Smyth, 2003).
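As a starting point for the "obvious" shortest-path idea mentioned above, here is a minimal breadth-first search sketch. The toy co-author graph and the function name are hypothetical; a real project would load the network from the data set and go well beyond a single path.

```python
from collections import deque

def shortest_path(graph, source, target):
    """Return one shortest path from source to target, or None if disconnected."""
    if source == target:
        return [source]
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parent:
                continue  # already reached by a path that is at least as short
            parent[neighbor] = node
            if neighbor == target:
                # Walk the parent pointers back to the source.
                path = [target]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None

# Hypothetical toy co-author network (undirected adjacency lists).
coauthors = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
    "F": [],
}
path = shortest_path(coauthors, "A", "E")
print(path)
```

Breadth-first search runs in time linear in the number of nodes and edges, so a single query on a network of 50,000 nodes should be feasible interactively. Note that it returns only one of possibly many shortest paths, which is part of why more informative summaries are worth exploring.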

Learning how networks evolve over time: again for large networks, but now where the edges have time-stamps on them, e.g., the year a paper was written for a co-author network, or the time (down to the second) when person A sent an email to a set of people, or the day on which a hyperlink was added to a Web page. The idea here is to try to learn models for how these networks evolve over time: are there clusters that new nodes get "drawn" into? Are there nodes that move around the network making new contacts and losing old ones? And so on. You could evaluate your algorithm by seeing how well it predicts future properties of the network (unseen by the model) given the historical data. Possible papers of relevance here include Barabasi and Albert, Science, 1999 (see also later papers that cite this paper) and Leskovec, Kleinberg, and Faloutsos (ACM SIGKDD, 2005).
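The evaluation idea above (predict the unseen future from the historical data) can be sketched as a time-based split followed by a simple link-prediction score. The example below ranks candidate new edges by the product of node degrees, a heuristic loosely motivated by preferential attachment. The edge list, function names, and scoring rule are assumptions for illustration only, not a recommended model.

```python
from collections import Counter
from itertools import combinations

def split_by_time(edges, cutoff):
    """Split (u, v, time) edges into those at or before the cutoff and those after."""
    past = [(u, v) for u, v, t in edges if t <= cutoff]
    future = [(u, v) for u, v, t in edges if t > cutoff]
    return past, future

def rank_candidate_edges(past):
    """Rank node pairs not yet linked by the product of their degrees."""
    degree = Counter()
    existing = set()
    for u, v in past:
        degree[u] += 1
        degree[v] += 1
        existing.add(frozenset((u, v)))
    # Enumerating all pairs is quadratic; fine for a toy graph only.
    candidates = [pair for pair in combinations(sorted(degree), 2)
                  if frozenset(pair) not in existing]
    return sorted(candidates, key=lambda p: degree[p[0]] * degree[p[1]], reverse=True)

def precision_at_n(ranking, future, n):
    """Fraction of the top-n predicted edges that actually appear later."""
    truth = {frozenset(edge) for edge in future}
    top = ranking[:n]
    return sum(frozenset(p) in truth for p in top) / len(top) if top else 0.0

# Hypothetical time-stamped co-author edges: (author, author, year).
edges = [("a", "b", 2000), ("d", "e", 2000), ("a", "c", 2001), ("a", "d", 2001),
         ("b", "c", 2002), ("b", "d", 2003), ("c", "d", 2004)]
past, future = split_by_time(edges, cutoff=2002)
ranking = rank_candidate_edges(past)
print(precision_at_n(ranking, future, 2))
```

This sketch cannot predict edges that involve nodes first seen after the cutoff, and enumerating all node pairs will not scale to large networks, so a real project would need to restrict the candidate set.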

Investigate the Enron email corpus (or another text corpus) using statistical topic models: The topic modeling algorithm is available in MATLAB from Professor Mark Steyvers' Web page. You could use the topic modeling algorithm to automatically assign topics to each of the emails in the Enron corpus. You could then look at (for example) how topics evolve or change over time: do some topics persist through the whole time period? Do some topics come and go? Do some topics "recur" frequently? You could also look at "network effects" of different topics: do some topics tend to involve more people in the email discussion? Do some topics typically have longer threads? There are numerous other aspects of this data that would be worth investigating.
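Once each email has a topic assignment, questions such as "do some topics persist through the whole time period?" reduce to simple counting. The sketch below assumes, as a simplification, that each email has been given a single dominant topic and a month label; the data, topic names, and threshold are made up for illustration.

```python
from collections import Counter, defaultdict

def topic_shares_by_period(assignments):
    """Map each period to {topic: fraction of that period's emails}."""
    counts = defaultdict(Counter)
    for period, topic in assignments:
        counts[period][topic] += 1
    shares = {}
    for period, topic_counts in counts.items():
        total = sum(topic_counts.values())
        shares[period] = {t: c / total for t, c in topic_counts.items()}
    return shares

def persistent_topics(shares, min_share=0.1):
    """Topics whose share is at least min_share in every period."""
    periods = list(shares.values())
    topics = set().union(*periods) if periods else set()
    return {t for t in topics if all(p.get(t, 0.0) >= min_share for p in periods)}

# Hypothetical (month, dominant topic) pairs, one per email.
emails = [
    ("2001-01", "trading"), ("2001-01", "trading"), ("2001-01", "legal"),
    ("2001-02", "trading"), ("2001-02", "sports"),
    ("2001-03", "trading"), ("2001-03", "legal"),
    ("2001-03", "legal"), ("2001-03", "legal"),
]
shares = topic_shares_by_period(emails)
print(persistent_topics(shares))
```

A topic model typically gives each document a mixture of topics rather than one label, so a fuller analysis might sum the per-email topic weights within each period instead of counting dominant topics.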

Detecting changes in global vegetation images over 20 years: Use the global 20-year vegetation data set (the NDVI data) to develop an algorithm that can detect how "patches" of vegetation change spatially over the 20-year period. For example, if you look at the global images over a 1-year period, you can easily see the seasonal changes in vegetation in different parts of the planet. You could develop image segmentation algorithms to calculate the boundaries of vegetation regions in different continents (e.g., grasslands in Africa), track these boundaries over time, and then detect changes across years (sudden differences in size, shape, or location, or gradual trends). Potentially relevant papers include Zhang XY, Friedl MA, Schaaf CB, et al., Monitoring vegetation phenology using MODIS, Remote Sensing of Environment, 2003, and Zhang, Friedl, et al., Climate controls on vegetation phenological patterns in northern mid- and high latitudes inferred from MODIS data, Global Change Biology, 2004. Contact Lucas Scharenbroich, lscharen@uci.edu, to get a copy of the 20-year vegetation data.

Logistic regression with missing data: logistic regression is a widely used technique in practical applications of data mining (e.g., in credit scoring). There is relatively little published work on how to use logistic regression when some of the input values are missing in the training data and in the test data. Survey the literature on logistic regression and missing data, then implement and evaluate different techniques for handling missing data in logistic regression. Systematically test these approaches on several well-known data sets. For example, you can simulate the effect of different amounts of missing data by randomly deleting different fractions of the input values, and then look at how the accuracy of the learned model degrades as the fraction of missing data increases.
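The simulation step described above can be sketched in a few lines. The example below deletes values completely at random and then fills them in with column means, which is only one simple baseline among the techniques you might compare; the function names and the tiny feature matrix are hypothetical, and fitting the logistic regression itself is left out.

```python
import random

def delete_at_random(rows, fraction, seed=0):
    """Copy rows, replacing each value by None with probability `fraction`."""
    rng = random.Random(seed)
    return [[None if rng.random() < fraction else v for v in row] for row in rows]

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column has no observed values at all.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

# Hypothetical input matrix: four examples, two input variables.
features = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
for fraction in (0.0, 0.25, 0.5):
    corrupted = delete_at_random(features, fraction, seed=1)
    completed = impute_column_means(corrupted)
    n_missing = sum(v is None for row in corrupted for v in row)
    print(fraction, n_missing, completed)
```

In an actual experiment the imputation means should be computed on the training data only and then reused for the test data, and the deletion would be repeated over several random seeds for each fraction.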

Information extraction from text data: The RESCUE project at UCI is a large NSF research project investigating how information technology can be used in disaster response situations (e.g., hurricanes, earthquakes). One aspect of this work is automatically extracting information from text data, e.g., names of people and places, detecting mentions of events, linking text to geographical locations, etc. There are a number of possible projects in this area. If interested, please contact Dr. Naveen Ashish at ashish@ics.uci.edu, and he can provide more details on possible project topics.

Predicting freeway traffic patterns: use the freeway traffic data (below) to build models that can predict freeway conditions at time t + k, where t is the current time and k varies from, say, 15 minutes to 30 minutes to 2 hours. The idea is to combine historical data with current conditions to come up with a best prediction. There is considerable spatial information in the data (from multiple sensors), so you could think of the problem as extrapolating in time a set of "space-time images" of freeway conditions. Each image has spatial position (along the freeway) on the y-axis and time on the x-axis, and each space-time point is a pixel that is color-coded by the speed or traffic density at that time at that location. In such images one can see how "waves" of traffic (e.g., during rush hour) evolve in space and time. Your goal would be to (in effect) predict how these traffic waves evolve, taking into account that accidents and other incidents can significantly change the overall behavior. You would evaluate your methods in terms of how well they predict on unseen future data, and systematically compare them to baseline methods such as predicting that there will be no change at all, or that the traffic at time t + k will be the same as at time t + k yesterday or last week. You may want to look at the PEMS Web pages at UC Berkeley for general information on traffic data analysis and various technical papers.
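The two baselines mentioned above (no change, and same time yesterday) are easy to score and give a reference point for any new method. The sketch below assumes a single sensor with equally spaced readings; the function names, the toy series, and the use of mean absolute error as the metric are assumptions made for the example.

```python
def mean_absolute_error(pairs):
    """Average |prediction - actual| over (prediction, actual) pairs."""
    pairs = list(pairs)
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

def evaluate_baselines(speeds, k, period):
    """Score two baselines for predicting speeds[t + k] at time t.

    speeds: readings from one sensor at equally spaced time steps.
    k: forecast horizon in time steps.
    period: number of time steps in one day.
    """
    if not 0 < k <= period:
        raise ValueError("horizon k must satisfy 0 < k <= period")
    if len(speeds) <= period:
        raise ValueError("need more than one full day of readings")
    # Score only target times that have a reading one full day earlier.
    targets = range(period, len(speeds))
    return {
        # "No change": the value at time t is the forecast for time t + k.
        "persistence": mean_absolute_error(
            (speeds[i - k], speeds[i]) for i in targets),
        # "Same time yesterday": the value one period before the target time.
        "same_time_yesterday": mean_absolute_error(
            (speeds[i - period], speeds[i]) for i in targets),
    }

# Hypothetical speeds (mph): three identical "days" of four readings each.
daily_pattern = [60, 30, 60, 65]
speeds = daily_pattern * 3
scores = evaluate_baselines(speeds, k=1, period=4)
print(scores)
```

On this artificially periodic series the same-time-yesterday baseline is perfect, which is the kind of behavior a proposed model would have to beat on real data, where incidents break the daily pattern.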

Clustering Web users and/or predicting their navigation paths: use the ICS Web log data (below), or any other Web log data (e.g., the MSNBC Web log data used in the Cadez, Heckerman, et al., 2003 paper), to develop an algorithm to better understand how users traverse a Web site. For example, develop an extension to the Markov clustering techniques described in Cadez et al. (2003), or develop an algorithm that tries to predict what future pages a user will visit. Note that a simple baseline here, such as just predicting that a user will follow the most travelled hyperlink from the current Web page, can be hard to beat in terms of accuracy.
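The "most travelled hyperlink" baseline mentioned above amounts to a first-order Markov model that always predicts the most frequent next page. A minimal sketch follows; the session data and page names are invented, and the function names are placeholders rather than part of any existing tool.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count page-to-next-page transitions over all training sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, page):
    """Most frequently followed page after `page`, or None if never seen."""
    followers = counts.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

def accuracy(counts, sessions):
    """Fraction of held-out transitions whose next page is predicted correctly."""
    hits = total = 0
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            total += 1
            hits += predict_next(counts, current) == nxt
    return hits / total if total else 0.0

# Hypothetical sessions: each is the ordered list of pages one user visited.
train = [
    ["home", "courses", "ics278"],
    ["home", "courses", "ics278"],
    ["home", "people"],
    ["courses", "ics278", "projects"],
]
test = [["home", "courses", "projects"]]
model = fit_transitions(train)
print(predict_next(model, "home"), accuracy(model, test))
```

Any proposed prediction algorithm should be evaluated on the same held-out sessions as this baseline so that the comparison is fair.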


These are just a few sample ideas to get you started. Feel free to also propose your own ideas if you prefer.

 

Possible Project Data Sets

Below are some links to sample data sets that you are free to use in projects.  These data sets are just suggestions - you are free to propose to use other data sets.

Web navigation data, from ICS Web logs.

CiteSeer digital library data, containing both
        -   text from a large number of computer science abstracts, with dates, years, citations, etc.

        -   a large co-author network and a co-citation network derived from the CiteSeer digital library data

Enron email data set - approximately 250,000 emails obtained from Enron as part of the Department of Justice legal inquiry against the company.

US demographic data at the ZIP code level

Freeway traffic sensor data

Microarray gene expression data: a well-known data set of expression data from yeast genes over time. Note that many other microarray data sets are available; this is just one of the more well-known ones.

IMPORTANT! Some of these data sets are only to be used for an ICS 278 project. Specifically, you are not allowed to redistribute data sets that are not available from public Web sites to anyone outside the class, or to use them after the class is over. If you would like an exception to this policy, please contact me directly with your request. We have these constraints because some of the data sets are not in the public domain.