Each student will do an individual project. There are typically a few different types of projects that students do:
1. Data-Focused: Analyze a
specific data set in detail, using whatever data mining tools and
software are available to you (e.g., there is a lot of
free software on the Web for research in data mining and data
analysis). The emphasis here will be on
the data, and you need not necessarily implement your own algorithms.
Some possible data sets for you to work on are provided at the bottom
of this Web page. An example of this type of project would be to
analyze the Enron email data set.
2. Algorithm Development:
Develop a new data mining algorithm that is tailored specifically to
one of the
project data sets. This requires that you understand the data in fair
detail,
but perhaps do not explore all aspects of the data. You see an
opportunity to
look for some interesting information in the data, but you know of no
existing algorithm that does this, so you implement your own.
You will
also need to compare to at least one baseline technique to validate
your
approach. An example of such a project is the change detection using
vegetation image data below, or the freeway traffic prediction problem.
3. Empirical Study: take some
well-known published data mining algorithms and evaluate their
performance in detail and systematically on a number of different data
sets. An example below is a systematic comparison of well-known
state-of-the-art classifiers on multiple data sets.
Data Sets, Algorithms, and Task Selection: due Thursday April 20th. Required: clear description of the task, assessment of what will be required to complete the task (outline of algorithms needed, whether you will implement them yourself or use existing software), initial brief literature review, brief plan of how you will evaluate the results.
Midterm progress report 1: due
Thursday May 4th. Required: results of exploratory data analysis,
results of data checking for anomalies and obvious errors, results of
in-depth literature review, revision (if needed) of any aspects of the
plan at this point.
Midterm progress report 2: due
Thursday May 18th. Required: preliminary results and
assessment.
Student project presentations:
June 6th and 8th in class
Final project report: due Monday June 12th.
Empirical comparison of
"state-of-the-art" classifiers: there are relatively few published
studies of systematic comparisons of current competing "state of the
art" classification methods (such as logistic regression, boosting,
SVMs, and decision trees). Your project would be to develop (or
obtain) implementations of the algorithms in code, decide on a set of
benchmark data sets to use for your project, and then conduct a series
of experiments that compare the different classifiers across the
different data sets. The "null hypothesis" would be that they all
perform similarly and you would test for systematic differences in
performance. You would need to become knowledgeable in the details of
how each classifier operates.
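One simple way to test the "null hypothesis" for a pair of classifiers is a sign test on their per-data-set accuracies. The sketch below is only an illustration: the function name and the accuracy values are hypothetical, and the sign test is one option among several (it ignores the size of each difference).

```python
from math import comb

def sign_test_p_value(acc_a, acc_b):
    """Two-sided sign test on paired per-data-set accuracies.

    Ties are dropped. Under the null hypothesis, each classifier is
    equally likely to win on any given data set.
    """
    wins_a = sum(a > b for a, b in zip(acc_a, acc_b))
    wins_b = sum(b > a for a, b in zip(acc_a, acc_b))
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical accuracies on five benchmark data sets (illustrative only).
logistic = [0.81, 0.77, 0.90, 0.68, 0.85]
boosting = [0.84, 0.79, 0.91, 0.70, 0.88]
p_value = sign_test_p_value(logistic, boosting)
print(p_value)
```

With only five data sets, even a clean sweep by one classifier gives a fairly weak p-value, which is one argument for using a larger set of benchmarks.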
Develop an algorithm for summarizing
and visualizing how nodes are related in large networks: using the
CiteSeer coauthor network as an example, or any other large network
(e.g., order of 50,000 or more nodes), your goal is to develop an
algorithm that allows a user to select any 2 (or more) nodes and in
"real-time" (or close to it) compute and display information (e.g., in
the form of a subgraph) that summarizes how the selected nodes are
related. Obvious ideas such as shortest paths are relevant, but you
will also need to explore more informative ways of characterizing how
nodes are related, for example: see the papers by Mark Newman on
"betweenness" calculations, or the paper on relative importance in
networks (White and Smyth, 2003).
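As a starting point for the shortest-path idea, the following sketch extracts the subgraph formed by all shortest paths between two selected nodes, using one breadth-first search from each node. It assumes an undirected, unweighted graph stored as an adjacency dictionary; the toy graph and function names are hypothetical, and this is a baseline rather than one of the more informative measures cited above.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def shortest_path_subgraph(adj, s, t):
    """Edges (u, v), oriented from s toward t, on at least one shortest path."""
    dist_s = bfs_distances(adj, s)
    dist_t = bfs_distances(adj, t)
    if t not in dist_s:
        return set()  # s and t are not connected
    d = dist_s[t]
    edges = set()
    for u in dist_s:
        for v in adj.get(u, ()):
            # Edge u-v is on a shortest path if going s..u, u-v, v..t costs d.
            if v in dist_t and dist_s[u] + 1 + dist_t[v] == d:
                edges.add((u, v))
    return edges

# Toy co-author graph (hypothetical).
adj = {
    "a": ["b", "c", "f"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
    "f": ["a"],
}
subgraph = shortest_path_subgraph(adj, "a", "e")
print(sorted(subgraph))
```

Two breadth-first searches cost time linear in the number of edges, so this baseline is plausible for near "real-time" use on a network with tens of thousands of nodes.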
Learning how networks evolve over
time: again for large networks, but now where the edges have
time-stamps on them, e.g., the year a paper was written for a co-author
network, or the time down to the seconds when person A sent an email to
a set of people, or the day on which a hyperlink was added to a Web
page. The idea here is to try to learn some models for how these networks
are evolving over time - are there clusters that new nodes get "drawn"
into? are there nodes that move around the network making new contacts
and losing old ones? and so on. You could evaluate your algorithm by
seeing how well it predicts future properties of the network (unseen by
the model) given the historical data. Possible papers of relevance here
include Barabasi and Albert, Science, 1999 (see also later references
to this paper) and Leskovec, Kleinberg, and Faloutsos (ACM SIGKDD,
2005).
Investigate
the Enron email corpus (or
another text corpus) using statistical topic models: The
topic modeling algorithm is available in MATLAB from Professor Mark
Steyvers' Web page. You could use the topic modeling algorithm to
automatically assign topics to each of the emails in the Enron
corpus. You could then look at (for example) how topics evolve/change
over time: do some topics persist through the whole time-period? do
some topics come and go? do some topics "recur" frequently? You could
also look at "network effects" of different topics: do some topics tend
to involve more people in the email discussion? do some topics
typically have longer threads? There are numerous other aspects of this
data that would be worth investigating.
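Once each email has a topic, looking at how topics change over time can start with a simple tabulation. The sketch below assumes that each email has already been reduced to a (period, topic) pair, e.g., the month it was sent and its single most probable topic; that reduction, the function name, and the toy labels are all assumptions made for illustration, not output of the actual topic modeling code.

```python
from collections import Counter, defaultdict

def topic_share_by_period(assignments):
    """Fraction of each period's emails assigned to each topic.

    assignments: iterable of (period, topic) pairs, one per email.
    Returns {period: {topic: fraction}} with periods in sorted order.
    """
    counts = defaultdict(Counter)
    for period, topic in assignments:
        counts[period][topic] += 1
    shares = {}
    for period in sorted(counts):
        total = sum(counts[period].values())
        shares[period] = {t: n / total for t, n in counts[period].items()}
    return shares

# Hypothetical (month, topic) assignments for four emails.
assignments = [
    ("2001-01", "energy"),
    ("2001-01", "energy"),
    ("2001-01", "legal"),
    ("2001-02", "legal"),
]
shares = topic_share_by_period(assignments)
print(shares)
```

A topic that persists would show a roughly constant share across periods, while a topic that comes and goes would show a share that rises and then falls back toward zero.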
Detecting
changes in global vegetation images over 20 years: Use the
global 20-year vegetation data set (the NDVI data) to develop an
algorithm that can detect how spatial "patches" of vegetation are
changing over the 20-year period. For example, if you look at the
global images over a 1-year period, one can easily see the seasonal
changes in vegetation in different parts of the planet. You could
develop image segmentation algorithms to calculate the boundaries of
vegetation regions in different continents (e.g., grasslands in
Africa), track these boundaries over time, and then detect changes
across years (sudden differences in size, shape, location, or
gradual trends). Potentially relevant papers include Zhang XY, Friedl, MA, Schaaf CB, et al.,
Monitoring vegetation phenology using MODIS, Remote Sensing of the
Environment, 2003, and Zhang, Friedl, et al, Climate controls on
vegetation phenological patterns in northern mid- and high latitudes
inferred from MODIS data, Global Change Biology, 2004. Contact
Lucas Scharenbroich, lscharen@uci.edu, to get a copy of the 20-year
vegetation data.
Logistic
regression with missing data: logistic regression is a widely
used technique in practical applications of data mining (e.g., in
credit scoring, etc.). There is relatively little work published on how
to use logistic regression when some of the input values are missing
in the training data and in the test data. Survey the literature on
logistic regression and missing data and implement and evaluate
different techniques for handling missing data in logistic regression.
Systematically test these approaches on several well-known data sets,
where for example you can simulate the effect of different amounts of
missing data by randomly deleting different fractions of the input
values - you can then look at how the accuracy of the learned
model degrades as the fraction of missing data increases.
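The simulation step can be sketched as follows. The code deletes input values completely at random (class labels are left untouched) and fills the gaps with column means, which is just one simple baseline among the techniques you might survey. The function names and the tiny data matrix are hypothetical.

```python
import random

def delete_at_random(rows, fraction, seed=0):
    """Copy of rows in which each entry is set to None with probability `fraction`."""
    rng = random.Random(seed)
    return [[None if rng.random() < fraction else x for x in row] for row in rows]

def mean_impute(rows):
    """Replace each None with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if an entire column happens to be missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if x is None else x for j, x in enumerate(row)] for row in rows]

# Hypothetical input matrix: three examples, two input variables.
inputs = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
for fraction in (0.0, 0.25, 0.5):
    degraded = delete_at_random(inputs, fraction, seed=1)
    completed = mean_impute(degraded)
    n_missing = sum(x is None for row in degraded for x in row)
    print(fraction, n_missing, completed)
```

You would then fit logistic regression to each completed copy and record test accuracy against the fraction deleted, repeating over several random seeds to get a sense of the variability.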
Information
extraction from text data: The RESCUE project at UCI is a large
NSF research project investigating how information technology can be
used in disaster response situations (e.g., hurricanes, earthquakes).
One aspect of this work is in automatically extracting information from
text data, e.g., names of people and places, detecting mentions of
events, linking text to geographical locations, etc. There are a number
of possible projects in this area. If interested, please contact Dr.
Naveen Ashish at ashish@ics.uci.edu, and he can provide more details on
possible project topics.
Predicting
freeway traffic patterns: use the freeway traffic data (below)
to build models that can predict freeway conditions at time t + k,
where t is the current time and k varies say from 15 minutes to 30
minutes to 2 hours. The idea is to combine historical data with current
conditions to come up with a best prediction. There is considerable
spatial information in the data (from multiple sensors) so you could
think of the problem as extrapolating in time a set of "space-time
images" of freeway conditions. Each image has spatial position (along
the freeway) on the y-axis, time on the x-axis, and at any space-time
point is a pixel that is color-coded by the speed or traffic density at
that time at that location. In such images one can see how "waves" of
traffic (e.g., during rush hour) evolve in space and time. Your goal
would be to (in effect) predict how these traffic waves evolve, taking
into account that accidents and other incidents can significantly
change the overall behavior. You would evaluate your methods in terms
of how well they can predict on unseen future data and systematically
compare them to baseline methods such as predicting that there will be
no change at all, or that the traffic at time t+k will be the same as at
time t+k yesterday or last week. You may want to look at the PEMS Web
pages at UC Berkeley for general information on traffic data analysis
and various technical papers.
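Both baselines mentioned above are "lag" predictors: the value at some time is predicted by the value observed a fixed number of samples earlier. A minimal sketch of scoring them by mean absolute error on a single sensor is given below. The function name and the speed readings are hypothetical, and the toy series assumes four samples per "day" purely to keep the example short.

```python
def lag_baseline_mae(series, lag):
    """Mean absolute error of predicting series[u] with series[u - lag]."""
    errors = [abs(series[u] - series[u - lag]) for u in range(lag, len(series))]
    return sum(errors) / len(errors)

# Hypothetical speeds (mph) at one sensor, four samples per day, three days.
speeds = [60, 30, 55, 65,
          60, 28, 55, 65,
          62, 30, 55, 65]

# "No change" baseline for k = 1 sample ahead.
mae_no_change = lag_baseline_mae(speeds, 1)
# "Same time yesterday" baseline (lag of one day = 4 samples here).
mae_yesterday = lag_baseline_mae(speeds, 4)
print(mae_no_change, mae_yesterday)
```

On strongly periodic data such as this toy series, the "same time yesterday" baseline is much harder to beat than the "no change" baseline, so it is worth reporting both.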
Clustering
Web users and/or predicting their navigation paths: use the ICS
Web log data (below), or any other Web log data (e.g., the MSNBC Web
log data used in the Cadez, Heckerman, et al, 2003 paper), to develop
an algorithm to better understand how users traverse a Web site, e.g.,
develop an extension to the Markov clustering techniques described in
Cadez et al (2003) or develop an algorithm that can try to predict what
future pages a user will visit (note that a simple baseline here of
just predicting that a user will follow the most travelled hyperlink
from the current Web page, and so on, can be hard to beat in terms of
accuracy).
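The simple baseline mentioned above can be sketched as a first-order transition count: for each page, remember which next page was followed most often in the training sessions. The session format (a list of page names per user visit), the function names, and the toy sessions are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count page-to-next-page transitions over all training sessions."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            counts[current][nxt] += 1
    return counts

def predict_next(counts, page):
    """Most travelled link out of `page`, or None if the page was never seen."""
    if page not in counts:
        return None
    return counts[page].most_common(1)[0][0]

def accuracy(counts, sessions):
    """Fraction of transitions in `sessions` that the baseline predicts correctly."""
    hits = total = 0
    for pages in sessions:
        for current, nxt in zip(pages, pages[1:]):
            total += 1
            hits += predict_next(counts, current) == nxt
    return hits / total if total else 0.0

# Hypothetical sessions; page names are made up.
train = [
    ["home", "research", "people"],
    ["home", "research", "pubs"],
    ["home", "people"],
    ["research", "people"],
]
test = [["home", "research", "people"], ["home", "people"]]
counts = fit_transitions(train)
baseline_accuracy = accuracy(counts, test)
print(baseline_accuracy)
```

Any clustering-based or higher-order predictor you develop can be scored with the same `accuracy` function on held-out sessions, which makes the comparison with this baseline direct.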
These are just a few sample ideas to get
you started. Feel free to also propose your own ideas if you prefer.
Below are some links to sample data sets that you are free to use in
projects. These data sets are just suggestions - you are free to
propose to use other data sets.
Web navigation data, from
ICS Web logs.
CiteSeer digital library data, containing both
- text from a large number of computer science abstracts, with dates,
years, citations, etc.
- a large co-author network and a co-citation network derived from the
CiteSeer digital library data
Enron email
data set - approximately
250,000 emails obtained from Enron as part of the Department of Justice
legal inquiry against the company.
US demographic data at
the ZipCode level
Microarray
gene expression data:
well-known data set on expression data from Yeast genes over time. Note
that there are many other microarray data sets available; this is just
one of the more well-known data sets.
IMPORTANT! Some of these data sets are only to be used for an ICS 278 project. Specifically, for data sets that are not available from public Web sites, you are not allowed to redistribute these to anyone outside the class, or to use them after the class is over. If you would like an exception to this policy, please contact me directly with your request. We have these constraints because some of the data sets are not public domain.