INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015
Assignment #4
Text Analysis and Indexing
In this assignment you will be exploring a corpus of Email messages about the Enron scandal. The corpus consists of the messages that were sent by or to Enron executives in the months preceding the scandal. It's a 400+MB file, tarred and gzipped, that you can download from https://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz. A description of the data is available at https://www.cs.cmu.edu/~./enron/. You are going to play the role of a data analyst hired to examine the evidence by doing some simple fact-finding and data processing.
General specifications:
Part 1. Context (20 points)
Find out about the Enron scandal, for example by finding documentaries on YouTube or text articles online. (One good one, called "The Smartest Guys in the Room," is reportedly available on Netflix.)
Create a PDF document named context.pdf that includes:
Part 2. Quantify the data (40 points)
Unzip and untar the data file; if you don't know how to do this, find out. Then quantify the evidence by finding the answers to these questions (keeping notes on the steps you take):
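As one possible starting point for the unpacking step, here is a minimal sketch using Python's standard tarfile module. The function names extract_archive and count_files, and any paths you pass to them, are hypothetical and chosen only for illustration; the command-line tar tool would work equally well.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Unpack a gzipped tar archive into dest_dir and return that directory."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:gz" opens a gzip-compressed tar file for reading.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest)
    return dest


def count_files(root):
    """Count regular files (not directories) anywhere under root."""
    return sum(1 for p in Path(root).rglob("*") if p.is_file())
```

For example, calling extract_archive("enron_mail_20110402.tgz", "data") and then count_files("data") would give a tally of the unpacked files. Whatever tool you use, record it in your notes.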
Create a PDF document called quant.pdf that includes answers to those questions and, for each, a description of the process you followed to find those answers. If you used scripts or programs for this part, include them in a zipped folder called part2.
Part 3 [Developer]. Index the data (40 points)
Create inverted indices for the entire set of data files in the manner explained below. Here are some general notes about these indices:
.txt) files so the TA can read them directly.
http://www.ranks.nl/resources/stopwords.html).
http://tartarus.org/martin/PorterStemmer/).
\t is a tab character):
<term>[\t<doc>:<frequency>:<position>[,<position>]*]+
amendment allen-p/_sent_mail/465.:1:34 stclair-c/sent/993.:5:45,60,76,84,100
Call this index file index_plain.txt.
index_compressed.txt. [You can create this second index by parsing the files again or create it from the first index.]
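To make the index_plain.txt line format concrete, here is a minimal sketch of building positional postings and writing one line per term. It assumes that documents arrive as already-tokenized lists of terms and that a position is a 0-based token offset; the assignment text does not pin either of these down, so treat both as assumptions. The names build_index and format_line are hypothetical, and stop-word removal and stemming are left out.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}.

    docs maps a document id to its list of tokens. Positions are
    0-based token offsets (an assumption, not part of the spec).
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


def format_line(term, postings):
    """Render one line as <term>[\\t<doc>:<frequency>:<position>[,<position>]*]+"""
    parts = [term]
    for doc_id in sorted(postings):
        positions = postings[doc_id]
        joined = ",".join(str(p) for p in positions)
        # The frequency is simply the number of recorded positions.
        parts.append(f"{doc_id}:{len(positions)}:{joined}")
    return "\t".join(parts)
```

Passing the term amendment with the postings {"allen-p/_sent_mail/465.": [34], "stclair-c/sent/993.": [45, 60, 76, 84, 100]} to format_line reproduces the sample line given in the format description, with tabs between the term and each document entry.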
Part 3 [Analyst]. Estimation (40 points)
Create a PDF document called estimation.pdf that includes responses to the following:
index_plain.txt file described in Part 3 [Developer] above, estimate the size of that file in megabytes. State all of your assumptions and your calculations. You may use scripts, programs, or other tools to help create your estimate, but of course you must disclose what you used.
Submitting your assignment: You will submit your work via Checkmate. For groups of two or three, just one of you should submit all parts of the assignment; the names of all group members must appear near the top of every submitted file.
First, submit your context.pdf file from Part 1.
Second, submit your quant.pdf file and your zipped part2 folder from Part 2.
Third, for Part 3 [Developer], submit your index_plain.txt file, optionally your index_compressed.txt file, a zipped folder called programs containing your program(s), and optionally (for those who do part 2) a file called encoding.pdf that explains your encoding.
Third, for Part 3 [Analyst], submit your estimation.pdf file and optionally a zipped folder called programs containing any [non-mainstream, non-public] code you wrote or used for this part.