INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015
Text Analysis and Indexing
In this assignment you will explore a corpus of email messages about the Enron scandal. The corpus consists of the messages that were sent by or to Enron executives in the months preceding the scandal. It's a 400+ MB file, tarred and gzipped, that you can download from
https://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz. A description of the data is available at
https://www.cs.cmu.edu/~./enron/. You are going to play the role of a data analyst hired to examine the evidence by doing some simple fact-finding and data processing.
Part 1. Context (20 points)
Find out about the Enron scandal, for example by finding documentaries on YouTube or text articles online. (One good one, called "The Smartest Guys in the Room," is reportedly available on Netflix.)
Create a PDF document named context.pdf that includes:
Part 2. Quantify the data (40 points)
Unzip and untar the data file; if you don't know how to do this, find out. Then quantify the evidence by finding the answers to these questions (keeping notes on the steps you take):
Create a PDF document named quant.pdf that includes answers to those questions and, for each, a description of the process you followed to find those answers. If you used scripts or programs for this part, include them in a zipped folder called part2.
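As a starting point for the unpacking step, here is a minimal Python sketch that extracts a gzipped tar archive and counts the files it contains. The function names and the destination directory are hypothetical, chosen only for illustration; counting files is just one example of a quantity you might measure, not necessarily one of the required questions.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Unpack a .tgz (gzipped tar) archive into dest_dir."""
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(dest_dir)


def count_files(root):
    """Count regular files under root, walking all subdirectories."""
    return sum(len(files) for _, _, files in os.walk(root))


# Example use (assumes the archive was downloaded to the current directory;
# "enron_data" is a hypothetical destination folder):
#   extract_archive("enron_mail_20110402.tgz", "enron_data")
#   print(count_files("enron_data"))
```

The same result can be obtained from a command line with standard tools; whichever route you take, record the exact steps in your notes.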
Part 3 [Developer]. Index the data (40 points)
Create inverted indices for the entire set of data files in the manner explained below. Here are some general notes about these indices:
.txt) files so the TA can read them directly.
\t is a tab character):
Call this index file index_plain.txt.
amendment allen-p/_sent_mail/465.:1:34 stclair-c/sent/993.:5:45,60,76,84,100
index_compressed.txt. [You can create this second index by parsing the files again or create it from the first index.]
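The plain index line shown in the example above (a term followed by one entry per document of the form path:count:positions) can be sketched in Python as follows. This is only an illustration under stated assumptions: the tokenizer (lowercased runs of letters and digits), the use of zero-based token positions, and the tab between entries are guesses not confirmed by this handout, and the names build_index and format_line are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a token is a lowercased run of letters or digits.
TOKEN = re.compile(r"[a-z0-9]+")


def build_index(docs):
    """Map each term to {document path: [token positions]}.

    docs maps a document path to its text. Positions are zero-based
    token offsets (an assumption; the assignment may define them differently).
    """
    index = defaultdict(dict)
    for path in sorted(docs):
        for pos, match in enumerate(TOKEN.finditer(docs[path].lower())):
            index[match.group()].setdefault(path, []).append(pos)
    return index


def format_line(term, postings):
    """Render one index line: term, then path:count:pos,pos,... per document."""
    entries = []
    for path in sorted(postings):
        positions = postings[path]
        entries.append(f"{path}:{len(positions)}:{','.join(map(str, positions))}")
    # Assumption: a tab separates the term from the entries and the entries from each other.
    return term + "\t" + "\t".join(entries)


# Two tiny made-up documents, for illustration only.
docs = {"a/1.": "the amendment passed", "b/2.": "amendment to the amendment"}
index = build_index(docs)
for term in sorted(index):
    print(format_line(term, index[term]))
```

Holding the whole index in a dictionary is fine for a toy input like this; for the full corpus you may need to write partial indices to disk and merge them.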
Part 3 [Analyst]. Estimation (40 points)
Create a PDF document called estimation.pdf that includes responses to the following:
index_plain.txt file described in Part 3 [Developer] above, estimate the size of that file in megabytes. State all of your assumptions and your calculations. You may use scripts, programs, or other tools to help create your estimate, but of course you must disclose what you used.
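One way to organize such an estimate is to break the file into its parts: the term at the start of each line, the per-document entry headers, and the position lists. The Python sketch below does that arithmetic. It is one possible approach, not a prescribed method; the function name, the parameter names, and the default averages are all placeholders that you would have to replace with figures you measure or justify yourself, and it assumes one separator character before each entry.

```python
def estimate_index_bytes(vocab_size, total_postings, total_tokens,
                         avg_term_len=8, avg_path_len=30,
                         avg_count_digits=1, avg_position_digits=3):
    """Estimate the size in bytes of a plain-text positional index.

    vocab_size      distinct terms (one line each)
    total_postings  number of (term, document) pairs
    total_tokens    total token occurrences (one position each)
    All default averages are hypothetical placeholders, not measured values.
    """
    # Each line: the term plus a trailing newline.
    term_bytes = vocab_size * (avg_term_len + 1)
    # Each entry: separator + path + ':' + count + ':'.
    header_bytes = total_postings * (1 + avg_path_len + 1 + avg_count_digits + 1)
    # Each position's digits, plus commas between positions within an entry.
    position_bytes = total_tokens * avg_position_digits + (total_tokens - total_postings)
    return term_bytes + header_bytes + position_bytes


# Sanity check against one hand-counted line:
# "amendment\ta/1.:1:1\tb/2.:2:0,3\n" is 30 bytes long.
check = estimate_index_bytes(1, 2, 3, avg_term_len=9, avg_path_len=4,
                             avg_count_digits=1, avg_position_digits=1)
print(check)  # 30
```

Dividing the result by 1,000,000 (or by 2**20, if you define a megabyte that way) gives megabytes; say which definition you used.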
Submitting your assignment: You will submit your work via Checkmate. For groups of two or three, just one of you should submit all parts of the assignment; the names of all group members must appear near the top of every submitted file.
First, submit your context.pdf file from Part 1.
Second, submit your quant.pdf file and your zipped part2 folder from Part 2.
Third, for Part 3 [Developer], submit your index_plain.txt file, optionally your index_compressed.txt file, a zipped folder called programs containing your program(s), and optionally (for those who do part 2) a file called encoding.pdf that explains your encoding.
Third, for Part 3 [Analyst], submit your estimation.pdf file and optionally a zipped folder called programs containing any [non-mainstream, non-public] code you wrote or used for this part.