Information Retrieval Assignments

INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015

Assignment #4
Text Analysis and Indexing

In this assignment you will be exploring a corpus of Email messages about the Enron scandal. The corpus consists of the messages that were sent by or to Enron executives in the months preceding the scandal. It's a 400+MB file, tarred and gzipped, that you can download from https://www.cs.cmu.edu/~./enron/enron_mail_20110402.tgz. A description of the data is available at https://www.cs.cmu.edu/~./enron/. You are going to play the role of a data analyst hired to examine the evidence by doing some simple fact-finding and data processing.

General specifications:

You may do this assignment individually or in groups of 2 or 3. As before, the expectations of each size group are the same; shared labor is offset by communication and coordination costs. All group members receive the same score except in truly extraordinary circumstances.
We know there is great diversity in programming skill among class members. Moreover, we recognize that not everyone in the class is in a major or has a career goal that involves significant software development. Thus, Part 3 of this assignment has two alternative options that we'll call "Developer" and "Analyst." Your group may choose either. The Developer option is for students who are secure in their programming ability. The Analyst option is for those whose programming skills are less strong. We caution you that "Analyst" does not mean "easy," as you will see.
You may use Java, Python, or Scheme/Racket to write the tools you need for your analysis.
Use Piazza for general questions whose answers can benefit everybody.

Part 1. Context (20 points)

Find out about the Enron scandal, for example by finding documentaries on YouTube or text articles online. (One good one, called "The Smartest Guys in the Room," is reportedly available on Netflix.)

Create PDF document named context.pdf that includes:

A concise summary of the scandal (maximum half a page)
A list of each main player with their organizational affiliation, title (if any), and role in the scandal
The year in which the main events unfolded

Part 2. Quantify the data (40 points)

Unzip and untar the data file; if you don't know how to do this, find out. Then quantify the evidence by finding the answers to these questions (keeping notes on the steps you take):

Create a PDF file called quant.pdf that includes answers to those questions and, for each, a description of the process you followed to find those answers. If you used scripts or programs for this part, include them in a zipped folder called part2.
How many people are targeted in this data set? (We're just asking about the folder structure, not about the people mentioned in the emails themselves.)
How many individual data files are we dealing with?
How many messages were sent by these people in total? (Explain how you're interpreting the term "sent.")
How many messages were sitting in these people's Inboxes in total? (Explain how you're determining what counts as an "inbox.")
Who are the 10 people with the largest number of data files?

Part 3 [Developer]. Index the data (40 points)

Create inverted indices for the entire set of data files in the manner explained below. Here are some general notes about these indices:

For the purposes of this homework, create these indices as ASCII (.txt) files so the TA can read them directly.
Filter out email header words.
Filter out English stop words (see http://www.ranks.nl/resources/stopwords.html).
Use the Porter Stemming algorithm to "normalize"the terms (see http://tartarus.org/martin/PorterStemmer/).
Place each posting on a separate line.
Create each posting to follow this format (where \t is a tab character):
```
<term>[\t<doc>:<frequency>:<position>[,<position>]*]+
```
Order terms alphabetically in the index file.
Within each posting, order document lists alphabetically.

(Required) Create an index that uses human-readable terms and document identifiers. For example:
```
amendment   allen-p/_sent_mail/465.:1:34    stclair-c/sent/993.:5:45,60,76,84,100
```
Call this index file index_plain.txt.

(Extra credit, maximum 10 points) Create a second index that uses some encoding of, at least, terms and document identifiers in a way that decreases the size of the index file significantly (at least 20%). You may also compress the position information. Call this index file index_compressed.txt. [You can create this second index by parsing the files again or create it from the first index.]

Part 3 [Analyst]. Estimation (40 points)

Create a PDF document called estimation.pdf that includes responses to the following:

Find all documents that have the terms "conflict of interest", independent of capitalization. Explain what you did to get your answer.
Estimate how many distinct terms are in this data set. Explain the reasoning behind your estimate. If you used scripts or programs, explain what they do.
With respect to the index_plain.txt file described in Part 3 [Developer] above, estimate the size of that file in megabytes. State all of your assumptions and your calculations. You may use scripts, programs, or other tools to help create your estimate, but of course you must disclose what you used.

Submitting your assignment: Your will submit your work via Checkmate. For groups of two or three, just one of you should submit all parts of the assignment; the names of all group members must appear near the top of every submitted file.
First, submit your context.pdf file from Part 1.
Second, submit your quant.pdf file and your zipped part2 folder from Part 2.
Third, for Part 3 [Developer], submit your index_plain.txt file, optionally your index_compressed.txt file, a zipped folder called programs containing your program(s), and optionally (for those who do part 2) a fille called encoding.pdf that explains your encoding.
Third, for Part 3 [Analyst], submit your estimation.pdf file and optionally a zipped folder called programs containing any [non-mainstream, non-public] code you wrote or used for this part.

David G. Kay, kay@uci.edu
Sunday, February 15, 2015 2:55 PM