Computer Science 221: Information Retrieval
Winter 2009-2010
Department of Informatics
Donald Bren School of Information and Computer Sciences
Assignment 04
- Goals:
- This assignment is designed to:
- Teach you how to make a postings list with a distributed MapReduce architecture on a BIG cluster.
- Capture the data that you need to calculate ranking scores for assignment 05.
- Administration:
- You may work in teams of 1, 2, or 3; there are no restrictions on membership.
- Mitch will be managing the cluster on Amazon for us.
- This is going to require some instructions on how to upload and run jobs.
- Some details.
- Write a Java program to be executed by a Hadoop cluster of more than 50 nodes.
- The input to the system will be a list of <key,value> pairs in a file.
- The "key" will be a URL in Wikipedia
- The "value" will be the document ID of that URL
- The posting list will be on the distributed file system.
- There is no guarantee that the URLs are valid.
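The overall data flow described above can be sketched without a cluster. The following is a single-process simulation in plain Java of the map and reduce phases, not Hadoop code: a real job would implement Hadoop's Mapper and Reducer classes and would first fetch each page from its URL, neither of which is shown here. The class and method names (`PostingsSketch`, `mapDocument`, `buildIndex`) are hypothetical, integer document IDs are an assumption, and tokenization is reduced to lowercasing and splitting on whitespace.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class PostingsSketch {
    // "Map" phase for one document: emit a (term, tf) pair per distinct term.
    // Tokenization is deliberately simplified to whitespace splitting.
    static Map<String, Integer> mapDocument(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1, Integer::sum);
            }
        }
        return tf;
    }

    // "Shuffle" and "reduce" phases: group the mapped pairs by term and
    // collect docId -> tf for each term. TreeMap keeps terms and docIds sorted.
    static TreeMap<String, TreeMap<Integer, Integer>> buildIndex(Map<Integer, String> docs) {
        TreeMap<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            for (Map.Entry<String, Integer> e : mapDocument(doc.getValue()).entrySet()) {
                index.computeIfAbsent(e.getKey(), k -> new TreeMap<>())
                     .put(doc.getKey(), e.getValue());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new HashMap<>();
        docs.put(1, "Dogs chase dogs");
        docs.put(2, "dogs sleep");
        // Prints each term with its docId -> tf postings.
        System.out.println(buildIndex(docs));
    }
}
```

On a real cluster the grouping by term is done by the framework between the two phases; the nested loop here only stands in for that step.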
- What to turn in:
- Submit a sorted postings.txt file in which each line contains postings of a term:
- A term is defined as follows:
- Don't consider text in the HTML tags.
- Replace all non-ASCII characters with spaces.
- Make all the characters lowercase.
- All terms are separated by white space.
- All punctuation should be removed from the beginning and end of a term, but not from the middle.
- No term should be longer than 32 characters. If a term is longer, truncate it to the first 32 characters.
- Examples:
- "However", "however", "however," should all map to "however"
- "Dogs", "dogs", and "dOgS" should all map to "dogs"
- "Dog's", "dog's", and "dog'S" should all map to "dog's"
- "jack-o-lantern." should map to "jack-o-lantern"
- "他⃞四⃞六⃞40ㅈ20ㅁ" should map to two terms "40" and "20"
- "1>alphabet" should map to "1>alphabet" (this is dumb, but for uniformity, we'll stick to the format)
- "010101010101010--010011101010101actgtgtacgatcgtagctggtagctcgtagcta--010100" should map to "010101010101010--010011101010101"
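The term rules and examples above can be sketched in Java as follows. This is a minimal sketch, not an official tokenizer: the class and method names are hypothetical, it assumes HTML tags have already been removed from the input, it treats any ASCII character other than a-z and 0-9 as punctuation, and it strips punctuation before truncating to 32 characters (the assignment does not say which comes first).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TermNormalizer {
    static final int MAX_TERM_LENGTH = 32;

    // Turns tag-free text into a list of normalized terms.
    static List<String> terms(String text) {
        // Replace every non-ASCII character with a space.
        String ascii = text.replaceAll("[^\\x00-\\x7F]", " ");
        // Lowercase with a fixed locale so results do not depend on the machine.
        String lower = ascii.toLowerCase(Locale.ROOT);

        List<String> out = new ArrayList<>();
        for (String token : lower.split("\\s+")) {
            // Strip punctuation from both ends, but leave the middle alone.
            String term = token.replaceAll("^[^a-z0-9]+|[^a-z0-9]+$", "");
            // Keep only the first 32 characters of long terms.
            if (term.length() > MAX_TERM_LENGTH) {
                term = term.substring(0, MAX_TERM_LENGTH);
            }
            // Tokens made entirely of punctuation become empty and are dropped.
            if (!term.isEmpty()) {
                out.add(term);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [however, dog's, jack-o-lantern, 1>alphabet]
        System.out.println(terms("However, Dog's jack-o-lantern. 1>alphabet"));
    }
}
```

With this ordering a truncated term can still end in punctuation if character 32 happens to be a punctuation mark; whether to strip again after truncation is a choice the rules leave open.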
- The form of the posting list should be:
- [term, cf, df] [docid1, tf1] [docid2, tf2] ....
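- where cf is the collection frequency of the term (its total number of occurrences across all documents) and df is the document frequency of the term (the number of documents that contain it);
- tf1 is the term frequency of "term" in the document with docid1, and so on.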
- Before uploading the posting list, truncate it and only keep the LAST 1000 lines.
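The line format and the truncation step above can be sketched like this. It is only an illustration under stated assumptions: the class and method names are hypothetical, document IDs are assumed to be integers, and listing postings in ascending docid order is a choice made here, not something the assignment requires. cf is computed as the sum of the tf values and df as the number of documents in the postings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PostingLines {
    // Builds one line: [term, cf, df] [docid1, tf1] [docid2, tf2] ...
    static String format(String term, Map<Integer, Integer> tfByDoc) {
        long cf = 0;
        StringBuilder postings = new StringBuilder();
        // Copy into a TreeMap so postings come out in ascending docid order.
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(tfByDoc).entrySet()) {
            cf += e.getValue();
            postings.append(" [").append(e.getKey())
                    .append(", ").append(e.getValue()).append("]");
        }
        int df = tfByDoc.size();
        return "[" + term + ", " + cf + ", " + df + "]" + postings;
    }

    // Keeps only the last n lines (all of them if there are fewer than n).
    static List<String> keepLast(List<String> lines, int n) {
        return lines.subList(Math.max(0, lines.size() - n), lines.size());
    }

    public static void main(String[] args) {
        Map<Integer, Integer> tf = new HashMap<>();
        tf.put(7, 2);
        tf.put(3, 1);
        // Prints [dogs, 3, 2] [3, 1] [7, 2]
        System.out.println(format("dogs", tf));
    }
}
```

For the submission, `keepLast(sortedLines, 1000)` would be applied to the fully sorted list of lines.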
- Submit a second report with this content:
- Total number of distinct terms.
- List of top 10 words with lowest document frequency.
- List of top 10 words with highest document frequency.
- First Name of member 1 of your team: ...
- List of (at most 1000) URLs containing first name of member 1 of your team: ...
- If there are no pages with your team member's name, use "Donald" instead and tell me you did that.
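The term statistics for the report can be derived from a map of term to document frequency. The sketch below assumes such a map is already available in memory; the class and method names are hypothetical, and breaking ties alphabetically is an assumption, since the assignment does not say how to order terms with equal document frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ReportStats {
    // Returns up to n terms ordered by document frequency.
    // lowestFirst = true gives the lowest-df terms, false the highest-df terms.
    // Ties are broken alphabetically (an assumption, not part of the assignment).
    static List<String> topByDf(Map<String, Integer> dfByTerm, int n, boolean lowestFirst) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(dfByTerm.entrySet());
        entries.sort((a, b) -> {
            int c = Integer.compare(a.getValue(), b.getValue());
            if (!lowestFirst) {
                c = -c;
            }
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        });
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            out.add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = new HashMap<>();
        df.put("the", 900);
        df.put("dogs", 40);
        df.put("zyzzyva", 1);
        System.out.println("Distinct terms: " + df.size());
        System.out.println("Lowest df: " + topByDf(df, 10, true));
        System.out.println("Highest df: " + topByDf(df, 10, false));
    }
}
```

Sorting the whole vocabulary is simple but not the cheapest option; for a very large term set, a bounded priority queue of size 10 would avoid the full sort.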
- Submitting your assignment
- Turn in your report to the dropbox created for this assignment on EEE.
- Please make two PDF documents.
- Make the file names:
- <StudentID>-<StudentIID>-Assignment04-Report.pdf
- <StudentID>-<StudentIID>-Assignment04-Posting.pdf
- Please include the full names of all of your group members as appropriate in the document