Computer Science 221: Information Retrieval
Winter 2009-2010
Department of Informatics
Donald Bren School of Information and Computer Sciences
Assignment 03
- Goals
- This assignment is designed to:
- Teach you how to build a postings list with a MapReduce architecture.
- Teach you how to use Hadoop before trying it on a Wikipedia-scale assignment.
- Administration:
- You may work in teams of 1, 2, or 3, but your team cannot be the same group as on a previous assignment.
- Write a Java Program To Be Executed by Hadoop
- The input to the system will be a list of URL and DocID pairs in a file.
- Here is the input file.
- The output of the system will be a postings list, alphabetized by term, covering the terms found in the pages at the input URLs.
- A text file with the term at the start of each line, followed by a list of all document URLs on which that term appears.
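The output format described above can be sketched in plain Java, without any Hadoop dependency, as a grouping step over <term, document> pairs. This is only an illustration of the logic a reducer would carry out, not the required implementation: the class name `PostingsReducerSketch`, the method `buildPostings`, and the comma-separated line format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of the assignment files.
final class PostingsReducerSketch {

    /**
     * Groups {term, document} pairs into one line per term.
     * Terms come out in alphabetical order because a TreeMap is used.
     * Document identifiers are de-duplicated and sorted as strings,
     * so "10" sorts before "2" (assumption: any stable order is acceptable).
     */
    static List<String> buildPostings(List<String[]> pairs) {
        Map<String, SortedSet<String>> index = new TreeMap<>();
        for (String[] pair : pairs) {
            // pair[0] is the term, pair[1] is the document identifier
            index.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, SortedSet<String>> entry : index.entrySet()) {
            lines.add(entry.getKey() + ", " + String.join(", ", entry.getValue()));
        }
        return lines;
    }
}
```

In a real Hadoop job the framework performs the grouping by key before the reducer is called, so the reducer would only need to collect the values for one term at a time.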
- Write a Java Program To Post-Process the Results
- This program will calculate the statistics required for the document that you turn in.
- How to do this:
- Download Hadoop
- Install it locally on an openlab machine
- Run it as a single-node cluster
- Here are some tutorials
- Follow the instructions here ("Picking your java version on openlab") to get the right version of Java.
- "Hadoop Map/Reduce Tutorial"
- "Running Hadoop On Ubuntu Linux (Single-Node Cluster)"
- Hints:
- The lines of the input.txt file are passed to your mapper function one by one.
- The mapper should download the page at that URL, convert all of the text to lowercase, and generate <term, docid> pairs.
- (See Downloader.java for a sample of how to use the crawler4j library to download a single page.)
- Here is an example for WordCount.
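The mapper hints above can be sketched in plain Java as follows. This is a minimal illustration of the tokenizing logic only, under stated assumptions: the class `TermMapperSketch` and its methods are hypothetical, each input line is assumed to hold a URL and a DocID separated by whitespace, a term is assumed to be a run of letters and digits, and the page text is passed in as a string rather than downloaded (the download step would use crawler4j as shown in Downloader.java).

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the assignment files.
final class TermMapperSketch {

    /** Splits one input line into {url, docId}; assumes whitespace-separated fields. */
    static String[] parseInputLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected '<url> <docid>': " + line);
        }
        return parts;
    }

    /**
     * Lowercases the page text and emits one <term, docid> pair per token.
     * Duplicate terms are emitted as-is; de-duplication is left to the reduce step.
     */
    static List<Map.Entry<String, String>> map(String docId, String pageText) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        String lowered = pageText.toLowerCase(Locale.ROOT);
        for (String token : lowered.split("[^a-z0-9]+")) {
            if (!token.isEmpty()) { // split can yield a leading empty string
                pairs.add(new SimpleEntry<>(token, docId));
            }
        }
        return pairs;
    }
}
```

Inside an actual Hadoop mapper, each pair would be written to the output collector or context instead of being added to a list.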
- Test
- With this input.
- Don got this output.
- What to turn in:
- Outputs:
- Submit a postings.txt file in which each line contains postings of a term: <term, docid1, docid2, ...>. Before uploading this file, truncate it and only keep the first 500 lines.
- Submit a report.txt file with this content:
- Total number of distinct terms: ....
- Number of documents containing term 'is': ...
- Number of documents containing term 'satellite': ...
- Number of documents containing the first name of member 1 of your team: ...
- List of URLs containing first name of member 1 of your team: ...
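The report statistics above could be computed by a small post-processing program along these lines. This is a sketch under assumptions, not the required solution: the class `PostingsStatsSketch` and its methods are hypothetical, and each postings line is assumed to have the form `term, docid1, docid2, ...` with comma separators and no surrounding angle brackets.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the assignment files.
final class PostingsStatsSketch {

    /** Parses lines of the assumed form "term, docid1, docid2, ..." into term -> docids. */
    static Map<String, List<String>> parsePostings(List<String> lines) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s*,\\s*");
            List<String> docIds = new ArrayList<>(Arrays.asList(fields).subList(1, fields.length));
            index.put(fields[0], docIds);
        }
        return index;
    }

    /** Number of documents containing the term; 0 if the term never occurs. */
    static int documentCount(Map<String, List<String>> index, String term) {
        List<String> docIds = index.get(term.toLowerCase(Locale.ROOT));
        return docIds == null ? 0 : docIds.size();
    }
}
```

With the parsed index, the total number of distinct terms is `index.size()`, and a count such as the one for 'is' is `documentCount(index, "is")`. Listing URLs for a term would additionally need a DocID-to-URL lookup built from the input file.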
- Submitting your assignment
- Turn in your report to the dropbox created for this assignment on EEE.
- Name the file <StudentID>-<StudentID>-Assignment03.pdf
- Please include the full names of all of your group members as appropriate in the document