Using a MapReduce Architecture to make a posting list
- Goals:
- To teach you how to make a posting list with a distributed MapReduce architecture on a BIG cluster.
- Groups: This can be done in groups of 1, 2 or 3.
- Reusing code: This assignment will probably build on your task 29 code. You are welcome to use Hadoop examples as the basis for this project.
Use code found over the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. Concealing the origin of a piece of code is plagiarism.
- Discussion: Use the Message Board for general questions whose answers can benefit you and everyone.
- Write a program to be executed by Hadoop:
- Input:
- On the distributed file system there will be files in a directory called /user/common/large. These files are the result of the full crawl of flatricidepulgamitudepedia that Prof. Patterson did. A number of garbage words have been eliminated, and about 5% of the crawl didn't complete due to network errors, but the rest is there.
- The format of each file will be
<doc_id>:<word>:<word_count>
with one entry per line. Multiple entries with the same (doc_id, word) pair may be present. Their word counts should be summed in your deliverable.
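To make the summing rule concrete, here is a minimal in-memory sketch in plain Java (no Hadoop classes). It assumes that neither the doc_id nor the word contains a colon and that the count is an integer. The assignment does not say how to treat malformed lines, so skipping them is an assumption too. The class name `EntrySummer` and the "doc_id:word" key layout are made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class EntrySummer {
    /**
     * Sums word counts per (doc_id, word) pair.
     * The returned map is keyed by the string "doc_id:word" (illustrative layout).
     */
    static Map<String, Long> sumEntries(List<String> lines) {
        Map<String, Long> totals = new HashMap<>();
        for (String line : lines) {
            // Assumption: a valid line has exactly three colon-separated fields.
            String[] parts = line.trim().split(":");
            if (parts.length != 3) {
                continue; // skip malformed lines instead of failing
            }
            long count;
            try {
                count = Long.parseLong(parts[2]);
            } catch (NumberFormatException e) {
                continue; // skip lines whose count is not an integer
            }
            // merge() inserts the count, or adds it to the running total.
            totals.merge(parts[0] + ":" + parts[1], count, Long::sum);
        }
        return totals;
    }
}
```

For example, the lines `7:apple:2` and `7:apple:3` would collapse into a single total of 5 for the pair (7, apple).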
- Your work:
- Make a posting list: an alphabetized list of the words found in your corpus, the documents in which each word occurs, and the word's frequency in each document. Alphabetize according to native Java sorting functions.
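As a rough illustration of the target data structure, the sketch below builds a posting list in memory with `TreeMap`, which orders `String` keys by their natural ordering (`String.compareTo`). Under that ordering, uppercase letters sort before lowercase ones, and doc ids held as strings sort as text, so "10" comes before "9". A real Hadoop job would split this work between a mapper and a reducer; this only shows the grouping and summing logic. The class name `PostingListSketch` and the rendered line layout are hypothetical, since the assignment does not prescribe an output format.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PostingListSketch {
    /** Maps each word to a (doc_id -> summed count) map, both in natural String order. */
    static TreeMap<String, TreeMap<String, Long>> build(List<String> lines) {
        TreeMap<String, TreeMap<String, Long>> postings = new TreeMap<>();
        for (String line : lines) {
            // Assumption: <doc_id>:<word>:<word_count> with no colons inside fields.
            String[] parts = line.trim().split(":");
            if (parts.length != 3) {
                continue; // skip malformed lines
            }
            long count;
            try {
                count = Long.parseLong(parts[2]);
            } catch (NumberFormatException e) {
                continue; // skip non-integer counts
            }
            String docId = parts[0];
            String word = parts[1];
            // Group by word, then sum repeated (doc_id, word) entries.
            postings.computeIfAbsent(word, w -> new TreeMap<>())
                    .merge(docId, count, Long::sum);
        }
        return postings;
    }

    /** Renders one line per word as "word<TAB>doc:count doc:count" (hypothetical layout). */
    static String render(Map<String, TreeMap<String, Long>> postings) {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, TreeMap<String, Long>> entry : postings.entrySet()) {
            out.append(entry.getKey()).append('\t');
            String separator = "";
            for (Map.Entry<String, Long> posting : entry.getValue().entrySet()) {
                out.append(separator)
                   .append(posting.getKey()).append(':').append(posting.getValue());
                separator = " ";
            }
            out.append('\n');
        }
        return out.toString();
    }
}
```

Keeping the inner map sorted as well is a design choice, not a requirement: it makes the output deterministic, which helps when comparing results with group-mates.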
- Guides
- Submitting your assignment
- Each individual, even those working in a group, will take a quiz that asks 10 questions about the posting list. Your group-mates can help you, but they are not required to. Each person should be able to run the required programs and work with the required data individually.
- The quiz will be administered via the EEE quiz mechanism.
- The questions will be of the form:
- What is the document frequency of word X?
- What is the term frequency of word X in document Y?
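Both question types can be answered by simple lookups once a posting list is loaded. The sketch below assumes the usual definitions: the document frequency of a word is the number of distinct documents that contain it, and the term frequency of a word in a document is its summed count in that document. It also assumes the posting list is held as a map from word to (doc_id -> count). The class and method names are hypothetical.

```java
import java.util.Map;

class PostingQueries {
    /** Document frequency: number of distinct documents containing the word. */
    static int documentFrequency(Map<String, Map<String, Long>> postings, String word) {
        Map<String, Long> docs = postings.get(word);
        return docs == null ? 0 : docs.size();
    }

    /** Term frequency: summed count of the word in one document, 0 if absent. */
    static long termFrequency(Map<String, Map<String, Long>> postings,
                              String word, String docId) {
        Map<String, Long> docs = postings.get(word);
        if (docs == null) {
            return 0L; // the word never occurs in the corpus
        }
        return docs.getOrDefault(docId, 0L);
    }
}
```

On the full crawl the posting list may be too large to hold in memory on one machine, in which case the same answers can be read off by searching the job's output files for the word in question.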
- Evaluation:
- Correctness: Did you get the right answer?
- Due date: 03/11 11:59pm
- This is an assignment grade