Using a MapReduce Architecture to make a posting list

  • Goals:
    1. To teach you how to make a posting list with a distributed MapReduce architecture on a BIG cluster.
  • Groups: This can be done in groups of 1, 2 or 3.
  • Reusing code: This assignment will probably build on your task 29 code. You are welcome to use Hadoop examples as the basis for this project. Use code found on the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose its origin. Concealing the origin of a piece of code is plagiarism.
  • Discussion: Use the Message Board for general questions whose answers can benefit you and everyone.
  • Write a program to be executed by Hadoop:
    • Input:
      • On the distributed file system, the input files are in a directory called /user/common/large. They are the result of the full crawl of flatricidepulgamitudepedia that Prof. Patterson did. A number of garbage words have been eliminated, and about 5% of the crawl did not complete due to network errors, but the rest is there.
      • Each file has one entry per line, in the format <doc_id>:<word>:<word_count>. Multiple entries with the same (doc_id, word) pair may be present; their counts should be summed in your deliverable.
    • Your work:
      • Make a posting list: an alphabetized list of the words found in the corpus, each with the documents in which it occurs and its frequency in each of those documents. Alphabetize according to native Java sorting functions.
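To make the input format and the summing rule above concrete, here is a minimal in-memory sketch in plain Java. It is not a Hadoop job and not the required solution; it only illustrates the logic a mapper (parse one line) and a reducer (sum the counts per word and document) would need. The class name, the method names, and the output layout (word, a tab, then comma-separated doc_id:count pairs) are assumptions made for illustration. The sketch also assumes that doc_id contains no colon and that "native Java sorting" means the natural String ordering a TreeMap uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class; the name and layout are illustrative only.
final class PostingListSketch {

    // Builds word -> (doc_id -> summed count) from lines of the form
    // <doc_id>:<word>:<word_count>. Malformed lines are skipped.
    static TreeMap<String, TreeMap<String, Long>> build(List<String> lines) {
        TreeMap<String, TreeMap<String, Long>> postings = new TreeMap<>();
        for (String line : lines) {
            // "Map" side: split at the first and last colon
            // (assumption: doc_id itself never contains a colon).
            int first = line.indexOf(':');
            int last = line.lastIndexOf(':');
            if (first < 0 || first == last) {
                continue; // fewer than two colons, not a valid entry
            }
            String docId = line.substring(0, first);
            String word = line.substring(first + 1, last);
            long count;
            try {
                count = Long.parseLong(line.substring(last + 1).trim());
            } catch (NumberFormatException e) {
                continue; // count is not a number, skip the entry
            }
            // "Reduce" side: repeated (doc_id, word) pairs are summed.
            postings.computeIfAbsent(word, w -> new TreeMap<>())
                    .merge(docId, count, Long::sum);
        }
        return postings;
    }

    // Formats one posting as: word<TAB>doc:count,doc:count (assumed layout).
    static String format(String word, Map<String, Long> docs) {
        StringBuilder sb = new StringBuilder(word).append('\t');
        String sep = "";
        for (Map.Entry<String, Long> e : docs.entrySet()) {
            sb.append(sep).append(e.getKey()).append(':').append(e.getValue());
            sep = ",";
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "7:apple:2", "3:zebra:1", "7:apple:3", "3:apple:1");
        // TreeMap iterates its keys in natural String order.
        build(sample).forEach((word, docs) ->
                System.out.println(format(word, docs)));
    }
}
```

Run on the four sample lines in main, this prints one line for apple (document 3 with count 1, document 7 with count 5) followed by one line for zebra. In an actual Hadoop job the same two steps would be split between a mapper that emits the word as the key and a reducer that sums the counts for each document.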
  • Guides
  • Submitting your assignment
    1. Each individual, including those working in a group, will take a quiz with 10 questions about the posting list. Your group-mates can help you, but they are not required to. Each person should be able to run the required programs and work with the required data individually.
    2. The quiz will be administered via the EEE quiz mechanism.
    3. The questions will be of the form:
      • What is the document frequency of word X?
      • What is the term frequency of word X in document Y?
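The two question types above can be answered directly from a posting list once it is loaded. The sketch below assumes the usual information-retrieval definitions: the document frequency of a word is the number of distinct documents that contain it, and the term frequency of a word in a document is its summed count in that document. The class name, the method names, and the nested-map representation (word to doc_id to count) are hypothetical choices for illustration, not something the assignment prescribes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical query helpers over a posting list held in memory as
// word -> (doc_id -> count).
final class PostingListQueries {

    // Number of distinct documents in which the word occurs.
    static int documentFrequency(Map<String, Map<String, Long>> postings,
                                 String word) {
        Map<String, Long> docs = postings.get(word);
        return docs == null ? 0 : docs.size();
    }

    // Summed count of the word in one document; 0 if it does not occur.
    static long termFrequency(Map<String, Map<String, Long>> postings,
                              String word, String docId) {
        Map<String, Long> docs = postings.get(word);
        return docs == null ? 0L : docs.getOrDefault(docId, 0L);
    }

    public static void main(String[] args) {
        Map<String, Map<String, Long>> postings = new TreeMap<>();
        Map<String, Long> apple = new TreeMap<>();
        apple.put("3", 1L);
        apple.put("7", 5L);
        postings.put("apple", apple);

        System.out.println(documentFrequency(postings, "apple"));
        System.out.println(termFrequency(postings, "apple", "7"));
    }
}
```

With the small sample in main, apple occurs in two documents and has a count of 5 in document 7, so the program prints 2 and then 5. On the full corpus the posting list may be too large to hold in memory, in which case searching the output files for the line of the word in question is a simpler way to answer the same questions.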
  • Evaluation:
    1. Correctness: Did you get the right answer?
    2. Due date: 03/11 11:59pm
    3. This is an assignment grade