Using a MapReduce Architecture on a simple task

  • Goals:
    1. To teach you how to program for a MapReduce architecture.
    2. To teach you how to use Hadoop before taking on a bigger scale task.
  • Groups: This is a solo assignment.
  • Reusing code: This assignment shouldn't require you to reuse any of your own code. You are welcome to use Hadoop examples as the basis for this project. Use code found over the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. Concealing the origin of a piece of code is plagiarism.
  • Discussion: Use the Message Board for general questions whose answers can benefit you and everyone.
  • Write a program to be executed by Hadoop:
    • Input:
      • If the last digit of your student id is X, then your input is Y:
        X Y
        0 http://www.flatricidepulgamitudepedia.org/gutenberg/2/6/0/3/26030/26030-8.txt
        1 http://www.flatricidepulgamitudepedia.org/gutenberg/2/6/0/3/26031/26031-8.txt
        2 http://www.flatricidepulgamitudepedia.org/gutenberg/2/6/5/6/26565/26565.txt
        3 http://www.flatricidepulgamitudepedia.org/gutenberg/2/6/5/6/26566/26566-8.txt
        4 http://www.flatricidepulgamitudepedia.org/gutenberg/2/7/2/8/27280/27280-8.txt
        5 http://www.flatricidepulgamitudepedia.org/gutenberg//2/8/8/0/28804/28804-8.txt
        6 http://www.flatricidepulgamitudepedia.org/gutenberg//2/8/8/0/28805/28805-8.txt
        7 http://www.flatricidepulgamitudepedia.org/gutenberg/2/8/8/1/28812/28812-8.txt
        8 http://www.flatricidepulgamitudepedia.org/gutenberg/2/8/8/1/28813/28813-8.txt
        9 http://www.flatricidepulgamitudepedia.org/gutenberg/2/8/8/1/28819/28819-8.txt
    • Output:
      • A text file with an alphabetized list of the characters found in your document and their frequency counts. Please count all valid Unicode characters. Alphabetize according to native Java sorting functions.
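      • To make the expected output concrete, here is a plain-Java sketch of the counting and ordering the assignment asks for. This is not the Hadoop job itself, and the class and method names are my own; it just shows that iterating Unicode code points and storing them in a TreeMap gives the alphabetized (native Java ordering) list of character frequencies:

```java
import java.util.Map;
import java.util.TreeMap;

public class CharFrequency {
    // Count every Unicode code point in the text. A TreeMap keeps its
    // keys in natural (native Java) sorted order, which is the
    // alphabetization the assignment asks for.
    public static TreeMap<String, Integer> count(String text) {
        TreeMap<String, Integer> freq = new TreeMap<>();
        text.codePoints().forEach(cp -> {
            String key = new String(Character.toChars(cp));
            freq.merge(key, 1, Integer::sum);
        });
        return freq;
    }

    public static void main(String[] args) {
        // One "character<TAB>count" line per character, in sorted order
        for (Map.Entry<String, Integer> e : count("aba c").entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```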
  • One way to do this (there are multiple ways to do this, perhaps some that are more efficient):
    • Create a sample MapReduce .jar
      • Create a fresh Eclipse workspace
      • Download the source code for hadoop, hadoop-2.2.0-src.tar.gz, and the binary distribution for hadoop, hadoop-2.2.0.tar.gz
      • Import a new "Maven" project from the source code download. (If that's not an option import the m2e software into Eclipse)
      • Create a new Java project called "CharCount"
      • Copy the code from hadoop-mapreduce-examples/src/main/java/org.apache.hadoop.examples/WordCount.java to CharCount.java in your new project
      • Add the projects "hadoop-common" and "hadoop-mapreduce-client-core" to your new project
      • Make any other fixes you need so that CharCount compiles, but for now it should still count words (despite its name)
      • Try to run CharCount as a Java Application. It will print an error message ("Usage...") and quit, but running it creates a run configuration
      • (Updated 2/17/14): Export the project as a runnable jar using the run configuration you just implicitly created, but for library handling, pick "Copy required libraries...". We are going to use the libraries already on the cluster to avoid conflicts. Keep track of where the jar goes.
    • Put everything in the right place
      • Obtain an input file whose words you are going to count
      • Move your .jar, your input file, and the hadoop binary to openlab.
      • Log in to openlab
      • Uncompress the hadoop binary file. If the file name ends in ".tar.gz", you can use the command "tar xvofz <file_name>" to do that
      • Edit the file etc/hadoop/core-site.xml so that the location of the distributed filesystem is exposed, by adding this XML property:
        <property>
          <name>fs.defaultFS</name>
          <value>hdfs://francis-griffin.ics.uci.edu:8020/</value>
        </property>
      • See what is in the distributed file system with the command: bin/hdfs dfs -ls /
        • (Update 2/17/14): If the warnings annoy you, you can replace the file lib/native/libhadoop.so.1.0.0 with the one located here. I built that one natively on openlab.
      • Make your own directory in the distributed file system with the command: bin/hdfs dfs -mkdir /user/<user_name>
      • See what is in your distributed file system directory with the command: bin/hdfs dfs -ls /user/<user_name>
      • You can also see what's in the distributed file system by looking at the filesystem in a web browser
      • Make your own input directory in the distributed file system with the command: bin/hdfs dfs -mkdir /user/<user_name>/input
      • Move your input file from openlab into the distributed file system using this command: bin/hdfs dfs -copyFromLocal input.txt /user/<user_name>/input
      • Check to make sure the file arrived there by using the -ls command or the filesystem browser
      • If you need to delete a file, use the command -rm in place of -mkdir or -ls
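      • For reference, the HDFS steps above gathered into one sequence (replace <user_name> with your own; all bin/hdfs paths are relative to the unpacked hadoop-2.2.0 directory, and the commands run against the course cluster):

```shell
# Unpack the Hadoop binary distribution and move into it
tar xvofz hadoop-2.2.0.tar.gz
cd hadoop-2.2.0

# Inspect the root of the distributed filesystem
bin/hdfs dfs -ls /

# Create your own directory and an input directory under it
bin/hdfs dfs -mkdir /user/<user_name>
bin/hdfs dfs -mkdir /user/<user_name>/input

# Copy your input file in, then verify it arrived
bin/hdfs dfs -copyFromLocal input.txt /user/<user_name>/input
bin/hdfs dfs -ls /user/<user_name>/input

# Remove a file if you need to start over
bin/hdfs dfs -rm /user/<user_name>/input/input.txt
```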
    • Now run your sample program
      • bin/hadoop jar CharCount.jar /user/<user_name>/input /user/<user_name>/output
      • If everything went okay, then your answer should be in the output directory. View it in the file browser
    • Now fix your sample program
      • Change your code to count characters and do this again.
      • Here is what Prof. Patterson got when counting characters on the test input file linked above: test output
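      • The change itself happens in the Mapper: where WordCount's map() tokenizes each line into words and emits (word, 1) pairs, CharCount should emit (character, 1) for every Unicode code point. Here is a plain-Java sketch of that map/reduce logic (the names are illustrative, and the real version goes inside Hadoop's Mapper and Reducer classes, which need the Hadoop jars to compile):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CharCountSketch {
    // Map step: instead of WordCount's StringTokenizer over words,
    // emit a (character, 1) pair for every Unicode code point in the line.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        line.codePoints().forEach(cp ->
            pairs.add(new SimpleEntry<>(new String(Character.toChars(cp)), 1)));
        return pairs;
    }

    // Shuffle + reduce step: group the pairs by key and sum the counts.
    // A TreeMap alphabetizes keys with Java's natural ordering.
    public static TreeMap<String, Integer> reduce(
            List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            totals.merge(p.getKey(), p.getValue(), Integer::sum);
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(reduce(map("to be or not to be")));
    }
}
```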
  • Guides
  • Submitting your assignment
    1. We are going to use checkmate.ics.uci.edu to submit this assignment.
    2. Make the file name Task27.txt
    3. Your submission should be a single txt file. At the top, put your name, student id, and any information that you want to give us about the assignment. Then give the alphabetized list, one character per line, with the frequency of each character in your document.
  • Evaluation:
    1. Correctness: Did you get the right answer?
    2. Due date: 02/21 11:59pm
    3. This is an assignment grade