INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015
Assignment #3
Crawling
Goal:
Using the specified library and program skeleton, write a program to crawl the domain www.ics.uci.edu
in order to answer a series of questions specified below.
Crawling library: http://code.google.com/p/crawler4j/
or
https://github.com/Mondego/crawler4py
Project skeleton: http://www.ics.uci.edu/~kay/courses/i141/hw/Assignment3.zip
General specifications:
- This assignment may be done individually or in groups of 2 or 3. The expectations are the same for every group size: solo workers save time by not having to talk to anyone, while group members can split the work at the cost of added communication. All group members receive the same score (except possibly as modified by the TA in the interview phase; see below).
- You may use text processing code that you or any classmate wrote for the previous assignment. You may not use crawler code written by non-group-member classmates. Use code found over the Internet at your own peril—it may not do exactly what the assignment requires. If you do end up using code you find on the Internet, you must disclose the origin of the code. As stated in the collaboration guidelines, concealing the origin of a piece of code is plagiarism.
- Use Piazza for general questions whose answers can benefit everybody.
- You may use Java, Python, or Scheme/Racket for this assignment. As before, Java is the safest choice because the assignment is written with Java in mind and the skeleton is in Java. This time there are some resources provided in Python; you would still have to translate the skeleton. The Python resources, being newer, may be less robust (meaning you should allow extra time in case of snags), but there may be extra credit for identifying and documenting specific bugs in them.
- Your task is to fill in the one method in the skeleton according to its specification. You may create additional methods and classes where necessary, provided the interface is the same.
More specific specifications:
- (Very important for getting credit) Set the name of your crawler's User Agent to this precise string: UCI Inf141-CS121 crawler StudentID(s), where the last part is the eight-digit student ID of each team member, separated by one space.
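As a rough sketch (not part of the provided skeleton), the required string could be assembled like this; the student IDs shown are placeholders, not real IDs:

```python
def build_user_agent(student_ids):
    """Return the User Agent string in the exact format the assignment requires."""
    return "UCI Inf141-CS121 crawler " + " ".join(student_ids)

# Hypothetical two-person team with placeholder IDs:
ua = build_user_agent(["12345678", "87654321"])
print(ua)  # UCI Inf141-CS121 crawler 12345678 87654321
```

In crawler4j, the resulting string would be passed to the crawl configuration; consult the library's documentation for the exact setter.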
- Start with the seed http://www.ics.uci.edu and crawl from there. Crawl only the domain ics.uci.edu and all of its subdomains (anything.ics.uci.edu).
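The domain restriction can be checked by inspecting each URL's hostname. This is a minimal sketch of such a filter (the function name is our own, not part of the skeleton's interface):

```python
from urllib.parse import urlparse

def in_crawl_domain(url):
    """True only for ics.uci.edu itself or any of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "ics.uci.edu" or host.endswith(".ics.uci.edu")

print(in_crawl_domain("http://www.ics.uci.edu/~kay/"))  # True
print(in_crawl_domain("http://www.uci.edu/"))           # False
```

Note the leading dot in the suffix check: it prevents an unrelated host such as notics.uci.edu from slipping through.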
- (Very important for politeness) Wait at least 300ms between page requests to the same subdomain.
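One way to honor the 300 ms rule is to track the time of the last request to each subdomain and wait out any remainder before the next request. The helper below is a hypothetical sketch, not part of the skeleton (crawler4j has its own politeness-delay setting):

```python
import time

class PolitenessDelay:
    """Tracks per-subdomain request times to enforce a minimum gap."""

    def __init__(self, delay_seconds=0.3):
        self.delay = delay_seconds
        self.last_request = {}  # subdomain -> timestamp of last request

    def wait_needed(self, subdomain, now=None):
        """Seconds to wait before the next request to this subdomain."""
        now = time.monotonic() if now is None else now
        last = self.last_request.get(subdomain)
        if last is None:
            return 0.0
        return max(0.0, self.delay - (now - last))

    def record(self, subdomain, now=None):
        """Remember that a request to this subdomain was just made."""
        self.last_request[subdomain] = time.monotonic() if now is None else now
```

Typical use: call wait_needed, sleep for that long, issue the request, then call record.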
- We will verify execution by checking the server logs of some pages in the domain; these pages lie on the path of any correctly written crawler. If we don't find log entries with your student ID, that means your crawler didn't run correctly or you didn't set its name correctly. If we can't verify that your crawler ran successfully, we will assume that it didn't.
- At points, this assignment may be underspecified (i.e., it may not fully describe what to do in every situation). In those cases, post your questions on Piazza or check with the TA. For minor issues, make your own assumptions and document them.
Questions:
- How much time did it take to crawl the entire domain?
- How many unique pages did you find in the entire domain? (Uniqueness is established by the URL, not the page's content.)
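Since uniqueness is by URL, you will want a canonical form before counting; at minimum, two URLs that differ only in their fragment should count once. A small sketch under that assumption:

```python
from urllib.parse import urldefrag

def canonical(url):
    """Drop the fragment so page and page#section count as one URL."""
    return urldefrag(url)[0]

seen = set()
for url in ["http://www.ics.uci.edu/about",
            "http://www.ics.uci.edu/about#history",
            "http://www.ics.uci.edu/grad"]:
    seen.add(canonical(url))
print(len(seen))  # 2
```

Whether to also normalize trailing slashes or query strings is one of the underspecified points to document in your README.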
- How many subdomains did you find? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. The file should be called Subdomains.txt, and its content should be lines containing the URL, a comma, a space, and the number.
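Formatting those lines is straightforward once you have a mapping from subdomain to page count. A sketch (it assumes the "URL" on each line is the subdomain's hostname; if you interpret it as a full URL, adjust accordingly and note the choice in your README):

```python
def subdomain_lines(counts):
    """Format {subdomain: page_count} as 'URL, count' lines, alphabetically."""
    return ["%s, %d" % (sub, n) for sub, n in sorted(counts.items())]

lines = subdomain_lines({"www.ics.uci.edu": 120, "archive.ics.uci.edu": 45})
print("\n".join(lines))
# archive.ics.uci.edu, 45
# www.ics.uci.edu, 120
```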
- What is the longest page in terms of number of words? (Don't count HTML markup as words.)
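Counting words without counting markup means stripping the tags first. The crude regex sketch below illustrates the idea; a real solution should use a proper HTML parser, and your exact definition of a "word" is an assumption to document:

```python
import re

def word_count(html):
    """Count words in a page's text after crudely removing HTML tags."""
    text = re.sub(r"<[^>]+>", " ", html)        # replace markup with spaces
    return len(re.findall(r"[A-Za-z0-9']+", text))

print(word_count("<html><body><p>Hello crawling world</p></body></html>"))  # 3
```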
- What are the 500 most common words in this domain? (Ignore English stop words, which can be found, for example, at http://www.ranks.nl/stopwords.) Submit the list of common words ordered by frequency (and alphabetically for words with the same frequency) in a file called CommonWords.txt.
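The required ordering (descending frequency, then alphabetical among ties) can be expressed with a single sort key. A sketch, using a placeholder stop-word set in place of the full list from the site above:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and"}  # placeholder; load the full list instead

def top_words(tokens, n=500):
    """Most common non-stop words, by descending count then alphabetically."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

tokens = "the crawler found the page and the crawler stopped".split()
print(top_words(tokens, 3))  # [('crawler', 2), ('found', 1), ('page', 1)]
```

Negating the count in the key gives descending frequency while the secondary string key stays ascending, which is exactly the tie-break the assignment asks for.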
Submitting your assignment:
You will submit your work via Checkmate. For groups of two or three, just one member should submit all parts of the assignment; the names of all group members must appear near the top of every submitted file.
First, submit a single zip file that matches the structure of the project skeleton and contains your code in the src folder. Second, submit a plain text file called Answers.txt with your answers to questions 1, 2, and 4. Third, submit the Subdomains.txt file described above. Fourth, submit the CommonWords.txt file described above. Fifth, if there is anything else you wish to communicate to the TA, such as implementation assumptions you made, place it in an additional README.txt file included in your source code zip file.
Evaluation criteria:
Your assignment will be graded on the following three criteria.
- Correctness:
(a) Did you crawl the domain correctly? We will verify that in our servers' logs.
(b) Does your crawler pass our tests of the crawl method?
(c) Are your answers to the questions reasonable?
(Note that correct answers are not valid without evidence of correct crawling. Answers from different crawlers will vary due to a number of factors; "correctness" of answers will be based on how reasonable they are.)
- Style/documentation/aesthetics: Is the program clearly documented and well written?
- Understanding: You will have an in-person meeting with the TA where you will be asked questions about your crawler's implementation. All members of the group are expected to demonstrate solid understanding of the crawler. In cases where understanding is clearly lacking, the scores will reflect that.
David G. Kay, kay@uci.edu
Wednesday, February 4, 2015 12:42 PM