INFX 141 / CS 121 • DAVID G. KAY • UC IRVINE • WINTER 2015

Assignment #3
Crawling

Goal: Using the specified library and program skeleton, write a program to crawl the domain www.ics.uci.edu in order to answer a series of questions specified below.

Crawling library: http://code.google.com/p/crawler4j/ or https://github.com/Mondego/crawler4py

Project skeleton: http://www.ics.uci.edu/~kay/courses/i141/hw/Assignment3.zip

General specifications:

More specific specifications:

Questions:

  1. How much time did it take to crawl the entire domain?
  2. How many unique pages did you find in the entire domain? (Uniqueness is established by the URL, not the page's content.)
  3. How many subdomains did you find? Submit the list of subdomains ordered alphabetically and the number of unique pages detected in each subdomain. The file should be called Subdomains.txt, and each line should contain the subdomain's URL, a comma, a space, and the number of unique pages found in that subdomain.
  4. What is the longest page in terms of number of words? (Don't count HTML markup as words.)
  5. What are the 500 most common words in this domain? (Ignore English stop words, which can be found, for example, at http://www.ranks.nl/stopwords.) Submit the list of common words ordered by frequency (and alphabetically for words with the same frequency) in a file called CommonWords.txt.
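As a rough illustration of questions 4 and 5, the sketch below counts words in a page's text while skipping HTML markup, then ranks words by frequency with alphabetical order breaking ties. It uses only the Python standard library and does not depend on either crawling library. The names TextExtractor, words_in_html, and top_words are hypothetical, the stop-word set is a tiny placeholder rather than a full list, and treating a "word" as a run of letters or digits is an assumption that you may reasonably define differently.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Placeholder stop words (assumption): a real run would load a complete list.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def words_in_html(html):
    """Return the lowercase words of a page, with markup excluded."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps adjacent elements from fusing into one word;
    # the trade-off is that a tag in the middle of a word splits that word.
    text = " ".join(parser.chunks)
    # Assumption: a word is a run of ASCII letters or digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def top_words(pages, n=500):
    """Return the n most common non-stop words as (word, count) pairs."""
    counts = Counter()
    for html in pages:
        counts.update(w for w in words_in_html(html) if w not in STOP_WORDS)
    # Highest count first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]
```

Under the same assumptions, the length of a page for question 4 would be len(words_in_html(html)), so the longest page is the one that maximizes that value over all crawled pages.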

Submitting your assignment: You will submit your work via Checkmate. For groups of two or three, just one of you should submit all parts of the assignment; the names of all group members must appear near the top of every submitted file.

First, submit a single zip file that matches the structure of the project skeleton and contains your code in the src folder. Second, submit a plain text file called Answers.txt with your answers to questions 1, 2, and 4. Third, submit the Subdomains.txt file described above. Fourth, submit the CommonWords.txt file described above. Fifth, if there is anything else you wish to communicate to the TA, such as implementation assumptions made, this should be placed into an additional README.txt file included in your source code zip file.
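The lines of Subdomains.txt (question 3) could be produced along the following lines, given the list of URLs your crawler visited. This is a minimal sketch with several assumptions: the function name subdomain_report is hypothetical, uniqueness is taken as exact string equality of URLs (whether to normalize fragments or trailing slashes first is your call), every host ending in ics.uci.edu is counted as a subdomain, and the "URL" on each line is written as http:// followed by the host name.

```python
from collections import Counter
from urllib.parse import urlsplit


def subdomain_report(urls, domain="ics.uci.edu"):
    """Return 'URL, count' lines, one per subdomain, ordered alphabetically."""
    # Uniqueness is established by the URL, so duplicates are dropped first.
    unique_urls = set(urls)
    counts = Counter()
    for url in unique_urls:
        host = (urlsplit(url).hostname or "").lower()
        # Keep only hosts inside the crawled domain (assumed matching rule).
        if host == domain or host.endswith("." + domain):
            counts[host] += 1
    # Assumed line format: scheme plus host, a comma, a space, and the count.
    return ["http://{}, {}".format(host, count)
            for host, count in sorted(counts.items())]
```

For example, writing the result with "\n".join(subdomain_report(visited_urls)) to a file named Subdomains.txt would give one subdomain per line, where visited_urls stands for whatever collection of URLs your crawler recorded.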

 

Evaluation criteria: Your assignment will be graded on the following three criteria.

  1. Correctness: (a) Did you crawl the domain correctly? We will verify that in our servers’ logs. (b) Does your crawler pass our tests of the crawl method? (c) Are your answers to the questions reasonable? (Note that correct answers are not valid without evidence of correct crawling. Answers by different crawlers will vary due to a number of factors. “Correctness” of answers will be based on how reasonable they are.)
  2. Style/documentation/aesthetics: Is the program clearly documented and well written?
  3. Understanding: You will have an in-person meeting with the TA where you will be asked questions about your crawler’s implementation. All members of the group are expected to demonstrate solid understanding of the crawler. In cases where understanding is clearly lacking, the scores will reflect that.

David G. Kay, kay@uci.edu
Wednesday, February 4, 2015 12:42 PM