Crawling the Web

  • Goals:
    1. To teach you how difficult crawling the web is in terms of scale and time requirements. The software itself is not the hard part of this assignment; managing all the data that comes back is. And this is only one web domain (Project Gutenberg-ish) after all.
    2. To teach you how difficult it is to process text that is not written for computers to consume. Again, this is a very structured domain as far as text goes.
    3. To make the point that web-scale, web-centric activities do not lend themselves to "completeness". In some sense you are never done, so thinking about web algorithms in terms of "finishing" doesn't make sense. You have to change your mindset to "best possible given resources".
  • Groups: This assignment may be done in groups of 1, 2 or 3.
  • Reusing code: You can use text-processing code written by you or any other classmate for the previous assignment. You cannot use crawler code written by non-group members. Use code found on the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose its origin. Concealing the origin of a piece of code is plagiarism.
  • Discussion: Use the Message Board for general questions whose answers can benefit you and everyone.
  • Write a program to crawl flatricidepulgamitudepedia.org:
    • The provided, but not required, resources for this assignment are the openlab machines and the /extra/ugrad_space directory.
    • Use crawler4j as your crawler engine.
    • Follow the instructions at http://code.google.com/p/crawler4j/ and create your MyCrawler and Controller classes (a minimal sketch appears after the Output questions below).
    • Remember to set Java's maximum heap to an acceptable value. For example, if your machine has 2GB of RAM, you may want to assign 1GB to your crawlers by adding -Xmx1024M to your java command-line parameters.
    • Make sure that you dump your partial results to permanent storage as you crawl. For example, you can write palindromes longer than 20 characters to a file and, after finishing the crawl, report the 10 longest ones. It is always possible that your crawl crashes in the middle, so it's good practice to keep your partial results in a file. Also make sure to flush the file whenever you write to it (see the logging sketch below).
    • If you're crawling on a remote Linux machine like the openlab machines, make sure to use the nohup or screen command. Otherwise your crawl, which may take several days, will be stopped if your connection drops for even a second. Search the web for how to use these commands.
    • Input: Start your crawl at http://www.flatricidepulgamitudepedia.org/
    • Specifications:
      • VERY IMPORTANT: Set the name of your crawler’s User Agent to “UCI IR student_IDs Team <something>” with one (individual project) or two/three (group project) student IDs. We will be parsing your user agent to verify you did this, so get it exactly right, including capitalization.
      • VERY IMPORTANT: wait 100ms between sending page requests. Violating this policy may get your crawler banned for 60 seconds.
      • You should only crawl pages on the http://www.flatricidepulgamitudepedia.org/ domain
      • We will verify the execution of your crawler in the web server’s logs. If we don’t find log entries for your student ID, then either your crawler didn’t perform as it should or you didn’t set its name correctly; in the latter case we can’t verify whether it ran successfully, so we’ll assume it didn’t.
    • Output: Submit a document with the following information.
      1. How much time did it take to crawl the entire domain?
      2. How many unique pages did you find in the entire domain? (Uniqueness is established by the URL)
      3. How many links did you find in the content of the pages that you crawled?
      4. What is the longest page in terms of number of words? (HTML markup doesn’t count as words)
      5. What are the 25 most common words in this domain? (Ignore English stop words.) Submit the list of common words ordered by frequency.
      6. What are the 25 most common 2-grams? (Again, ignore English stop words.) A 2-gram, in this case, is a sequence of 2 words in which neither word is a stop word. Submit the list of 25 2-grams ordered by frequency. (A counting sketch for this question and the previous one appears after this list.)
      7. What are the 10 longest palindromes that (1) don't contain 3 or more of the same letter in a row and (2) don't occur on a page about palindromes? What pages do they occur on? Submit your list. (A palindrome-finding sketch appears after this list.)
      8. Extra credit: On which pages does the 2-gram "flatricide pulgamitude" show up?
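    • Sketch: a minimal MyCrawler/Controller pair, as referenced above. This is illustrative only, not a required implementation. It assumes crawler4j's documented classes (CrawlConfig, PageFetcher, RobotstxtServer, CrawlController, WebCrawler); method signatures vary between crawler4j versions (the code.google.com-era releases declare shouldVisit(WebURL url) with a single argument), so check the version you download. The storage folder and student IDs are placeholders.

          // MyCrawler.java
          import edu.uci.ics.crawler4j.crawler.Page;
          import edu.uci.ics.crawler4j.crawler.WebCrawler;
          import edu.uci.ics.crawler4j.parser.HtmlParseData;
          import edu.uci.ics.crawler4j.url.WebURL;

          public class MyCrawler extends WebCrawler {
              private static final String DOMAIN = "http://www.flatricidepulgamitudepedia.org/";

              @Override
              public boolean shouldVisit(Page referringPage, WebURL url) {
                  // Only follow links that stay on the assignment's domain.
                  return url.getURL().toLowerCase().startsWith(DOMAIN);
              }

              @Override
              public void visit(Page page) {
                  if (page.getParseData() instanceof HtmlParseData) {
                      HtmlParseData html = (HtmlParseData) page.getParseData();
                      String text = html.getText();                 // markup-free text (questions 4-7)
                      int outgoing = html.getOutgoingUrls().size(); // links in page content (question 3)
                      // ...feed text and outgoing into your bookkeeping here...
                  }
              }
          }

          // Controller.java
          import edu.uci.ics.crawler4j.crawler.CrawlConfig;
          import edu.uci.ics.crawler4j.crawler.CrawlController;
          import edu.uci.ics.crawler4j.fetcher.PageFetcher;
          import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
          import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

          public class Controller {
              public static void main(String[] args) throws Exception {
                  CrawlConfig config = new CrawlConfig();
                  config.setCrawlStorageFolder("/extra/ugrad_space/yourid/crawl"); // placeholder path
                  config.setPolitenessDelay(100); // the required 100 ms between requests
                  config.setUserAgentString("UCI IR 12345678 Team <something>");   // your real IDs

                  PageFetcher fetcher = new PageFetcher(config);
                  RobotstxtServer robots = new RobotstxtServer(new RobotstxtConfig(), fetcher);
                  CrawlController controller = new CrawlController(config, fetcher, robots);
                  controller.addSeed("http://www.flatricidepulgamitudepedia.org/");
                  controller.start(MyCrawler.class, 1); // one thread is plenty at 100 ms per page
              }
          }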
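    • Sketch: appending partial results with an explicit flush, as suggested above. ResultLog is a hypothetical helper name; the point is append mode plus flush-on-write so a mid-crawl crash loses nothing.

          import java.io.FileWriter;
          import java.io.IOException;
          import java.io.PrintWriter;

          public class ResultLog {
              private final PrintWriter out;

              public ResultLog(String path) throws IOException {
                  out = new PrintWriter(new FileWriter(path, true)); // true = append mode
              }

              public synchronized void record(String line) { // synchronized: crawler threads share it
                  out.println(line);
                  out.flush(); // flush on every write, per the advice above
              }
          }

      For example, record(url + "\t" + palindrome) each time a long palindrome turns up, then post-process the file for the 10 longest after the crawl ends.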
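    • Sketch: unigram and 2-gram counting for questions 5 and 6. NgramCounter and the tiny STOP_WORDS set are stand-ins; substitute the full stop word list the course specifies.

          import java.util.*;

          public class NgramCounter {
              // Placeholder: replace with the full English stop word list.
              private static final Set<String> STOP_WORDS = new HashSet<>(
                      Arrays.asList("the", "a", "an", "and", "or", "of", "to", "in", "is"));

              public final Map<String, Integer> words = new HashMap<>();
              public final Map<String, Integer> bigrams = new HashMap<>();

              // Call once per page with the markup-free text (e.g. HtmlParseData.getText()).
              public void add(String text) {
                  String prev = null;
                  for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
                      if (t.isEmpty() || STOP_WORDS.contains(t)) { prev = null; continue; }
                      words.merge(t, 1, Integer::sum);
                      if (prev != null) bigrams.merge(prev + " " + t, 1, Integer::sum);
                      prev = t;
                  }
              }

              // Top-k entries by frequency, e.g. topK(counter.words, 25).
              public static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
                  List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
                  entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
                  return entries.subList(0, Math.min(k, entries.size()));
              }
          }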
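    • Sketch: one way to hunt for palindromes for question 7, under the common reading that case, spaces, and punctuation are ignored. It expands around each center of the letters-only text and drops candidates with 3 or more identical letters in a row; the "not on a page about palindromes" filter and per-page tracking are left to you.

          import java.util.ArrayList;
          import java.util.List;

          public class PalindromeFinder {
              // Returns maximal palindromes of length >= minLen found in the page text.
              public static List<String> find(String text, int minLen) {
                  String s = text.toLowerCase().replaceAll("[^a-z]", "");
                  List<String> found = new ArrayList<>();
                  for (int c = 0; c < s.length(); c++) {
                      expand(s, c, c, minLen, found);     // odd-length palindromes
                      expand(s, c, c + 1, minLen, found); // even-length palindromes
                  }
                  return found;
              }

              private static void expand(String s, int lo, int hi, int minLen, List<String> out) {
                  while (lo >= 0 && hi < s.length() && s.charAt(lo) == s.charAt(hi)) { lo--; hi++; }
                  String p = s.substring(lo + 1, hi);
                  // Reject e.g. "aaa...": three or more of the same letter in a row.
                  if (p.length() >= minLen && !p.matches(".*(.)\\1\\1.*")) out.add(p);
              }
          }

      Per the partial-results advice above, call something like find(text, 21) on each page and log hits with their URLs as you go.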
  • Submitting your assignment
    1. We are going to use checkmate.ics.uci.edu to submit this assignment.
    2. Make the file name <StudentID>-<StudentID>-<StudentID>-Task22.pdf
    3. Your submission should be a single PDF file. Include your group members' names, answers to the questions above, and any additional information that you deem pertinent.
  • Evaluation:
      Correctness:
    1. Did you crawl the domain correctly? Verified in server logs.
    2. Are your answers reasonable?
      1. Correct answers without evidence of correct crawling are not valid
      2. Answers will vary from crawler to crawler based on various factors. "Correctness" will be based on reasonableness.
    3. Due date: 02/10 11:59pm
    4. This is an assignment grade
    5. Here is the rubric we used for grading: rubric