Informatics 141: Information Retrieval

Assignment 03

Winter 2009

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine


Due 2/2/2009

  1. Goals:
    1. This assignment is designed to:
      1. Teach you how difficult crawling the web is in terms of scale and time. The software itself is not the hard part of this assignment; managing all the data is, and this is only one web site (Wikipedia) after all.
      2. Teach you how difficult it is to process text that was not written for computers to consume. Again, as web text goes, this is a very structured domain.
      3. Make the point that web-scale, web-centric activities do not lend themselves to "completeness". In some sense you are never done, so thinking about web algorithms in terms of "finishing" doesn't make sense. You have to change your mindset to "best possible" given your resources.
  2. Administration:
    1. You may work in teams of 1, 2 or 3.
  3. Write a Java Program
    1. Write a program to crawl the web.
    2. Suggested approach
      1. Use crawler4j as your crawling engine. It is a java library.
      2. Follow the instructions at http://code.google.com/p/crawler4j/ and create your MyCrawler and Controller classes (a minimal sketch appears at the end of this list).
      3. Remember to set the maximum Java heap size to an acceptable value. For example, if your machine has 2GB of RAM, you may want to assign 1GB to your crawlers by adding -Xmx1024M to your java command-line parameters.
      4. Make sure that you dump your partial results to permanent storage as you crawl. For example, you can write palindromes longer than 20 characters to a file and, after the crawl finishes, report the 10 longest. It is always possible that your crawl crashes in the meantime or that you run past the deadline, so it is good practice to keep your partial results in a file. Also make sure to flush the file whenever you write to it.
      5. If you're crawling on a remote Linux machine, such as the openlab machines, make sure to use the nohup command. Otherwise your crawl, which may take several days, will be stopped if your connection drops for even a second. Search the web for how to use this command.
      6. Attend the discussion class. All of the above suggestions and other hints will be covered there.
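      7. Below is a minimal sketch of the two classes from item 2. It assumes an early crawler4j API (a WebCrawler subclass overriding shouldVisit(WebURL) and visit(Page), and a CrawlController constructed with a storage folder); the exact class and method signatures may differ in the version you download, so check the examples on the project page. The regular expression, seed URL, output file name and thread count are placeholders to replace with your own.

         // MyCrawler.java
         import java.io.FileWriter;
         import java.io.IOException;
         import java.util.regex.Pattern;

         import edu.uci.ics.crawler4j.crawler.Page;
         import edu.uci.ics.crawler4j.crawler.WebCrawler;
         import edu.uci.ics.crawler4j.url.WebURL;

         public class MyCrawler extends WebCrawler {

             // Only follow English Wikipedia content pages
             // (this is the "regular expression" input from the Inputs list below).
             private static final Pattern FILTER =
                     Pattern.compile("^http://en\\.wikipedia\\.org/wiki/.*");

             public boolean shouldVisit(WebURL url) {
                 return FILTER.matcher(url.getURL()).matches();
             }

             public void visit(Page page) {
                 String url = page.getWebURL().getURL();
                 // getText() stands in for whatever text-extraction accessor
                 // your crawler4j version provides.
                 String text = page.getText();

                 // ... scan 'text' for palindromes, lipograms and rhopalics ...

                 // Dump partial results to permanent storage and flush right
                 // away, so a crash or a missed deadline loses nothing.
                 try {
                     FileWriter out = new FileWriter("partial-results.txt", true);
                     out.write(url + "\t" + text.length() + "\n");
                     out.flush();
                     out.close();
                 } catch (IOException e) {
                     e.printStackTrace();
                 }
             }
         }

         // Controller.java
         import edu.uci.ics.crawler4j.crawler.CrawlController;

         public class Controller {
             public static void main(String[] args) throws Exception {
                 // Folder where crawler4j keeps its intermediate crawl data.
                 CrawlController controller = new CrawlController("/tmp/crawl-data");
                 // Seed page (the "URL start page" input from the Inputs list below).
                 controller.addSeed("http://en.wikipedia.org/wiki/Main_Page");
                 // Start 10 crawler threads; tune this to your machine.
                 controller.start(MyCrawler.class, 10);
             }
         }

         To keep a multi-day crawl alive on a remote machine you might launch it with something like "nohup java -Xmx1024M Controller &" (see items 3 and 5 above); nohup keeps the process running after your terminal disconnects.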
    3. Inputs
      1. A start page URL (the seed set)
      2. A regular expression
        1. A page is only crawled if its URL matches this regular expression.
    4. Outputs
      1. Find the 10 longest palindromes in English content pages of Wikipedia, excluding pages that are about palindromes. Present them in a table with the source page from which each came (see the detection sketch at the end of this section).
      2. Find the 10 longest lipograms (omitting the letter "E"/"e") in English content pages of Wikipedia, excluding pages that are about lipograms. Present them in a table with the source page from which each came.
      3. Find the 10 rhopalics with the greatest number of words in English content pages of Wikipedia. Present them in a table with the source page from which each came.
      4. Number of Pages Crawled.
      5. Number of Links found in these pages.
      6. Total Size of the Downloaded Content.
      7. Total Size of the Extracted Text of the Pages.
      8. What was the docid of the Page with this URL in your crawl: http://en.wikipedia.org/wiki/Barack_Obama
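    5. Hints for detecting the sequences
      1. The three sequence outputs boil down to scanning the extracted text of each page for character and word patterns. Below is a minimal sketch, using our own placeholder class and method names, of one straightforward (not necessarily efficient) test for each property; you would call these from your visit(...) method over sentences or sliding word windows of the page text, keeping only the current top 10 of each kind so the whole crawl never has to fit in memory.

         import java.util.ArrayList;
         import java.util.List;

         public class WordPlay {

             // A palindrome reads the same forwards and backwards; here we
             // ignore case and any non-letter characters.
             public static boolean isPalindrome(String s) {
                 String letters = s.toLowerCase().replaceAll("[^a-z]", "");
                 return letters.length() > 0
                         && letters.equals(new StringBuilder(letters).reverse().toString());
             }

             // A lipogram on "E"/"e" is a passage that contains no letter e at all.
             public static boolean isLipogram(String s) {
                 return s.indexOf('e') < 0 && s.indexOf('E') < 0;
             }

             // A rhopalic passage is one in which each word has one more
             // letter than the word before it; this returns the longest such
             // run of consecutive words.
             public static List<String> longestRhopalicRun(String[] words) {
                 List<String> best = new ArrayList<String>();
                 List<String> current = new ArrayList<String>();
                 int prevLen = -1;
                 for (String w : words) {
                     int len = w.replaceAll("[^A-Za-z]", "").length();
                     if (current.isEmpty() || len == prevLen + 1) {
                         current.add(w);
                     } else {
                         current = new ArrayList<String>();
                         current.add(w);
                     }
                     prevLen = len;
                     if (current.size() > best.size()) {
                         best = new ArrayList<String>(current);
                     }
                 }
                 return best;
             }
         }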
  4. Submitting your assignment
    1. We are going to use checkmate.ics.uci.edu to submit this assignment.
    2. Make the file name <StudentID>-<StudentID>-<StudentID>-Assignment03.pdf
  5. Evaluation:
    1. (90%) Produce the palindromes, lipograms and rhopalics, with their source URLs, from part 3.
      1. Grades will be assigned according to the length of the sequences you find compared to the longest known sequences.
    2. (10%) Adhering to administrative requirements
  6. Train your group
    1. Each member of your group must be able to run your architecture on their own for Assignment 04.