Computer Science 221: Information Retrieval

Winter 2009-2010

Department of Informatics

Donald Bren School of Information and Computer Sciences


Assignment 02

  1. Goals
    1. To teach you how difficult crawling the web is in terms of scale and time requirements. The software itself is not the hard part of this assignment; the hard part is managing all the data that comes back. This is only one web domain (Wikipedia), after all.
    2. To teach you how difficult it is to process text that is not written for computers to consume. Again, this is a very structured domain as far as text goes.
    3. To make the point that web-scale, web-centric activities do not lend themselves to "completeness". In some sense you are never done, so thinking about web algorithms in terms of "finishing" does not make sense. You have to change your mindset to "best possible given the resources".
  2. Java program (100%)
    1. Write a program to crawl Wikipedia.
    2. Administration
      1. You may work in teams of 1 or 2.
      2. If you do not have access to your own resources, run your program on an openlab machine in the ICS unix environment.
        1. This program should be able to access "/extra/grad_space" as per these instructions.
    3. Suggested approach
      1. Use crawler4j as your crawling engine. It is a java library.
      2. Follow the instructions at http://code.google.com/p/crawler4j/ and create your MyCrawler and Controller classes (a minimal sketch of these two classes appears at the end of this list).
      3. Remember to set the maximum Java heap size to an acceptable value. For example, if your machine has 2 GB of RAM, you may want to assign 1 GB to your crawlers by adding -Xmx1024M to your java command-line parameters.
      4. Make sure that you dump your partial results to permanent storage as you crawl. For example, you can write palindromes longer than 20 characters to a file and, after the crawl finishes, report the 10 longest ones. It is always possible that your crawl crashes in the meantime or that you pass the deadline, so it is good practice to keep your partial results in a file. Also make sure to flush the file whenever you write to it.
      5. If you're crawling on a remote Linux machine such as the openlab machines, make sure to use the nohup command. Otherwise your crawl, which may take several days, will be killed if your connection drops for even a second. Search the web for how to use this command.
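      Note: below is a minimal sketch of the two classes, loosely following the crawler4j sample code of that era. Package paths, constructor arguments, and method names differ between crawler4j releases, and the storage directory, seed URL, and URL filter are placeholders, so treat this as a starting point to adapt rather than a drop-in solution.

        // MyCrawler.java -- sketch only; adjust imports and method names to your crawler4j version
        import java.io.FileWriter;
        import java.io.PrintWriter;

        import edu.uci.ics.crawler4j.crawler.Page;
        import edu.uci.ics.crawler4j.crawler.WebCrawler;
        import edu.uci.ics.crawler4j.url.WebURL;

        public class MyCrawler extends WebCrawler {

            // Placeholder filter: in the real assignment, apply the regular
            // expression given as input (see the Inputs section below).
            public boolean shouldVisit(WebURL url) {
                return url.getURL().startsWith("http://en.wikipedia.org/wiki/");
            }

            public void visit(Page page) {
                String url  = page.getWebURL().getURL();
                int docid   = page.getWebURL().getDocid(); // crawler4j's document id (output 7); name may vary by version
                String text = page.getText();              // extracted text; name may vary by version
                int links   = page.getURLs().size();       // outgoing links; name may vary by version

                // Dump partial results immediately and flush, so a crash or a
                // missed deadline does not lose everything found so far.
                // (Reopened per page for simplicity; for a real crawl, share a
                // synchronized writer or give each crawler thread its own file.)
                try {
                    PrintWriter out = new PrintWriter(new FileWriter("partial-results.txt", true));
                    out.println(docid + "\t" + url + "\t" + text.length() + "\t" + links);
                    out.flush();
                    out.close();
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }

        // Controller.java
        import edu.uci.ics.crawler4j.crawler.CrawlController;

        public class Controller {
            public static void main(String[] args) throws Exception {
                // Intermediate crawl data goes on the big disk (path is an example; use your own directory).
                CrawlController controller = new CrawlController("/extra/grad_space/yourid/crawl");
                controller.addSeed("http://en.wikipedia.org/wiki/Main_Page");
                controller.start(MyCrawler.class, 10); // 10 concurrent crawler threads
            }
        }

      On an openlab machine you might then launch it with something like "nohup java -Xmx1024M -cp .:crawler4j.jar Controller > crawl.log 2>&1 &" (jar names illustrative; crawler4j's dependency jars also belong on the classpath), which combines the heap-size and nohup advice above.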
    4. Inputs
      1. A start URL (the seed set)
      2. A regular expression
        1. Pages are crawled only if their URL matches this regular expression (an illustrative pattern appears after this list).
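      Note: the pattern below is only an illustration of the kind of filter this input describes (restricting the crawl to English Wikipedia article pages and skipping namespace pages such as Talk: or Special:); the actual seed and expression you use are up to you.

        import java.util.regex.Pattern;

        public class UrlFilter {
            // Illustrative pattern: English Wikipedia article pages only.
            private static final Pattern ARTICLE =
                    Pattern.compile("^http://en\\.wikipedia\\.org/wiki/[^:]+$");

            public static boolean shouldCrawl(String url) {
                return ARTICLE.matcher(url).matches();
            }

            public static void main(String[] args) {
                System.out.println(shouldCrawl("http://en.wikipedia.org/wiki/Barack_Obama"));    // true
                System.out.println(shouldCrawl("http://en.wikipedia.org/wiki/Talk:Palindrome")); // false
            }
        }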
    5. Outputs
      1. Find the 10 longest palindromes in English content pages of Wikipedia that are not on a page about palindromes. Present them in a table with the source page from which each came.
      2. Find the 10 rhopalics (sequences in which each successive word is one letter or syllable longer than the one before) with the most words in English content pages of Wikipedia. Present them in a table with the source page from which each came (a detection sketch for both of these outputs follows this list).
      3. Number of pages crawled.
      4. Number of links found in these pages.
      5. Total size of the downloaded content.
      6. Total size of the extracted text of the pages.
      7. What was the docid of the page with this URL in your crawl: http://en.wikipedia.org/wiki/Barack_Obama
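      Note: the helpers below sketch one way to detect the two kinds of sequences, assuming a letters-only, case-insensitive definition of a palindrome and a letter-count definition of a rhopalic (each word exactly one letter longer than the previous); the assignment leaves the exact definitions to you, so adjust and document your choices.

        import java.util.ArrayList;
        import java.util.List;

        public class SequenceFinder {

            // Longest palindrome in the text, computed on a normalized copy
            // (lower-cased letters and digits only) with expand-around-center.
            // O(n^2) in the worst case; fine as a sketch, but consider a faster
            // method (e.g. Manacher's algorithm) for a multi-day crawl.
            public static String longestPalindrome(String text) {
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < text.length(); i++) {
                    char c = text.charAt(i);
                    if (Character.isLetterOrDigit(c)) sb.append(Character.toLowerCase(c));
                }
                String s = sb.toString();
                String best = "";
                for (int center = 0; center < s.length(); center++) {
                    for (int odd = 0; odd <= 1; odd++) {
                        int lo = center, hi = center + odd;
                        while (lo >= 0 && hi < s.length() && s.charAt(lo) == s.charAt(hi)) {
                            lo--;
                            hi++;
                        }
                        if (hi - lo - 1 > best.length()) best = s.substring(lo + 1, hi);
                    }
                }
                return best;
            }

            // Longest rhopalic run: consecutive words whose lengths grow by
            // exactly one letter at each step.
            public static List<String> longestRhopalicRun(String text) {
                String[] words = text.split("[^A-Za-z]+");
                List<String> best = new ArrayList<String>();
                List<String> run = new ArrayList<String>();
                for (String w : words) {
                    if (w.length() == 0) continue;
                    if (!run.isEmpty() && w.length() == run.get(run.size() - 1).length() + 1) {
                        run.add(w);
                    } else {
                        run = new ArrayList<String>();
                        run.add(w);
                    }
                    if (run.size() > best.size()) best = new ArrayList<String>(run);
                }
                return best;
            }

            public static void main(String[] args) {
                String sample = "Was it a car or a cat I saw? I do not know where family doctors acquired illegibly perplexing handwriting.";
                System.out.println(longestPalindrome(sample));  // wasitacaroracatisaw
                System.out.println(longestRhopalicRun(sample)); // [I, do, not, know, where, family, doctors, acquired, illegibly, perplexing, handwriting]
            }
        }

      Remember that output 1 excludes pages about palindromes, so check the source page's title or URL before recording a hit.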
  3. Submitting your assignment
    1. Turn your report in to the dropbox created for this assignment on EEE.
    2. Name the file <StudentID>-<StudentID>-Assignment02.pdf.
    3. Please include the full names of all of your group members, as appropriate, in the document.
  4. Evaluation
    1. Producing the palindromes, rhopalics, and their source URLs as specified in the Outputs section.
      1. A portion of the grade will be assigned according to the length of the sequences you find compared to the longest known sequences.
    2. Adhering to the administrative requirements.