Informatics 141 / Computer Science 121: Information Retrieval

Assignment 03

Winter 2008

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine


Due 2/8/2008

  1. Java Program (100%)
    1. Administration
      1. You may work in teams of 1, 2 or 3 people. Please register your team with Jam.
    2. Write a program to crawl the web.
      1. Inputs
        1. A start-page URL (the seed set)
        2. A regular expression
          1. A page is crawled only if its URL matches this regular expression.
      2. Output
        1. A graph of the crawled pages
        2. An index
          1. of (term, document) pairs
          2. of (document, image) pairs
      3. Structure
        1. You should use two libraries:
          1. WebSphinx
            1. This is a crawler library
            2. Do not use the "crawler workbench".
            3. Compose the necessary components (shown in green in the architecture figure) to build the architecture shown in that figure.
          2. WebGraph
            1. This will build a compressed adjacency graph for you
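The regular-expression input described above can be applied as a simple gate on every discovered link. The following is a minimal sketch in plain Java, not the WebSphinx API: the class name `UrlFilter`, its methods, and the example pattern are all hypothetical, and it assumes that "matches" means the whole URL must match the expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: decides whether a discovered link may be crawled. */
class UrlFilter {
    private final Pattern allowed;

    UrlFilter(String regex) {
        this.allowed = Pattern.compile(regex);
    }

    /** True only if the whole URL matches the regular expression (an assumption). */
    boolean shouldCrawl(String url) {
        return allowed.matcher(url).matches();
    }

    /** Keeps only the links that pass the filter, preserving their order. */
    List<String> filter(List<String> links) {
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            if (shouldCrawl(link)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example pattern (an assumption): English Wikipedia article pages only.
        UrlFilter f = new UrlFilter("http://en\\.wikipedia\\.org/wiki/.*");
        System.out.println(f.shouldCrawl("http://en.wikipedia.org/wiki/Bubonic_plague")); // prints true
        System.out.println(f.shouldCrawl("http://de.wikipedia.org/wiki/Pest"));           // prints false
    }
}
```

If a partial match is intended instead, `matcher(url).find()` would be the call to swap in; the rest of the sketch stays the same.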
    3. [Figure: Assignment 03 architecture]
    4. Using this architecture, search for the following features:
      1. Find the longest palindrome in Wikipedia that is not on a page about palindromes.
      2. Find the longest lipogram (letter s) in Wikipedia that is not on a page about lipograms.
      3. Find the longest rhopalic in Wikipedia.
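The three feature tests above can each be reduced to a small predicate over a candidate string. The sketch below is one possible reading, under stated assumptions: the class `WordPatterns` and its method names are hypothetical; the palindrome test ignores case and anything that is not a letter or digit; the lipogram test checks for a single avoided letter in either case; and the rhopalic test counts letters (not syllables) and requires at least two words. How candidate sequences are cut out of a page is left open.

```java
/** Hypothetical predicates for the three features; normalization rules are assumptions. */
class WordPatterns {

    /** Palindrome test over letters and digits only, ignoring case. */
    static boolean isPalindrome(String text) {
        int i = 0;
        int j = text.length() - 1;
        while (i < j) {
            char a = text.charAt(i);
            char b = text.charAt(j);
            if (!Character.isLetterOrDigit(a)) { i++; continue; }
            if (!Character.isLetterOrDigit(b)) { j--; continue; }
            if (Character.toLowerCase(a) != Character.toLowerCase(b)) {
                return false;
            }
            i++;
            j--;
        }
        return true;
    }

    /** Lipogram test: the text never uses the avoided letter, in either case. */
    static boolean isLipogram(String text, char avoided) {
        char lower = Character.toLowerCase(avoided);
        for (int i = 0; i < text.length(); i++) {
            if (Character.toLowerCase(text.charAt(i)) == lower) {
                return false;
            }
        }
        return true;
    }

    /** Rhopalic test by letter count: each word is exactly one letter longer than the previous one. */
    static boolean isRhopalic(String text) {
        int previous = -1;
        int words = 0;
        // Split on runs of non-letters; note this also splits words at apostrophes.
        for (String w : text.split("[^\\p{L}]+")) {
            if (w.isEmpty()) continue;
            if (previous >= 0 && w.length() != previous + 1) {
                return false;
            }
            previous = w.length();
            words++;
        }
        return words >= 2;
    }

    public static void main(String[] args) {
        System.out.println(isPalindrome("A man, a plan, a canal: Panama")); // prints true
        System.out.println(isLipogram("a quick brown fox", 's'));           // prints true
        System.out.println(isRhopalic("I am the last"));                    // prints true
    }
}
```

Because grading depends on length, a crawler built around these predicates would keep only the longest passing candidate per feature, together with the URL of the page it came from.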
    5. After crawling the web, use your web graph to calculate the shortest path
      1. from: http://en.wikipedia.org/wiki/Irvine%2C_California
      2. to: http://en.wikipedia.org/wiki/Bubonic_plague
      3. that stays in the English pages of Wikipedia.org
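Since every link counts as one step, the shortest path described above can be found with a breadth-first search over the link graph. The sketch below is illustrative only: it uses a plain `Map` of adjacency lists as a stand-in for the compressed graph that WebGraph builds, the class `ShortestLinkPath` is hypothetical, and the page names `A` to `D` are placeholders rather than real crawl data. Keeping the path within the English pages is assumed to be handled earlier, by crawling only URLs that pass the regular expression.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical breadth-first search over an in-memory link graph. */
class ShortestLinkPath {

    /** Returns the pages on a shortest path from 'from' to 'to', or an empty list if none exists. */
    static List<String> shortestPath(Map<String, List<String>> links, String from, String to) {
        Map<String, String> parent = new HashMap<>(); // page -> page it was first reached from
        Deque<String> queue = new ArrayDeque<>();
        parent.put(from, null);
        queue.add(from);

        while (!queue.isEmpty()) {
            String page = queue.remove();
            if (page.equals(to)) {
                // Walk the parent links back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String p = to; p != null; p = parent.get(p)) {
                    path.add(p);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : links.getOrDefault(page, Collections.<String>emptyList())) {
                if (!parent.containsKey(next)) { // first visit is along a shortest route
                    parent.put(next, page);
                    queue.add(next);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        // Placeholder graph, not real Wikipedia links.
        Map<String, List<String>> links = new HashMap<>();
        links.put("A", Arrays.asList("B", "C"));
        links.put("B", Arrays.asList("D"));
        links.put("C", Arrays.asList("D"));
        System.out.println(shortestPath(links, "A", "D")); // prints [A, B, D]
    }
}
```

When several shortest paths exist, this sketch returns whichever one is discovered first, which depends on the order of the adjacency lists.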
    6. Evaluation:
      1. Produce the palindrome, lipogram, and rhopalic, each with its source URL, from the feature search (item 4).
        1. Grades will be assigned according to the length of the sequence.
      2. Produce your sequence of URLs from the shortest-path calculation (item 5).
        1. Show this sequence as a collection of screenshots indicating the path so that the instructors can verify the path manually.
        2. Show the anchor text and the URL so that it is easy to verify.
        3. Grades will be assigned according to the length of the sequence.
      3. Send your results to the TA
    7. Train your group
      1. Each member of your group must be able to run your architecture on their own for Assignment 04.
      2. Prepare for Assignment 04.
    8. Submit a group self-evaluation
      1. Format to be determined