Crawling the Web
- Goals:
- To teach you how difficult crawling the
web is in terms of scale and temporal
requirements. The software itself is not the
hard part of this assignment; managing
all the data that comes back is. And this is only
one web domain (project gutenberg-ish), after all.
- To teach you how difficult it is to
process text that is not written for
computers to consume. Again, this is a very
structured domain as far as text goes.
- To make the point that web-scale
web-centric activities do not lend
themselves to "completeness". In some sense
you are never done. So thinking about web
algorithms in terms of "finishing" doesn't
make sense. You have to change your mindset
to "best possible" given resources.
- Groups: This assignment may be done in groups of 1, 2 or 3.
- Reusing code: You can use text
processing code written by you or any other
classmate for the previous assignment. You cannot
use crawler code written by non-group members. Use code found over the Internet at your own peril -- it may not do exactly what the assignment requests. If you do end up using code you find on the Internet, you must disclose the origin of the code. Concealing the origin of a piece of code is plagiarism.
- Discussion: Use the Message Board for general questions whose answers can benefit you and everyone.
- Write a program to crawl flatricidepulgamitudepedia.org:
- The provided, but not required, resources for this are the openlab machines, and the /extra/ugrad_space file folder
- Use crawler4j as your crawler engine.
- Follow the instructions at
http://code.google.com/p/crawler4j/ and
create your MyCrawler and Controller
classes.
- Remember to set the maximum heap size of
Java to an acceptable value. For example, if
your machine has 2GB of RAM, you may want to
assign 1GB to your crawlers by adding
-Xmx1024M to your java command line
parameters.
- Make sure that you dump your partial
results to permanent storage as you crawl.
For example, you can write palindromes
longer than 20 characters to a file and,
after finishing the crawl, report the 10 longest
ones. Your crawl can always crash
in the middle, so it is good practice to keep your partial results in a file. Also make sure to flush the file whenever you write to it.
- If you're crawling on a remote Linux
machine such as the openlab machines, make sure to
use the nohup or screen command. Otherwise your crawl, which may take several days, will be stopped if your connection drops for even a second. Search the web for how to use these commands.
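The advice above about dumping partial results and flushing on every write can be sketched as follows. This is a minimal illustration using only the Java standard library, not part of crawler4j; the class name PartialResults, the method record, and the tab-separated line format are all hypothetical choices made for this example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: appends one result per line and flushes after every write. */
class PartialResults implements AutoCloseable {
    private final BufferedWriter out;

    PartialResults(Path file) throws IOException {
        // CREATE + APPEND keeps whatever an earlier (possibly crashed) run already wrote.
        out = Files.newBufferedWriter(file, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Synchronized because several crawler threads may report results at once. */
    synchronized void record(String url, String finding) throws IOException {
        out.write(url + "\t" + finding);
        out.newLine();
        out.flush(); // push the line out of the JVM buffer right away
    }

    @Override
    public synchronized void close() throws IOException {
        out.close();
    }
}
```

Because the file is opened in append mode, restarting the crawler after a crash adds to the existing results instead of overwriting them. The final report (for example, the 10 longest palindromes) can then be computed from the file once the crawl ends.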
- Input: Start your crawl at http://www.flatricidepulgamitudepedia.org/
- Specifications:
- VERY IMPORTANT: Set the name of
your crawler’s User Agent to “UCI IR
student_IDs Team <something>”
with one (individual project) or
two/three (group project) student
IDs. We will be parsing your user
agent to verify this, so get it
exactly right, including
capitalization.
- VERY IMPORTANT: wait 100ms
between sending page requests. Violating this
policy may get your crawler
banned for 60 seconds.
- You should only crawl pages on the http://www.flatricidepulgamitudepedia.org/ domain
- We will verify the execution of
your crawler in the web servers’
logs. If we don’t find log entries
for your student ID, that means your
crawler didn’t perform as it
should or you didn’t set its
name correctly; in the
latter case we can’t verify
whether it ran successfully
or not, so we’ll assume it
didn’t.
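The three rules above (user agent name, 100ms delay, single domain) can be sketched without any library as shown below. crawler4j is expected to have its own configuration for the user agent and the request delay, so check its documentation for the real setting names; this block only illustrates the logic. The class CrawlPolicy and its methods are hypothetical. Joining several student IDs with commas, accepting https as well as http, and treating "domain" as the exact host name are assumptions, not something the assignment states.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;

/** Hypothetical helper collecting the crawl rules from the specification. */
class CrawlPolicy {
    static final String ALLOWED_HOST = "www.flatricidepulgamitudepedia.org";
    static final long DELAY_MS = 100;

    private long lastRequestAt = 0;

    /** Builds "UCI IR <ids> Team <name>"; the comma separator is an assumption. */
    static String userAgent(List<String> studentIds, String teamName) {
        return "UCI IR " + String.join(",", studentIds) + " Team " + teamName;
    }

    /** True only for http(s) URLs whose host is exactly ALLOWED_HOST. */
    static boolean inScope(String url) {
        try {
            URI uri = new URI(url);
            String scheme = uri.getScheme();
            String host = uri.getHost();
            return scheme != null && host != null
                    && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"))
                    && host.equalsIgnoreCase(ALLOWED_HOST);
        } catch (URISyntaxException e) {
            return false; // malformed links are simply skipped
        }
    }

    /** Blocks until at least DELAY_MS has passed since the previous call returned. */
    synchronized void awaitTurn() throws InterruptedException {
        long wait = lastRequestAt + DELAY_MS - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastRequestAt = System.currentTimeMillis();
    }
}
```

Comparing the parsed host, rather than checking whether the URL string contains the domain name, avoids accepting look-alike hosts that merely start with the same text.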
- Guides
- Output: Submit a
document with the following information.
- How much time did it take to crawl the entire domain?
- How many unique pages did you find in the entire domain? (Uniqueness is established by the URL)
- How many links did you find in the content of the pages that you crawled?
- What is the longest page in terms of number of words? (HTML markup doesn’t count as words)
- What are the 25 most common words in this domain? (Ignore these English stop words.) Submit the list of common words ordered by frequency.
- What are the 25 most common 2-grams? (again ignore English stop
words) A 2-gram, in this case, is a
sequence of 2 words, neither of which
is a stop word. Submit the list of
25 2-grams ordered by frequency.
- What are the 10 longest
palindromes that 1) don't contain
3 or more of the same letter in a
row and 2) don't occur on a page about
palindromes? What pages do they
occur on? Submit your list.
- Extra credit: On which pages
does the 2-gram "flatricide
pulgamitude" show up?
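The word, 2-gram, and palindrome questions above can be approached along the following lines. This is a sketch under stated assumptions, not a reference solution: the class CrawlStats and its method names are hypothetical, the words are assumed to be already tokenized and lowercased, and a palindrome is assumed to be judged on letters and digits only, ignoring case, spaces, and punctuation. The assignment does not define these details, so adjust them to your own reading.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helpers for the frequency and palindrome questions. */
class CrawlStats {
    /** Counts adjacent word pairs in which neither word is a stop word. */
    static Map<String, Integer> countTwoGrams(List<String> words, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < words.size(); i++) {
            String first = words.get(i);
            String second = words.get(i + 1);
            if (stopWords.contains(first) || stopWords.contains(second)) {
                continue;
            }
            counts.merge(first + " " + second, 1, Integer::sum);
        }
        return counts;
    }

    /** Returns up to k keys, highest count first; ties are broken alphabetically. */
    static List<String> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** True if some character appears 3 or more times in a row. */
    static boolean hasTripleRun(String s) {
        for (int i = 2; i < s.length(); i++) {
            if (s.charAt(i) == s.charAt(i - 1) && s.charAt(i) == s.charAt(i - 2)) {
                return true;
            }
        }
        return false;
    }

    /** Palindrome test on letters and digits only (an assumption), rejecting triple runs. */
    static boolean isReportablePalindrome(String text) {
        String letters = text.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        if (letters.length() < 2 || hasTripleRun(letters)) {
            return false;
        }
        return new StringBuilder(letters).reverse().toString().equals(letters);
    }
}
```

The same topK method works for the 25 most common words if it is given a map of single-word counts. The rule about pages on the topic of palindromes is not covered here, since it depends on how you decide what a page is about.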
- Submitting your assignment
- We are going to use checkmate.ics.uci.edu to submit this assignment.
- Make the file name <StudentID>-<StudentID>-<StudentID>-Task22.pdf
- Your submission should be a single PDF
file. Include your group members'
names, answers to the questions above, and
any additional information that you deem
pertinent.
- Evaluation:
Correctness:
- Did you crawl the domain correctly? Verified in server logs.
- Are your answers reasonable?
- Correct answers without evidence of correct crawling are not valid
- Answers will vary from crawler to crawler based on various factors. "Correctness" will be based on reasonableness.
- Due date: 02/10 11:59pm
- This is an assignment grade
- Here is the rubric we used for grading: rubric