Topic

1/27/2009 Lecture Notes

Kick-off:

No one as Irish as Barack O'Bama

Learning Objectives:

Explain the principles behind the design of the Mercator URL frontier architecture

Explain the things that we are interested in capturing during a crawl

Vector Space Model

Posting list

Connectivity Graph

Update on Assignment #3 and schedule

crawler4j is now on version 1.0.3

Group members: post your names

Less than one week to complete

How are we doing?

Cards

Review Cards

How fast is fast enough for palindromes?

Do crawlers continue to crawl forever?

Should our crawler crawl forever?

What are some cool things crawlers can do?

Send you alerts when they find something

When does a crawler use front queues vs. back queues?

What about for our assignment?

What do we do when the back queues fill up?

Where do the URLs come from that need to be crawled?

seed set then outlinks

What are duplication and shingling?
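A minimal sketch of shingling for near-duplicate detection, assuming word-level shingles of length k = 4 and Jaccard overlap as the similarity measure. The function names shingles and jaccard are made up for illustration, and this is not how crawler4j or any particular crawler does it:

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word k-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Assumption: a text shorter than k words counts as one short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two shingle sets."""
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


doc1 = "a rose is a rose is a rose"
doc2 = "a rose is a rose"
similarity = jaccard(shingles(doc1), shingles(doc2))
```

Two pages whose similarity exceeds some chosen threshold (say 0.9, an arbitrary value here) could be treated as near duplicates. Comparing every pair of full shingle sets does not scale to a large crawl, so this shows only the idea.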

Why is crawling so complicated?

scale

social contracts

adversarial technology

need for robustness

Is Mercator the best architecture for crawling?

How is round robin biased toward highest priority?

Do crawlers crawl randomly?

What's the host splitter about?

How do the queues recognize a web page loop?
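Several of the cards above (where URLs come from, the host splitter, page loops) can be tied together in one toy sketch. It is a heavy simplification, not the Mercator design itself: there are no prioritized front queues and no politeness timing, hosts are simply served round robin, and loops are avoided with a plain in-memory set of seen URLs. The class name Frontier and its methods are hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


class Frontier:
    """Toy URL frontier: one FIFO back queue per host, hosts served round robin."""

    def __init__(self, seeds):
        self.seen = set()        # every URL ever added; stops page loops
        self.back_queues = {}    # host -> deque of pending URLs for that host
        self.hosts = deque()     # hosts that currently have pending URLs
        for url in seeds:        # the crawl starts from the seed set
            self.add(url)

    def add(self, url):
        """Add a seed or outlink; return False if it was already seen."""
        if url in self.seen:
            return False
        self.seen.add(url)
        # "Host splitter": route the URL to the queue for its host.
        host = urlsplit(url).netloc.lower()
        if host not in self.back_queues:
            self.back_queues[host] = deque()
            self.hosts.append(host)
        self.back_queues[host].append(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None when the frontier is empty."""
        if not self.hosts:
            return None
        host = self.hosts.popleft()
        queue = self.back_queues[host]
        url = queue.popleft()
        if queue:
            self.hosts.append(host)      # host goes to the back of the line
        else:
            del self.back_queues[host]   # no pending URLs left for this host
        return url
```

A crawl loop would call next_url(), fetch the page, and pass each outlink to add(). A link back to an already visited page is rejected by the seen set, so the crawler does not go around a loop twice. An actual crawler would also normalize URLs before the seen check and wait between requests to the same host.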

Video break

nru