Topic
V 1/27/2009 Lecture Notes
V Kick-off:
V No one as Irish as Barack O'Bama
V Learning Objective:
* Explain the principles behind the design of the Mercator URL frontier architecture
V Explain the things that we are interested in capturing during a crawl
V Vector Space Model
* Posting list
* Connectivity Graph
V Update on Assignment #3 and schedule
* crawler4j is now on version 1.0.3
* Group members: post your names
V Less than one week to complete
* How are we doing?
* Cards
V Review Cards
* How fast is fast enough for palindromes?
V Do crawlers continue to crawl forever?
* Should our crawler crawl forever?
V What are some cool things crawlers can do?
* Send you alerts when they find something
V When does a crawler use front queues vs. back queues?
* What about for our assignment?
* What do we do when the back queues fill up?
V Where do the URLs come from that need to be crawled?
* seed set then outlinks
* What are duplication and shingling?
V Why is crawling so complicated?
* scale
* social contracts
* adversarial technology
* need for robustness
* Is Mercator the best architecture for crawling?
* How is round robin biased toward highest priority?
* Do crawlers crawl randomly?
* What's the host splitter about?
* How do the queues recognize a web page loop?
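Several of the cards above (front queues vs. back queues, the round-robin bias, the host splitter, page loops) can be pictured with a toy frontier. The following is a minimal sketch only, not Mercator's actual implementation: the class name `Frontier`, its methods, the weights, and the two-second delay are all assumptions made for illustration. It also assumes an unbounded number of back queues (one per host), whereas a real frontier keeps a fixed number, and it folds duplicate-URL elimination into `add`, which is a simplification.

```python
import heapq
import random
from collections import deque
from urllib.parse import urlparse


class Frontier:
    """Toy Mercator-style URL frontier (hypothetical names, illustration only)."""

    def __init__(self, num_priorities=3, politeness_delay=2.0, seed=0):
        # Front queues: one FIFO per priority level; index 0 is the highest.
        self.front = [deque() for _ in range(num_priorities)]
        # Back queues: one FIFO per host, so each host is rate-limited on its own.
        self.back = {}
        # Min-heap of (earliest_allowed_fetch_time, host) for non-empty back queues.
        self.heap = []
        # Remembers when each host may next be contacted, even after its queue empties.
        self.next_allowed = {}
        # URLs already accepted; this set, not the queues, is what stops page loops.
        self.seen = set()
        self.delay = politeness_delay
        self.rng = random.Random(seed)

    def add(self, url, priority):
        if url in self.seen:
            return  # already queued or crawled, so a link cycle adds nothing new
        self.seen.add(url)
        self.front[priority].append(url)

    def _pick_front_queue(self):
        # Biased selection: every non-empty queue can be chosen, but queue i
        # gets weight (n - i), so higher priorities are drawn more often.
        candidates = [i for i, q in enumerate(self.front) if q]
        if not candidates:
            return None
        weights = [len(self.front) - i for i in candidates]
        return self.rng.choices(candidates, weights=weights)[0]

    def _refill(self, now):
        # Host splitter: route each URL to the back queue for its host.
        while True:
            i = self._pick_front_queue()
            if i is None:
                return
            url = self.front[i].popleft()
            host = urlparse(url).netloc
            if host not in self.back:
                self.back[host] = deque()
                ready_at = self.next_allowed.get(host, now)
                heapq.heappush(self.heap, (ready_at, host))
            self.back[host].append(url)

    def next_url(self, now):
        """Return a URL that is polite to fetch at time `now`, or None."""
        self._refill(now)
        if not self.heap or self.heap[0][0] > now:
            return None
        _, host = heapq.heappop(self.heap)
        url = self.back[host].popleft()
        self.next_allowed[host] = now + self.delay
        if self.back[host]:
            heapq.heappush(self.heap, (self.next_allowed[host], host))
        else:
            del self.back[host]
        return url
```

In this sketch, front queues decide what is worth fetching and back queues decide when a given host may be contacted. With two URLs queued for one host and one URL for another, repeated calls to `next_url(0.0)` hand out one URL per host and then return `None` until the delay has passed.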
V Video break
V nru