Computer Science 221: Information Retrieval

Winter 2009-2010

Department of Informatics

Donald Bren School of Information and Computer Sciences


Week 1: Web Search Basics

Tuesday (Lecture 1) (1/5): Learning Objective: To understand the scope and objectives of this course

Due Today:

1. Wikipedia entry on Vannevar Bush

2. "As We May Think" The Atlantic Monthly, July, 1945. (reprinted in ACM CHI Interactions, March 1996)

3. Textbook Chapter 19: Web Search Basics

4."Simple, Proven Approaches to Text Retrieval" by Robertson and Jones. Commentary: "This paper provides a brief but well informed and technically accurate overview of the state of the art in text retrieval, at least up to 1997. It introduces the ideas of terms and matching, term weighting strategies, relevance weighting, a little on data structures and the evidence for their effectiveness. In my view it does an exemplary job of introducing the terminology of IR and the main issues in text retrieval for a numerate and technically well informed audience. It also has a very well chosen list of references."

Notes and artifacts from class:

Go over names  
Go over syllabus  
Web Search Basics

Non-Quicktime version

Thursday (Lecture 2) (1/7): Learning Objective: To understand the social and economic history of search

Notes and artifacts from class:

Web Search Basics

Non-Quicktime version
Intermission

Week 2: Web Search Basics / Web Crawling and Indices

Sunday (1/10):

Due Today:

Tuesday (Lecture 3) (1/12): Learning Objective:

To understand how search engines meet the needs of their users.

To understand the impact of spam on search engines

Due Today:

1. Textbook Chapter 20: Web Crawling and Indices

2. "Stuff I’ve seen: A system for personal information retrieval and re-use " by (S. Dumais, E. Cutell, J. Cadiz, G. Jancke, R. Sarin, and D. Robbins, SIGIR, 2003) Commentary: "This paper addresses an increasingly important problem – how to search and manage personal collections of electronic information. ... it addresses an important user-centered problem. ...this paper presents a practical user interface to make the system useful. ..., the paper includes large scale, user-oriented testing that demonstrates the efficacy of the system. ..., the evaluation uses both quantitative and qualitative data to make its case. I think this paper is destined to be a classic because it may eventually define how people manage their files for a decade. Moreover, it is well-written and can serve as a good model for developers doing system design and evaluation, and for students learning about IR systems and evaluation."


Notes and artifacts from class:

User Needs & Spam

Non-Quicktime version

Lecture 03 Audio

Intermission

Thursday (Lecture 4) (1/14): Learning Objective:

To understand what is involved in web crawling

To understand an architecture for web crawling (Mercator)

Due Today:

Notes and artifacts from class:

Spam

Non-Quicktime version

Web Search Basics

Non-Quicktime version

Intermission: Discussion of article above

Week 3: Index Construction

Monday (1/18):

Due Today:

Quiz 01 (on readings from week 1 and week 2)

Tuesday (Lecture 5) (1/19): Learning Objective:

To understand what is involved in web crawling

To understand an architecture for frontier management (Mercator)

 

Notes and artifacts from class:

Web Crawling, DNS, URL Frontier Queue

Non-Quicktime version

Lecture 05 Audio

Intermission: Google Real-Time Search

Thursday (Lecture 6) (1/21): Learning Objective:

To understand what an index is and how it is created.

To understand what a Term-Document matrix is

To understand algorithms for creating an index

Notes and artifacts from class:

Connectivity Server

Non-Quicktime version

Index Construction

Non-Quicktime version


Lecture 06 Audio

Intermission

Week 4: Querying, Scoring

Sunday (1/24):

Due Today:

Tuesday (Lecture 7) (1/26): Learning Objective:

To understand MapReduce

Due Today:

1. Textbook Chapter 4: Index Construction

2. "The WebGraph Framework I: Compression Techniques " by (P. Boldi and S. Vigna, WWW 2004)

Abstract: "Studying web graphs is often difficult due to their large size. Recently,several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This papers presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks)in as little as 3.08 bits per link, and its transposed version in as littleas 2.89 bits per link.

3. "The Web As a Graph" by R. Kumar, P Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal, PODS 2000)

Abstract: "The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph has about a billion nodes today, several billion links, and appears to grow exponentially with time. There are many reasons—mathematical, sociological, and commercial—for studying the evolution of this graph. We first review a set of algorithms that operate on the Web graph, addressing problems from Web search, automatic community discovery, and classification. We then recall a number of measurements and properties of the Web graph. Noting that traditional random graph models do not explain these observations, we propose a new family of random graph models."

Notes and artifacts from class:

MapReduce

Non-Quicktime version


Lecture 07 Audio

Thursday (Lecture 8) (1/28): Learning Objective:

To look at how querying is supported by posting lists

Notes and artifacts from class:

Querying

Non-Quicktime version

Non-Quicktime version

 

Week 5: Scoring, Term Weighting and the Vector Space Model

Sunday (1/31):

Due Today:

Assignment 03 (Hadoop Light)

Tuesday (Lecture 9) (2/2): Learning Objective:

To understand techniques for producing ranked scores for queries

Due Today:

1. Textbook Chapter 1: Boolean Retrieval

2. Textbook Chapter 6: Scoring, term weighting & the vector space model

3. "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by (S. Brin and L. Page, WWW1998) Commentary: "This paper (and the work it reports) has had more impact on everyday life than any other in the IR area. A major contribution of the paper is the recognition that some relevant search results are greatly more valued by searchers than others. By reflecting this in their evaluation procedures, Brin and Page were able to see the true value of web-specific methods like anchor text. The paper presents a highly efficient, scalable implementation of a ranking method which now delivers very high quality results to a billion people over billions of pages at about 6,000 queries per second. It also hints at the technology which Google users now take for granted: spam rejection, high speed query-based summaries, source clustering, and context(location)-sensitive search. IR and bibliometrics researchers had done it all (relevance, proximity, link analysis, efficiency, scalability, summarization, evaluation) before 1998 but this paper showed how to make it work on the web. For any non-IR engineer attempting to build a web-based retrieval system from scratch, this must be the first port of call."

 

Notes and artifacts from class:

Vector Space Scoring

PDF

Non-Quicktime version

Thursday (Lecture 10) (2/4): Learning Objective:

To understand how a posting list lends itself to efficient computation

Notes and artifacts from class:

Vector Space Scoring and Efficient Computation

PDF

Non-Quicktime version

Week 6: Matrix Decompositions and Latent Semantic Indexing

Tuesday (Lecture 11) (2/9): Learning Objective: To understand the transformation from term space to semantic space

Notes and artifacts from class:

 

PDF

Non-Quicktime version

PDF

Non-Quicktime version

PDF

Non-Quicktime version

Intermission

Thursday (Lecture 12) (2/11): Learning Objective: To understand how to run a query on a simple LSA system; to understand the background of PageRank

Notes and artifacts from class:

Querying with LSI and PageRank

PDF

Non-Quicktime version

PDF

Non-Quicktime version

PDF

Non-Quicktime version

Week 7: Link Analysis

Tuesday (Lecture 13) (2/16): Learning Objective: To understand how PageRank is calculated

Due Today:

1. Textbook Chapter 18: Matrix Decompositions and Latent Semantic Analysis

2. Textbook Chapter 21: Link Analysis

3. "Unsupervised Named-Entity Extraction from the Web: An Experimental Study " (Etzioni, et.al.) :åThis paper represents a new generation of IR work that attempts to do more than build a bag of words for information retrieval, but also attempts to make some sense of the information as well.

 

Notes and artifacts from class:

PageRank

PDF

Non-Quicktime version

 

Class Admin Notes

Thursday (Lecture 14) (2/18): Learning Objective: To understand why evaluation is needed and how it can be done

Notes and artifacts from class:

Evaluation in IR

PDF

Non-Quicktime version

Intermission: TinEye

Week 8: Evaluation in Information Retrieval

Monday (2/22):

Originally due today, now due anytime before Assignment 05:

Assignment 04 (Hadoop Heavy)

Tuesday (Lecture 15) (2/23): Learning Objective: To understand how to conduct ranked and unranked evaluation of IR engines

Notes and artifacts from class:

Evaluation in IR

PDF

Non-Quicktime version

Thursday (Lecture 16) (2/25): Learning Objective: To understand how Aardvark social search works and differs from web search

Notes and artifacts from class:

Aardvark Social Search

PDF

Non-Quicktime version

Week 9: Social Search, Advances in PageRank, PubSubHubbub, TokyoCabinet, Twitter Hosebird

Sunday (2/28):

Due Today:

Assignment 05 (Querying)

Monday (3/1):

Due Today:

1. "The Anatomy of a Large-Scale Social Search Engine" (Horowitz, Kamvar) :"So our goal with our paper is to follow their example by providing a thorough presentation of the approach, architecture, algorithms, interfaces, and issues involved with Aardvark's new social search paradigm. "

2. "Link Analysis for Private Weighted Graphs" (Sakuma, Kobayashi): "Our solutions are designed as privacy-preserving expansions of well-known link analysis methods, PageRank and HITS. The outcomes of our protocols are completely equivalent to those of PageRank and HITS. Furthermore, our protocols theoretically guarantee that the private link informa-
tion possessed by each node is not revealed to other nodes."

3."The impact of crawl policy on web search effectiveness" (Sakuma, Kobayashi) "Crawl selection policy has a direct influence on Web search effectiveness, because a useful page that is not selected for crawling will also be absent from search results. Yet there has been little or no work on measuring this effect."

Tuesday (Lecture 17) (3/2): Learning Objective: To survey the state of the art in using and calculating PageRank

Notes and artifacts from class:

PageRank Guided Crawl Techniques

PDF

Non-Quicktime version

Thursday (Lecture 18) (3/4): Learning Objective: To be able to differentiate the technical features of PubSubHubbub and Twitter

Notes and artifacts from class:

Comparing PubSubHubbub and Twitter

PDF

Non-Quicktime version

Week 10: Alternative Vector Spaces, Social Search

Sunday (3/7):

Due Today:

1. "Using MapReduce to Compute PageRank " (Michael Nielsen)

2. "Link Analysis for Private Weighted Graphs" (Sakuma, Kobayashi): "Our solutions are designed as privacy-preserving expansions of well-known link analysis methods, PageRank and HITS. The outcomes of our protocols are completely equivalent to those of PageRank and HITS. Furthermore, our protocols theoretically guarantee that the private link informa-
tion possessed by each node is not revealed to other nodes."

3. "Good abandonment in mobile and PC internet search" (Li, Huffman, Tokuda)

4. "When more is less: the paradox of choice in search engine use" (Oulasvirta, Hukkinen, Schwartz)

5. "Sourcerer: mining and searching internet-scale software repositories" (Linstead, et.al.)

Tuesday (Lecture 19) (3/9): Learning Objective: Cool things going on with PageRank

Notes and artifacts from class:

Implementing PageRank with MapReduce

PDF

Non-Quicktime version

Link Analysis for Private Weighted Graphs

PDF

Non-Quicktime version

Thursday (Lecture 20) (3/11): Learning Objective: User factors in search engines and code search

Notes and artifacts from class:

Good Abandonment in Mobile and PC Search

PDF

Non-Quicktime version

When More is Less: The Paradox of Choice in Search Engine Use

PDF

Non-Quicktime version

Sourcerer: Mining and searching internet-scale software repositories

PDF

Non-Quicktime version


Finals Week:

Due this week:

* Assignment 06 (Alternative search engine)

* Quiz Questions

1. Write 10 multiple-choice questions on anything covered after Quiz 2.

2. Indicate where each question comes from.

3. Provide 4 possible answers.

4. Indicate the correct answer.

5. Questions can come from any combination of papers, the textbook, or lectures. There is no coverage requirement.

6. Submit a pdf file with your questions here: https://eee.uci.edu/toolbox/dropbox/index.php?op=openfolder&folder=191967

Final Projects:

Six Degree of Separation Finder

by Corey Schaninger and Lakshmi Thyagarajan

Project report here. Demo site here.

This group put together a web interface and backend system that searches for a path between any two actors in IMDB. There are 1.5 million actors connected by 1.6 million movies. The key challenge of this project was to make queries fast for the user given the sheer number of possible paths. The current system doesn't calculate the shortest path, but rather the path it can find fastest.

 

Future work includes live updating of the search as it tries to find the shortest path, allowing the user to disallow certain connections, and including other kinds of connections besides appearing in a movie together.

 

Example: Nicolas Cage -> Amos & Andrew (1993) -> Jeff Blumenkrantz -> Anastasia (1997) -> Meg Ryan

 

Queries can take between 1 and 183 seconds depending on the actor pair.
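
For the curious, here is a minimal sketch of the underlying idea: breadth-first search over a bipartite actor-movie graph. The toy data below is illustrative only, and this is not the group's actual implementation; note that a plain BFS returns a shortest path, a guarantee the deployed system traded away for speed.

    from collections import deque

    # Toy bipartite data; the real system builds this from IMDB.
    actor_movies = {
        "Nicolas Cage": {"Amos & Andrew (1993)"},
        "Jeff Blumenkrantz": {"Amos & Andrew (1993)", "Anastasia (1997)"},
        "Meg Ryan": {"Anastasia (1997)"},
    }
    movie_actors = {}
    for actor, movies in actor_movies.items():
        for movie in movies:
            movie_actors.setdefault(movie, set()).add(actor)

    def find_path(start, goal):
        # Breadth-first search: expand actor -> movie -> co-star layers.
        queue = deque([[start]])
        seen = {start}
        while queue:
            path = queue.popleft()
            for movie in actor_movies.get(path[-1], ()):
                for co_star in movie_actors[movie]:
                    if co_star == goal:
                        return path + [movie, co_star]
                    if co_star not in seen:
                        seen.add(co_star)
                        queue.append(path + [movie, co_star])
        return None

    print(" -> ".join(find_path("Nicolas Cage", "Meg Ryan")))

Run on the toy data, this reproduces the example path above.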

Bilingual Search Engine

by Even Cheng and Karthik Raman

Project report here. GUI not deployed to the web.

Let's say that you happen to know multiple languages. Why should your search engine give you results in just one? Instead, maybe it should give you the best results in any language that you know. That's the premise of this project.

 

Tell the computer the languages that you know, and it will translate your query into all of those languages and find the best results from its different language-specific indices, all ranked together. The key challenges for this project were managing translations and developing rankings on different corpora that could be compared.
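
As a rough sketch of the federated querying idea (the translate helper and per-language indices are hypothetical stand-ins, not the report's actual components):

    # Assumed helpers: translate(query, lang) returns a translated query
    # string; indices[lang].search(query) returns (doc, score) pairs with
    # scores already calibrated to be comparable across corpora.
    def bilingual_search(query, user_langs, indices, translate, k=10):
        merged = []
        for lang in user_langs:
            translated = translate(query, lang)    # e.g. "dog" -> "perro"
            for doc, score in indices[lang].search(translated):
                merged.append((score, lang, doc))
        merged.sort(reverse=True)                  # best results first, any language
        return merged[:k]

The sort-and-merge is the easy part; as noted above, making scores from different language-specific indices comparable is the hard part.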

 

Twitter Trending

by Guangqiang Li and Ye Wang 

Project report here. GUI not deployed to the web.

Real-time search is hot! It's the thrill of the moment to find out what's trending in the Twittersphere. But wait: why do I want to know everyone's trending terms? This project was about figuring out what is hot among just your friends. This group applied PageRank to Twitter to create TwitterRank. The key challenge for this project was determining how to create a graph from tweets, then using that graph to identify trending terms.
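
Here is a guess at the flavor of the approach, not the group's actual algorithm: build a directed graph from @-mentions in a toy set of tweets, then run PageRank-style iterations so heavily referenced nodes rise to the top (the same machinery could rank terms over a term co-occurrence graph).

    import re

    # Toy tweets: (author, text). Real input would be a friends-only stream.
    tweets = [("alice", "@bob check out #hadoop"),
              ("bob", "@carol #hadoop is neat"),
              ("carol", "@bob thanks!")]

    # Build a directed mention graph: author -> mentioned users.
    graph = {}
    for author, text in tweets:
        graph.setdefault(author, set())
        for mention in re.findall(r"@(\w+)", text):
            graph[author].add(mention)
            graph.setdefault(mention, set())

    # Standard power iteration with damping 0.85.
    rank = {n: 1.0 / len(graph) for n in graph}
    for _ in range(20):
        new = {n: 0.15 / len(graph) for n in graph}
        for n, outs in graph.items():
            targets = outs or graph.keys()      # dangling node: spread evenly
            for m in targets:
                new[m] += 0.85 * rank[n] / len(targets)
        rank = new
    print(sorted(rank.items(), key=lambda kv: -kv[1]))  # bob ranks highest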

 

PhotoLyrics

by Nadine Amsel and Nathanael Lenart

Project report here. GUI not deployed to the web.

Why can't music visualizers be fully awesome? Well, now they can be, with PhotoLyrics. Just upload your mp3 to the PhotoLyrics server and it will be sliced and diced to give you back a slideshow of pictures based on the lyrics of the song. The key challenge for this project was similar to the TwitterRank project: how do you determine what an important word is in the lyrics of a song? This project based it on each word's tf-idf score computed against Wikipedia.

The result is that when a user uploads "Eye of the Tiger", the system gets the info from the id3 tags in the file, gets the song's lyrics from the web, determines which words are the most interesting, and then shows a slideshow of images matching those words.
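
A minimal sketch of the word-scoring step, with a two-document toy corpus standing in for Wikipedia (which the project used for its document frequencies):

    import math

    # Toy corpus; the project derived document frequencies from Wikipedia.
    corpus = [["the", "eye", "of", "the", "tiger"],
              ["the", "thrill", "of", "the", "fight"]]
    N = len(corpus)
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1

    def tf_idf(term, doc):
        tf = doc.count(term) / len(doc)          # term frequency in the lyrics
        idf = math.log(N / df.get(term, 1))      # rarity across the corpus
        return tf * idf

    lyrics = corpus[0]
    scores = sorted(((tf_idf(t, lyrics), t) for t in set(lyrics)), reverse=True)
    print(scores)  # common words like "the" score 0; "eye"/"tiger" rank higher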

Future enhancements include making this into a plug-in for iTunes.

 

Code Snippet Search

by Phitchayaphong Tantikul and Hye Jung Choi

Project report here. GUI located here.

Although there is a lot of code sharing going on on the web right now, how many people actually search code search engines for things to cut and paste? I bet very few. So did the authors of this project, who focused on developers who search for code because they need a quick refresher on how to program with a particular data structure or need to remember how to connect to a socket. So this project is a search engine for code tutorials. The key challenge for this project was finding the pages that had both code and text in them, crawling them, parsing them, indexing them, and then presenting them to users in a clever way that showed both the code and a snippet talking about how to use it.

It's an information retrieval engine that I would use all the time!

Calculating the PageRank of Wikipedia in Hadoop

by Xiaozhi Yu

Project report here. GUI not deployed to the web.

This project delivered a really well-engineered implementation of PageRank on Hadoop. Some things that I learned from this project were how to iterate multiple MapReduce jobs from one uploaded jar, and I appreciated the elegance of the way that the output file from one iteration was fed directly into the next iteration without any processing. And since this runs on MapReduce, it naturally scales to really large datasets.
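
Here is a conceptual sketch of that iteration structure in plain Python (standing in for the Hadoop API): the reducer emits records in the same shape the mapper consumes, which is what lets one iteration's output feed the next without any intermediate processing. The toy graph assumes every page appears as a record with at least one out-link (no dangling-node handling).

    from collections import defaultdict

    def map_phase(records):          # record: (page, (rank, out_links))
        for page, (rank, links) in records:
            yield page, (0.0, links)                  # preserve link structure
            for link in links:
                yield link, (rank / len(links), [])   # emit rank contribution

    def reduce_phase(mapped, damping=0.85):
        grouped = defaultdict(lambda: [0.0, []])
        for page, (contrib, links) in mapped:
            grouped[page][0] += contrib               # sum incoming rank
            if links:
                grouped[page][1] = links              # recover link structure
        n = len(grouped)
        return [(p, ((1 - damping) / n + damping * r, links))
                for p, (r, links) in grouped.items()]

    records = [("a", (1/3, ["b"])), ("b", (1/3, ["a", "c"])), ("c", (1/3, ["a"]))]
    for _ in range(10):              # one MapReduce job per iteration
        records = reduce_phase(map_phase(records))
    print(sorted(records))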

Find Dense Clusters in the Web Graph

by Minh Doan, Ching-wei Huang, Siripen Pongpaichet

Project report here. GUI not deployed to the web.

This project was another great use of MapReduce. Instead of computing the PageRank of the pages in Wikipedia, however, these authors used PageRank to find dense clusters: sets of pages with dense interconnections among them. The algorithm could be applied to any graph, but on the web graph it will identify lots of self-referencing pages. This would be helpful for identifying link farms, spammers who all link to each other, or other unusual occurrences in the graph. In the case of Wikipedia it found collections of pages about the same thing: a set of pages about the mountains of France, and a set describing the Federalist Papers.
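
The group's PageRank-based method isn't reproduced here; as a simpler illustration of what "dense interconnection" means, this sketch extracts a k-core: repeatedly trim nodes with fewer than k neighbors until only a dense subgraph remains.

    def k_core(graph, k):
        """graph: {node: set(neighbors)}, undirected. Returns the k-core."""
        core = {n: set(nbrs) for n, nbrs in graph.items()}
        changed = True
        while changed:
            changed = False
            for n in [n for n in core if len(core[n]) < k]:
                for m in core.pop(n):        # remove n and its incident edges
                    if m in core:
                        core[m].discard(n)
                changed = True
        return core

    # A triangle plus one weakly attached node:
    graph = {"x": {"y", "z", "w"}, "y": {"x", "z"}, "z": {"x", "y"}, "w": {"x"}}
    print(k_core(graph, 2))  # drops "w", keeps the densely connected triangle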

Applying Gamma Coding to Posting List Storage

by Ali Bagherzandi and Kerim Yassin Oktay

Project report here. GUI located here. Video demo here.

This project took on the challenge of increasing the speed of our Wikipedia search engine while also increasing the set of words that could be considered in the cosine ranking. The key challenge was reducing the size of the posting list into something that could be quickly read from disk, as this was the main performance bottleneck. The answer was gamma coding, a technique that stores small numbers in fewer bits than larger numbers. Normally integers have an assumed upper bound, INT_MAX; gamma coding makes no such assumption and efficiently represents numbers with no upper bound. By using gamma coding, the posting list size was reduced and queries were much faster.
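
To make the technique concrete, here is a small sketch of Elias gamma coding over docID gaps, with strings of '0'/'1' standing in for real bit-level I/O:

    # Elias gamma code for n >= 1: floor(log2 n) zeros, then n in binary.
    # Small numbers get short codes, with no fixed upper bound like INT_MAX.
    def gamma_encode(n):
        assert n >= 1
        binary = bin(n)[2:]                       # e.g. 9 -> "1001"
        return "0" * (len(binary) - 1) + binary   # 9 -> "0001001"

    def gamma_decode(bits):
        """Decode a concatenation of gamma codes back into numbers."""
        numbers, i = [], 0
        while i < len(bits):
            zeros = 0
            while bits[i] == "0":                 # count the zero prefix
                zeros += 1
                i += 1
            numbers.append(int(bits[i:i + zeros + 1], 2))
            i += zeros + 1
        return numbers

    # Posting lists store gaps between sorted docIDs, which stay small:
    doc_ids = [3, 7, 12, 14]
    gaps = [3, 4, 5, 2]
    code = "".join(gamma_encode(g) for g in gaps)
    assert gamma_decode(code) == gaps
    print(code)  # 13 bits for four gaps, versus 4 fixed-width integers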