Computer Science 221: Information Retrieval

Materials

Fall 2008

Department of Informatics

Donald Bren School of Information and Computer Sciences

University of California, Irvine


Assignment Schedule:

Weekly Materials
M
T
W
R
F
Week 01

September 29

30
October 1
2
3

Web Search Basics

  1. Wikipedia entry on Vannevar Bush
  2. "As We May Think" The Atlantic Monthly, July, 1945. (reprinted in ACM CHI Interactions, March 1996)
  3. Textbook Chapter 19: Web Search Basics
  4. "Simple, Proven Approaches to Text Retrieval" by Robertson and Jones
    1. Commentary: "This paper provides a brief but well informed and technically accurate overview of the state of the art in text retrieval, at least up to 1997. It introduces the ideas of terms and matching, term weighting strategies, relevance weighting, a little on data structures and the evidence for their effectiveness. In my view it does an exemplary job of introducing the terminology of IR and the main issues in text retrieval for a numerate and technically well informed audience. It also has a very well chosen list of references."

Lecture 01

(Notes, Slides)

 

Lecture 02

(Notes, Slides)

Lecture 03

(Notes, Slides)

Assignment 01 is due

Week 02
6
7
8
9
10

Web Search Basics

(continued)

Lecture 04

(Slides)

 

Lecture 05

(Notes, Slides)

 

Lecture 06

(Notes, Discussion)

Week 03
13
14
15
16
17

Web Crawling and Indices

  1. Textbook Chapter 20 : Web Crawling and Indices
  2. "Stuff I’ve seen: A system for personal information retrieval and re-use" by (S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. Robbins, SIGIR, 2003)
    1. Commentary: "This paper addresses an increasingly important problem – how to search and manage personal collections of electronic information. ... it addresses an important user-centered problem. ...this paper presents a practical user interface to make the system useful. ..., the paper includes large scale, user-oriented testing that demonstrates the efficacy of the system. ..., the evaluation uses both quantitative and qualitative data to make its case. I think this paper is destined to be a classic because it may eventually define how people manage their files for a decade. Moreover, it is well-written and can serve as a good model for developers doing system design and evaluation, and for students learning about IR systems and evaluation."

Lecture 07

(Notes, Slides)

Lecture 08

(Slides)

 

Lecture 09

(Slides)

 

Week 04
20
21
22
23
24

Index Construction

  1. Textbook Chapter 4 : Index Construction
  2. "The WebGraph Framework I: Compression Techniques" by (P. Boldi and S. Vigna, WWW 2004)
    1. Abstract: "Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This paper presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link."
  3. "The Web As a Graph" by (R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, E. Upfal, PODS 2000)
    1. Abstract: "The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph has about a billion nodes today, several billion links, and appears to grow exponentially with time. There are many reasons—mathematical, sociological, and commercial—for studying the evolution of this graph. We first review a set of algorithms that operate on the Web graph, addressing problems from Web search, automatic community discovery, and classification. We then recall a number of measurements and properties of the Web graph. Noting that traditional random graph models do not explain these observations, we propose a new family of random graph models."

 

Lecture 10

(Slides)

Quiz 02

 

Lecture 11

(Slides)

Lecture 12

Web Crawling Follow-up

(Slides)

Web User Follow-up

(Slides)

 

Week 05
27
28
29
30
31

Querying, Scoring

  1. Textbook Chapter 1 : Boolean Retrieval
  2. Textbook Chapter 6 : Scoring, term weighting & the vector space model
  3. "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by (S. Brin and L. Page, WWW1998)
    1. Commentary: "This paper (and the work it reports) has had more impact on everyday life than any other in the IR area. A major contribution of the paper is the recognition that some relevant search results are greatly more valued by searchers than others. By reflecting this in their evaluation procedures, Brin and Page were able to see the true value of web-specific methods like anchor text. The paper presents a highly efficient, scalable implementation of a ranking method which now delivers very high quality results to a billion people over billions of pages at about 6,000 queries per second. It also hints at the technology which Google users now take for granted: spam rejection, high speed query-based summaries, source clustering, and context(location)-sensitive search. IR and bibliometrics researchers had done it all (relevance, proximity, link analysis, efficiency, scalability, summarization, evaluation) before 1998 but this paper showed how to make it work on the web. For any non-IR engineer attempting to build a web-based retrieval system from scratch, this must be the first port of call."

Lecture 13

(Slides)

Assignment 02 is due

Lecture 14

(Slides)

Lecture 15

(Slides)

Assignment 03 is due

Week 06
November 3
4
5
6
7

Scoring, Term Weighting and the Vector Space model

Lecture 16

(Slides)

Quiz 03

Lecture 17

(Slides)

Lecture 18

(Slides)

Mid-Term Evaluation Due

Week 07
10
11
12
13
14

 

Matrix decompositions and latent semantic indexing

  1. Textbook Chapter 18 : Matrix Decompositions and latent semantic indexing
  2. "Indexing by latent semantic analysis" by (Deerwester, Dumais, et al.)
    1. Commentary: "IR, as a field, hasn’t directly considered the issue of semantic knowledge representation. The above paper is one of the few that does in the following way. LSI is latent semantic analysis (LSA) applied to document retrieval. LSA is actually a variant of a growing ensemble of cognitively-motivated models referred to by the term “semantic space”. LSA has an encouraging track record of compatibility with human information processing across a variety of information processing tasks. LSA seems to capture the meaning of words in a way which accords with the representations we carry around in our heads. Finally, the above paper is often cited and interest in LSI seems to have increased markedly in recent years. The above paper has also made an impact outside our field. For example, recent work on latent semantic kernels (machine learning) draws heavily on LSI."

Lecture 19

(Slides)

Assignment 04 is due

 

Lecture 20

(Slides)

 

Lecture 21

(Slides)

 

 

Week 08
17
18
19
20
21

Link analysis

  1. Textbook Chapter 21 : Link Analysis
  2. "Authoritative sources in a hyperlinked environment" by (Kleinberg)
    1. Commentary: "Kleinberg’s work on hubs and authorities was a seminal paper in showing how the information inherent in the underlying network structure of the web could be exploited. Kleinberg bases his model on the authorities for a topic, and on hubs – pages that link to a large number of thematically related authorities. He observes that hubs are in equilibrium with, and confer authority on, the sites to which they link, that is, they have a mutually reinforcing relationship. This work was significant in providing an algorithmic approach to quantifying the quality of web pages, a key issue in the web environment where the massive size of the database, information redundancy and the uncertain quality and source of information make retrieval difficult."
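
The mutually reinforcing relationship between hubs and authorities described in the commentary can be illustrated with a short iterative computation. The following is a minimal sketch, not code from the course or from Kleinberg's paper: the function names `hits` and `normalize`, the fixed iteration count, and the toy link graph are all assumptions made for illustration. A page's authority score is taken as the sum of the hub scores of the pages linking to it, and its hub score as the sum of the authority scores of the pages it links to; both score vectors are rescaled to unit length after each round.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=50):
    """Return (hub, authority) scores for a link graph.

    `links` maps each page to the list of pages it links to
    (a hypothetical input format chosen for this sketch).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: add up the hub scores of all in-linking pages.
        new_auth = {page: 0.0 for page in pages}
        for src, targets in links.items():
            for dst in targets:
                new_auth[dst] += hub[src]
        # Hub update: add up the new authority scores of all linked-to pages.
        new_hub = {
            page: sum(new_auth[dst] for dst in links.get(page, []))
            for page in pages
        }
        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Toy graph (made up for this example): three hub-like pages, two targets.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
hub, auth = hits(toy_links)
for page in sorted(auth):
    print(page, round(hub[page], 3), round(auth[page], 3))
```

In this toy graph, `a1` ends up with a higher authority score than `a2` because more hubs point to it, and `h1` and `h2` score higher as hubs than `h3` because they point to both authorities. A fixed number of rounds is used here for simplicity; an implementation could instead stop when the scores change by less than a chosen tolerance.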

 

 

Lecture 22

(Slides)

 

Lecture 23

(Slides)

 

Lecture 24

(Slides)

(Demo Slides: Flash, PDF, QT)

Week 09
24
25
26
27
28

Evaluation in Information Retrieval

  1. Textbook Chapter 8 : Evaluation in Information Retrieval
  2. "A re-examination of relevance: Toward a dynamic, situational definition" by (Schamber, Eisenberg, Nilan)
    1. Commentary: "This landmark paper initiated the wave of relevance research to come during the next 13 years. It re-examined the literature made during 30 years, relying on the central works by Cuadra and Katter (1967), Rees and Schultz (1967), Cooper (1971), Wilson (1973), and Saracevic (1975). Essentially, the conclusions were as follows. (1) Relevance is a multidimensional cognitive concept. Its meaning is largely dependent on searchers’ perceptions of information and their own information need situations. (2) Relevance assessments have multidimensional characteristics; Relevance is a dynamic concept. It can take many meanings, such as topically adequate, usefulness, or satisfaction. But relevance is also dynamic as assessments of objects may change over time. (3) Relevance is a complex but systematic and measurable phenomenon – if approached conceptually and operationally from the searchers’ perspective. Schamber et al. [1990] stressed the importance of context and situation. They re-introduced the concept of “situational” relevance derived from Patrick Wilson’s concept in 1973, originating from Cooper (1971). Context may come from the information objects or knowledge sources in systems, but may also be part of the actual information-seeking situation. Two lines of relevance research very fast followed the suggestions and conclusions in this paper. One track pursued the theoretical developments of relevance types, criteria and measurements, thereby bridging over to laboratory IR evaluations. The other line of research consists of empirical studies involving searchers in realistic settings."

 

 

Lecture 25

Guest Lecture on Sourcerer Source Code Search Engine

(Slides)

 

Lecture 26

(Slides: PDF, QT)

Assignment 05 is due

(create a postings list)

 

 

 

Thanksgiving Holiday

Week 10
December 1
2
3
4
5

Evaluation in IR

(continued)

Lecture 27

Notes

Slides: PDF, QT

 

Lecture 28

Slides:

PDF, QT

Assignment 06 is due

(rapid cosine querying)

Lecture 29

Slides:

PDF, QT

Finals Week
8
9
10
11
12

Final Exam Slot

Google Tour

Meet at 10:30am at the University Club Parking lot.

Google from 11am - 1 pm

Final Project

Last Chance Deadline