Weekly Materials |
M |
T |
W |
R |
F |
Week 01 |
|
30 |
October 1 |
2 |
3 |
Web Search Basics
- Wikipedia entry on Vannevar Bush
- "As We May Think" The Atlantic Monthly, July, 1945. (reprinted in ACM CHI Interactions, March 1996)
- Textbook Chapter 19: Web Search Basics
- "Simple, Proven Approaches to Text Retrieval" by Robertson and Jones
- Commentary: "This paper provides a brief but well informed and technically accurate overview of the state of the art in text retrieval, at least up to 1997. It introduces the ideas of terms and matching, term weighting strategies, relevance weighting, a little on data structures and the evidence for their effectiveness. In my view it does an exemplary job of introducing the terminology of IR and the main issues in text retrieval for a numerate and technically well informed audience. It also has a very well chosen list of references."
|
Lecture 01
(Notes, Slides)
|
|
Lecture 02
(Notes, Slides) |
|
Lecture 03
(Notes, Slides)
Assignment 01 is due |
Week 02 |
6 |
7 |
8 |
9 |
10 |
Web Search Basics
(continued) |
Lecture 04
(Slides)
|
|
Lecture 05
(Notes, Slides)
|
|
Lecture 06
(Notes, Discussion)
|
Week 03 |
13 |
14 |
15 |
16 |
17 |
Web Crawling and Indices
- Textbook Chapter 20: Web Crawling and Indices
- "Stuff I’ve Seen: A system for personal information retrieval and re-use" by (S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. Robbins, SIGIR 2003)
- Commentary: "This paper addresses an increasingly important problem – how to search and manage personal collections of electronic information. ... it addresses an important user-centered problem. ...this paper presents a practical user interface to make the system useful. ..., the paper includes large scale, user-oriented testing that demonstrates the efficacy of the system. ..., the evaluation uses both quantitative and qualitative data to make its case. I think this paper is destined to be a classic because it may eventually define how people manage their files for a decade. Moreover, it is well-written and can serve as a good model for developers doing system design and evaluation, and for students learning about IR systems and evaluation."
|
Lecture 07
(Notes, Slides) |
|
Lecture 08
(Slides)
|
|
Lecture 09
(Slides)
|
Week 04 |
20 |
21 |
22 |
23 |
24 |
Index Construction
- Textbook Chapter 4: Index Construction
- "The WebGraph Framework I: Compression Techniques" by (P. Boldi and S. Vigna, WWW 2004)
- Abstract: "Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This paper presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link."
- "The Web As a Graph" by (R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal, PODS 2000)
- Abstract: "The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph has about a billion nodes today, several billion links, and appears to grow exponentially with time. There are many reasons—mathematical, sociological, and commercial—for studying the evolution of this graph. We first review a set of algorithms that operate on the Web graph, addressing problems from Web search, automatic community discovery, and classification. We then recall a number of measurements and properties of the Web graph. Noting that traditional random graph models do not explain these observations, we propose a new family of random graph models."
|
Lecture 10
(Slides)
|
|
Lecture 11
(Slides) |
|
Lecture 12
Web Crawling Follow-up
(Slides)
Web User Follow-up
(Slides)
|
Week 05 |
27 |
28 |
29 |
30 |
31 |
Querying, Scoring
- Textbook Chapter 1: Boolean Retrieval
- Textbook Chapter 6: Scoring, Term Weighting and the Vector Space Model
- "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by (S. Brin and L. Page, WWW 1998)
- Commentary: "This paper (and the work it reports) has had more impact on everyday life than any other in the IR area. A major contribution of the paper is the recognition that some relevant search results are greatly more valued by searchers than others. By reflecting this in their evaluation procedures, Brin and Page were able to see the true value of web-specific methods like anchor text. The paper presents a highly efficient, scalable implementation of a ranking method which now delivers very high quality results to a billion people over billions of pages at about 6,000 queries per second. It also hints at the technology which Google users now take for granted: spam rejection, high speed query-based summaries, source clustering, and context(location)-sensitive search. IR and bibliometrics researchers had done it all (relevance, proximity, link analysis, efficiency, scalability, summarization, evaluation) before 1998 but this paper showed how to make it work on the web. For any non-IR engineer attempting to build a web-based retrieval system from scratch, this must be the first port of call."
|
Lecture 13
(Slides)
|
|
Lecture 14
(Slides) |
|
Lecture 15
(Slides)
|
Week 06 |
November 3 |
4 |
5 |
6 |
7 |
Scoring, Term Weighting and the Vector Space Model |
Lecture 16
(Slides)
|
|
Lecture 17
(Slides) |
|
Lecture 18
(Slides)
|
Week 07 |
10 |
11 |
12 |
13 |
14 |
Matrix Decompositions and Latent Semantic Indexing
- Textbook Chapter 18: Matrix Decompositions and Latent Semantic Indexing
- "Indexing by latent semantic analysis" by (Deerwester, Dumais, et al., 1990)
- Commentary: "IR, as a field, hasn’t directly considered the issue of semantic knowledge representation. The above paper is one of the few that does in the following way. LSI is latent semantic analysis (LSA) applied to document retrieval. LSA is actually a variant of a growing ensemble of cognitively-motivated models referred to by the term “semantic space”. LSA has an encouraging track record of compatibility with human information processing across a variety of information processing tasks. LSA seems to capture the meaning of words in a way which accords with the representations we carry around in our heads. Finally, the above paper is often cited and interest in LSI seems to have increased markedly in recent years. The above paper has also made an impact outside our field. For example, recent work on latent semantic kernels (machine learning) draws heavily on LSI."
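A miniature of the mechanics behind LSI: factor the term-document matrix and represent documents by their coordinates along the leading singular directions. The dependency-free sketch below extracts only the top singular vector via power iteration; real LSI keeps a few hundred dimensions and uses a proper SVD routine, and the tiny corpus here is invented:

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_singular(A, iters=200):
    """Leading singular value and right singular vector of A,
    via power iteration on the doc-doc matrix A^T A."""
    At = list(map(list, zip(*A)))
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in matvec(A, v)))
    return sigma, v

# Rows = terms (ship, boat, ocean, car, truck); columns = 3 documents.
A = [[1, 1, 0],   # ship:  docs 0 and 1
     [1, 0, 0],   # boat:  doc 0
     [0, 1, 0],   # ocean: doc 1
     [0, 0, 1],   # car:   doc 2
     [0, 0, 1]]   # truck: doc 2
sigma, v = top_singular(A)
# The leading "concept" is nautical: docs 0 and 1 get equal nonzero
# coordinates even though they share only one term, while doc 2 gets ~0.
```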
|
Lecture 19
(Slides)
|
|
Lecture 20
(Slides)
|
|
Lecture 21
(Slides)
|
Week 08 |
17 |
18 |
19 |
20 |
21 |
Link Analysis
- Textbook Chapter 21: Link Analysis
- "Authoritative sources in a hyperlinked environment" by (J. Kleinberg, JACM 1999)
- Commentary: "Kleinberg’s work on hubs and authorities was a seminal paper in showing how the information inherent in the underlying network structure of the web could be exploited. Kleinberg bases his model on the authorities for a topic, and on hubs – pages that link to a large number of thematically related authorities. He observes that hubs are in equilibrium with, and confer authority on, the sites to which they link, that is, they have a mutually reinforcing relationship. This work was significant in providing an algorithmic approach to quantifying the quality of web pages, a key issue in the web environment where the massive size of the database, information redundancy and the uncertain quality and source of information make retrieval difficult."
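The mutually reinforcing hub/authority relationship described above is computed by iterating two update rules until the scores stabilise. A bare-bones sketch on an invented four-page graph (fixed iteration count, no convergence test):

```python
import math

def hits(out_links, iters=50):
    """HITS: authority(p) = sum of hub scores of pages linking to p;
    hub(p) = sum of authority scores of pages p links to.
    Both score vectors are renormalised every round."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        auth = {n: 0.0 for n in nodes}
        for u, vs in out_links.items():
            for v in vs:
                auth[v] += hub[u]
        hub = {n: 0.0 for n in nodes}
        for u, vs in out_links.items():
            for v in vs:
                hub[u] += auth[v]
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Three hub pages all point at "classic"; only h1 also points at "niche".
hub, auth = hits({"h1": ["classic", "niche"],
                  "h2": ["classic"],
                  "h3": ["classic"]})
# "classic" ends up with the higher authority score, h1 with the top hub score.
```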
|
Lecture 22
(Slides)
|
|
Lecture 23
(Slides)
|
|
Lecture 24
(Slides)
(Demo Slides: Flash, PDF, QT) |
Week 09 |
24 |
25 |
26 |
27 |
28 |
Evaluation in Information Retrieval
- Textbook Chapter 8: Evaluation in Information Retrieval
- "A re-examination of relevance: Toward a dynamic, situational definition" by (Schamber, Eisenberg, and Nilan, 1990)
- Commentary: "This landmark paper initiated the wave of relevance research to come during the next 13 years. It re-examined the literature of the previous 30 years, relying on the central works by Cuadra and Katter (1967), Rees and Schultz (1967), Cooper (1971), Wilson (1973), and Saracevic (1975). Essentially, the conclusions were as follows. (1) Relevance is a multidimensional cognitive concept. Its meaning is largely dependent on searchers’ perceptions of information and their own information need situations. (2) Relevance assessments have multidimensional characteristics; relevance is a dynamic concept. It can take many meanings, such as topical adequacy, usefulness, or satisfaction. But relevance is also dynamic as assessments of objects may change over time. (3) Relevance is a complex but systematic and measurable phenomenon – if approached conceptually and operationally from the searchers’ perspective. Schamber et al. [1990] stressed the importance of context and situation. They re-introduced the concept of “situational” relevance derived from Patrick Wilson’s concept in 1973, originating from Cooper (1971). Context may come from the information objects or knowledge sources in systems, but may also be part of the actual information-seeking situation. Two lines of relevance research quickly followed the suggestions and conclusions in this paper. One track pursued the theoretical developments of relevance types, criteria and measurements, thereby bridging over to laboratory IR evaluations. The other line of research consists of empirical studies involving searchers in realistic settings."
|
Lecture 25
Guest Lecture on Sourcerer Source Code Search Engine
(Slides)
|
|
Lecture 26
(Slides: PDF, QT)
|
Thanksgiving Holiday
|
Week 10 |
December 1 |
2 |
3 |
4 |
5 |
Evaluation in IR
(continued) |
Lecture 27
Notes
Slides: PDF, QT
|
|
Lecture 28
Slides:
PDF, QT
|
|
Lecture 29
Slides:
PDF, QT
|
Finals Week |
8 |
9 |
10 |
11 |
12 |
|
|
|
Final Exam Slot
Google Tour
Meet at 10:30 am at the University Club parking lot.
Google visit from 11 am to 1 pm.
|
|
|