Weekly Materials |
M |
T |
W |
R |
F |
Week 01 |
|
30 |
October 1 |
2 |
3 |
Web Search Basics
- Wikipedia entry on Vannevar Bush
- "As We May Think" The Atlantic Monthly, July, 1945. (reprinted in ACM CHI Interactions, March 1996)
- Textbook Chapter 19: Web Search Basics
- "Simple, Proven Approaches to Text Retrieval" by Robertson and Jones
- Commentary: "This paper provides a brief but well informed and technically accurate overview of the state of the art in text retrieval, at least up to 1997. It introduces the ideas of terms and matching, term weighting strategies, relevance weighting, a little on data structures and the evidence for their effectiveness. In my view it does an exemplary job of introducing the terminology of IR and the main issues in text retrieval for a numerate and technically well informed audience. It also has a very well chosen list of references."
|
Lecture 01
(Notes, Slides)
|
|
Lecture 02
(Notes, Slides) |
|
Lecture 03
(Notes, Slides)
Assignment 01 is due |
Week 02 |
6 |
7 |
8 |
9 |
10 |
Web Search Basics
(continued) |
Lecture 04
(Slides)
|
|
Lecture 05
(Notes, Slides)
|
|
Lecture 06
(Notes, Discussion)
|
Week 03 |
13 |
14 |
15 |
16 |
17 |
Web Crawling and Indices
- Textbook Chapter 20: Web Crawling and Indices
- "Stuff I’ve Seen: A system for personal information retrieval and re-use" by (S. Dumais, E. Cutrell, J. Cadiz, G. Jancke, R. Sarin, and D. Robbins, SIGIR 2003)
- Commentary: "This paper addresses an increasingly important problem – how to search and manage personal collections of electronic information. ... it addresses an important user-centered problem. ...this paper presents a practical user interface to make the system useful. ..., the paper includes large scale, user-oriented testing that demonstrates the efficacy of the system. ..., the evaluation uses both quantitative and qualitative data to make its case. I think this paper is destined to be a classic because it may eventually define how people manage their files for a decade. Moreover, it is well-written and can serve as a good model for developers doing system design and evaluation, and for students learning about IR systems and evaluation."
|
Lecture 07
(Notes, Slides) |
|
Lecture 08
(Slides)
|
|
Lecture 09
(Slides)
|
Week 04 |
20 |
21 |
22 |
23 |
24 |
Index Construction
- Textbook Chapter 4: Index Construction
- "The WebGraph Framework I: Compression Techniques" by (P. Boldi and S. Vigna, WWW 2004)
- Abstract: "Studying web graphs is often difficult due to their large size. Recently, several proposals have been published about various techniques that allow to store a web graph in memory in a limited space, exploiting the inner redundancies of the web. The WebGraph framework is a suite of codes, algorithms and tools that aims at making it easy to manipulate large web graphs. This paper presents the compression techniques used in WebGraph, which are centred around referentiation and intervalisation (which in turn are dual to each other). WebGraph can compress the WebBase graph (118 Mnodes, 1 Glinks) in as little as 3.08 bits per link, and its transposed version in as little as 2.89 bits per link."
- "The Web As a Graph" by (R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal, PODS 2000)
- Abstract: "The pages and hyperlinks of the World-Wide Web may be viewed as nodes and edges in a directed graph. This graph has about a billion nodes today, several billion links, and appears to grow exponentially with time. There are many reasons—mathematical, sociological, and commercial—for studying the evolution of this graph. We first review a set of algorithms that operate on the Web graph, addressing problems from Web search, automatic community discovery, and classification. We then recall a number of measurements and properties of the Web graph. Noting that traditional random graph models do not explain these observations, we propose a new family of random graph models."
|
Lecture 10
(Slides)
|
|
Lecture 11
(Slides) |
|
Lecture 12
Web Crawling Follow-up
(Slides)
Web User Follow-up
(Slides)
|
Week 05 |
27 |
28 |
29 |
30 |
31 |
Querying, Scoring
- Textbook Chapter 1: Boolean Retrieval
- Textbook Chapter 6: Scoring, Term Weighting and the Vector Space Model
- "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by (S. Brin and L. Page, WWW 1998)
- Commentary: "This paper (and the work it reports) has had more impact on everyday life than any other in the IR area. A major contribution of the paper is the recognition that some relevant search results are greatly more valued by searchers than others. By reflecting this in their evaluation procedures, Brin and Page were able to see the true value of web-specific methods like anchor text. The paper presents a highly efficient, scalable implementation of a ranking method which now delivers very high quality results to a billion people over billions of pages at about 6,000 queries per second. It also hints at the technology which Google users now take for granted: spam rejection, high speed query-based summaries, source clustering, and context(location)-sensitive search. IR and bibliometrics researchers had done it all (relevance, proximity, link analysis, efficiency, scalability, summarization, evaluation) before 1998 but this paper showed how to make it work on the web. For any non-IR engineer attempting to build a web-based retrieval system from scratch, this must be the first port of call."
|
Lecture 13
(Slides)
|
|
Lecture 14
(Slides) |
|
Lecture 15
(Slides)
|
Week 06 |
November 3 |
4 |
5 |
6 |
7 |
Scoring, Term Weighting and the Vector Space Model |
Lecture 16
(Slides)
|
|
Lecture 17
(Slides) |
|
Lecture 18
(Slides)
|
Week 07 |
10 |
11 |
12 |
13 |
14 |
Matrix Decompositions and Latent Semantic Indexing
- Textbook Chapter 18: Matrix Decompositions and Latent Semantic Indexing
- "Indexing by latent semantic analysis" by (Deerwester, Dumais, et al., 1990)
- Commentary: "IR, as a field, hasn’t directly considered the issue of semantic knowledge representation. The above paper is one of the few that does in the following way. LSI is latent semantic analysis (LSA) applied to document retrieval. LSA is actually a variant of a growing ensemble of cognitively-motivated models referred to by the term “semantic space”. LSA has an encouraging track record of compatibility with human information processing across a variety of information processing tasks. LSA seems to capture the meaning of words in a way which accords with the representations we carry around in our heads. Finally, the above paper is often cited and interest in LSI seems to have increased markedly in recent years. The above paper has also made an impact outside our field. For example, recent work on latent semantic kernels (machine learning) draws heavily on LSI."
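A miniature of the mechanics behind LSI: factor the term-document matrix and represent documents by their coordinates along the leading singular directions. The dependency-free sketch below extracts only the top singular vector via power iteration; real LSI keeps a few hundred dimensions and uses a proper SVD routine, and the tiny corpus here is invented:

```python
import math

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_singular(A, iters=200):
    """Leading singular value and right singular vector of A,
    via power iteration on the doc-doc matrix A^T A."""
    At = list(map(list, zip(*A)))
    v = [1.0] * len(A[0])
    for _ in range(iters):
        w = matvec(At, matvec(A, v))
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    sigma = math.sqrt(sum(x * x for x in matvec(A, v)))
    return sigma, v

# Rows = terms (ship, boat, ocean, car, truck); columns = 3 documents.
A = [[1, 1, 0],   # ship:  docs 0 and 1
     [1, 0, 0],   # boat:  doc 0
     [0, 1, 0],   # ocean: doc 1
     [0, 0, 1],   # car:   doc 2
     [0, 0, 1]]   # truck: doc 2
sigma, v = top_singular(A)
# The leading "concept" is nautical: docs 0 and 1 get equal nonzero
# coordinates even though they share only one term, while doc 2 gets ~0.
```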
|
Lecture 19
(Slides)
|
|
Lecture 20
(Slides)
|
|
Lecture 21
(Slides)
|
Week 08 |
17 |
18 |
19 |
20 |
21 |
Link Analysis
- Textbook Chapter 21: Link Analysis
- "Authoritative sources in a hyperlinked environment" by (J. Kleinberg, JACM 1999)
- Commentary: "Kleinberg’s work on hubs and authorities was a seminal paper in showing how the information inherent in the underlying network structure of the web could be exploited. Kleinberg bases his model on the authorities for a topic, and on hubs – pages that link to a large number of thematically related authorities. He observes that hubs are in equilibrium with, and confer authority on, the sites to which they link, that is, they have a mutually reinforcing relationship. This work was significant in providing an algorithmic approach to quantifying the quality of web pages, a key issue in the web environment where the massive size of the database, information redundancy and the uncertain quality and source of information make retrieval difficult."
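The mutually reinforcing hub/authority relationship described above is computed by iterating two update rules until the scores stabilise. A bare-bones sketch on an invented four-page graph (fixed iteration count, no convergence test):

```python
import math

def hits(out_links, iters=50):
    """HITS: authority(p) = sum of hub scores of pages linking to p;
    hub(p) = sum of authority scores of pages p links to.
    Both score vectors are renormalised every round."""
    nodes = set(out_links) | {v for vs in out_links.values() for v in vs}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iters):
        auth = {n: 0.0 for n in nodes}
        for u, vs in out_links.items():
            for v in vs:
                auth[v] += hub[u]
        hub = {n: 0.0 for n in nodes}
        for u, vs in out_links.items():
            for v in vs:
                hub[u] += auth[v]
        for scores in (auth, hub):
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth

# Three hub pages all point at "classic"; only h1 also points at "niche".
hub, auth = hits({"h1": ["classic", "niche"],
                  "h2": ["classic"],
                  "h3": ["classic"]})
# "classic" ends up with the higher authority score, h1 with the top hub score.
```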
|
Lecture 22
(Slides)
|
|
Lecture 23
(Slides)
|
|
Lecture 24
(Slides)
(Demo Slides: Flash, PDF, QT) |
Week 09 |
24 |
25 |
26 |
27 |
28 |
Evaluation in Information Retrieval
- Textbook Chapter 8: Evaluation in Information Retrieval
- "A re-examination of relevance: Toward a dynamic, situational definition" by (Schamber, Eisenberg, and Nilan, 1990)
- Commentary: "This landmark paper initiated the wave of relevance research to come during the next 13 years. It re-examined the literature of the previous 30 years, relying on the central works by Cuadra and Katter (1967), Rees and Schultz (1967), Cooper (1971), Wilson (1973), and Saracevic (1975). Essentially, the conclusions were as follows. (1) Relevance is a multidimensional cognitive concept. Its meaning is largely dependent on searchers’ perceptions of information and their own information need situations. (2) Relevance assessments have multidimensional characteristics; relevance is a dynamic concept. It can take many meanings, such as topical adequacy, usefulness, or satisfaction. But relevance is also dynamic as assessments of objects may change over time. (3) Relevance is a complex but systematic and measurable phenomenon – if approached conceptually and operationally from the searchers’ perspective. Schamber et al. [1990] stressed the importance of context and situation. They re-introduced the concept of “situational” relevance derived from Patrick Wilson’s concept in 1973, originating from Cooper (1971). Context may come from the information objects or knowledge sources in systems, but may also be part of the actual information-seeking situation. Two lines of relevance research quickly followed the suggestions and conclusions in this paper. One track pursued the theoretical developments of relevance types, criteria and measurements, thereby bridging over to laboratory IR evaluations. The other line of research consists of empirical studies involving searchers in realistic settings."
|
Lecture 25
Guest Lecture on Sourcerer Source Code Search Engine
(Slides)
|
|
Lecture 26
(Slides: PDF, QT)
|
Thanksgiving Holiday
|
Week 10 |
December 1 |
2 |
3 |
4 |
5 |
Evaluation in IR
(continued) |
Lecture 27
Notes
Slides: PDF, QT
|
|
Lecture 28
Slides:
PDF, QT
|
|
Lecture 29
Slides:
PDF, QT
|
Finals Week |
8 |
9 |
10 |
11 |
12 |
|
|
|
Final Exam Slot
Google Tour
Meet at 10:30 am at the University Club parking lot.
Google visit from 11 am to 1 pm.
|
|
|