CS 262P - Text Processing and Information Retrieval

Syllabus

Michael T. Goodrich

Course Description. Techniques for text pattern matching and algorithms for the storage, retrieval, and classification of textual and multimedia data. Topics include string searching, string data structures, document and website retrieval, and search engines.

Course notes and lecture slides: https://www.ics.uci.edu/~goodrich/teach/cs262P/notes.html

Coursework. Coursework will consist of in-class quizzes and programming projects. The final grade is weighted 30% for quizzes and 70% for projects. The two lowest quiz scores will be dropped when computing the overall grade.
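As a rough illustration of the grading arithmetic, the following Python sketch computes a weighted course percentage. It is not an official grade calculator: the function name final_grade is hypothetical, and it assumes that every score is a percentage from 0 to 100, that all quizzes count equally, and that all projects count equally, none of which is stated in this syllabus.

```python
def final_grade(quiz_scores, project_scores, dropped=2):
    """Return a weighted course percentage (0-100).

    Assumptions (not stated in the syllabus): scores are percentages,
    quizzes are equally weighted, and projects are equally weighted.
    """
    # Drop the lowest `dropped` quiz scores, but only when more
    # quizzes than that were taken, so at least one score remains.
    if len(quiz_scores) > dropped:
        kept = sorted(quiz_scores)[dropped:]
    else:
        kept = list(quiz_scores)

    quiz_avg = sum(kept) / len(kept)
    project_avg = sum(project_scores) / len(project_scores)

    # 30% quizzes, 70% projects.
    return 0.30 * quiz_avg + 0.70 * project_avg


if __name__ == "__main__":
    # Quizzes 100, 90, 80, 50, 0: the 0 and the 50 are dropped.
    print(round(final_grade([100, 90, 80, 50, 0], [80, 90]), 2))
```

In this example the remaining quizzes average 90 and the projects average 85, so the result is about 0.30 × 90 + 0.70 × 85 = 86.5.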

Academic honesty policy. Collaboration on quizzes and projects is not allowed. Each quiz and project must be an individual effort. Working with others will be considered cheating. Likewise, the use of generative A.I. tools for any purpose other than to improve writing (e.g., spelling and grammar) is prohibited unless specifically allowed. In addition to the procedures of the ICS Cheating Policy, students caught cheating will be given a failing grade in the quiz or project in question.

Laptop policy. Open laptop computers are not allowed during in-class lectures, unless approved by the instructor or the Disabled Student Center (DSC). Laptop computers and smartphones may be used during in-class online quizzes, however.

Late policy. Late project assignments will be penalized 20% for each day they are late.
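One possible reading of the late penalty is sketched below in Python. The function name late_score is hypothetical, and the sketch assumes that lateness is counted in whole days, that each day removes 20% of the score the project would otherwise have earned, and that the score cannot fall below zero. The syllabus does not specify these details, so check with the instructor for the actual rule.

```python
def late_score(raw_score, days_late, penalty_per_day=20.0):
    """Apply a per-day percentage penalty to a project score.

    Assumptions (not stated in the syllabus): whole days only, the
    penalty is taken from the earned score, and the result is
    floored at zero.
    """
    # Fraction of the score that survives the penalty.
    factor = max(0.0, 1.0 - (penalty_per_day / 100.0) * days_late)
    return raw_score * factor


if __name__ == "__main__":
    # A project that earned 90 and was submitted two days late.
    print(round(late_score(90, 2), 2))
```

Under these assumptions a project that earned 90 and arrived two days late would keep 60% of its score, or about 54, and a project five or more days late would receive no credit.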

Recommended Textbooks:

  • Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, by Dan Gusfield
  • Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze

Tentative Schedule

  1. Applications and regular expression searching
  2. Finite state machines and regular expression matching
  3. Exact matching
  4. Wildcard matching, convolutions
  5. Suffix trees and arrays
  6. Edit distance algorithms
  7. Search engines
  8. Signature files and the vector space model
  9. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG)
  10. Invertible Bloom Lookup Tables (IBLTs) and cuckoo filters

Copyright © 2025 Michael T. Goodrich, as to all lectures and videos; all rights reserved. All other course content, including PowerPoint and PDF slides, assignments, and course notes, is offered according to the license for this course.