UCI CS295: Information Quality and Entity Resolution
University of California, Irvine
Spring 2013

Course Personnel

Instructor: Dmitri V. Kalashnikov
email: dvk@ics.uci.edu
office: DBH 2072
office hours: By appointment

Meeting Times & Places

First meeting (Apr 2, 2013)
time: 5:00 pm – 6:20 pm
place: DBH 1431


After that, starting Apr 11, 2013, we will meet on Thursdays only
time: Thu 5:00 pm – 6:20 pm
place: DBH 1431
(Note: the class keeps the same location and time, but there are no classes on Tuesdays.)

Course Objectives

The effectiveness of data-driven technologies as tools for decision support, data exploration, and scientific discovery is closely tied to the quality of the data to which they are applied. It is well recognized that the outcome of an analysis is only as good as the data on which it is performed. That is why organizations today spend a tangible percentage of their budgets on cleaning tasks, such as removing duplicates, correcting errors, and filling in missing values, to improve data quality before pushing data through the analysis pipeline. Forrester Research has estimated that the market for data quality passed the $1 billion mark in 2008.

The objective of this course is to deepen our understanding of recent trends in information quality research; it is not a comprehensive data quality course. We will focus specifically, but not exclusively, on data management techniques for solving the entity resolution (ER) problem. The ER challenge arises because objects in the real world are referred to using references or descriptions that are not always unique identifiers of those objects, leading to ambiguity. This ambiguity must be resolved, or taken into account, when analyzing the data to produce meaningful results.

The course will be based on student presentations of prominent publications in the area of information quality. A list of publications to be covered in class is provided. Students will choose the publications they want to present from that pool and decide the dates of their presentations. Presenting papers from outside the pool is also encouraged, but please get approval from the instructor well in advance.

Prerequisites

Basic understanding of databases and machine learning.

Textbooks

There is no required textbook, and no single textbook covers the area comprehensively. Those interested in furthering their knowledge of the area might want to read:
  • Data Quality and Record Linkage Techniques. T. Herzog, F. Scheuren, W. Winkler. Springer 2007
  • Exploratory Data Mining and Data Cleaning. T. Dasu, T. Johnson. John Wiley 2003
  • Principles of Data Integration. AnHai Doan, Alon Halevy, Zachary Ives. Morgan Kaufmann 2012


Syllabus

Date    Topic    Presenter
Apr 2   Introduction [Slides]. Instructor
Apr 11  1. Joint entity resolution on multiple datasets. VLDB Journal 2013. Yasser
Apr 18  2. LINDA: Distributed Web-of-Data-Scale Entity Matching. CIKM 2012. Garrett
Apr 25  3. Pay-As-You-Go ER. TKDE 2013. Arun
May 2   4. Question Selection for Crowd Entity Resolution. PVLDB 2013. Yen
May 9   5. CrowdER: Crowdsourcing Entity Resolution. VLDB 2012. Tejas
May 16  6a. Record Matching over Query Results from Multiple Web Databases. TKDE 2012. Bharath
May 16  6b. KORE: Keyphrase Overlap Relatedness for Entity Disambiguation. CIKM 2012. Mehdi
May 23  7a. Large-Scale Collective Entity Matching. VLDB 2011. Sunil
May 30  8a. Load Balancing for MapReduce-based Entity Resolution. ICDE 2012. Karan
May 30  8b. On Active Learning of Record Matching Packages. SIGMOD 2010. Vahe
Jun 4   7b. Frameworks for entity matching: A comparison. Data and Knowledge Engineering 2010. Kartik
Jun 6   9a. Entity resolution with iterative blocking. SIGMOD 2009. Kaiser
Jun 6   9b. Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models. ACL HLT 2011. Mengfan



Tentative List of Publications

This list was created for your convenience. Feel free to present a paper that is not on this list, but first get the instructor's approval of the paper you choose. The paper must be on the course topic of entity resolution or data quality.

  • A Blocking Framework for Entity Resolution in Highly Heterogeneous Information Spaces. TKDE 2012
  • Reasoning about record matching rules. VLDB 2009.
  • Group linkage. ICDE 2007.
  • Parallel linkage. CIKM 2007.
  • Adaptive sorted neighborhood methods for efficient record linkage. JCDL 2007.


Midterm & Final Exam

None.

Grading Criteria

In assigning the final grade, the following factors will be considered:
  • (30%) Participation. Please do not sit quietly all the time!
  • (50%) Quality of presentations and slides.
  • (20%) Attendance.


Prominent Active IQ/ER Research Groups

Some prominent entity resolution & data quality research groups and projects:

© 2013 Dmitri V. Kalashnikov. All Rights Reserved.