Domain-independent data cleaning via analysis of entity-relationship graph.

ACM TODS Journal, Vol. 31(2), June 2006


Dmitri V. Kalashnikov and Sharad Mehrotra

Computer Science Department
University of California, Irvine
GDF project (http://www.ics.uci.edu/~dvk/GDF)

Abstract

In this paper, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between our approach and traditional techniques is that ours analyzes not only object features but also inter-object relationships to improve disambiguation quality. Our extensive experiments on two real datasets and on synthetic datasets show that analyzing relationships significantly improves the quality of the result.
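
To make the idea in the abstract concrete, here is a minimal sketch of how relationships can break a tie that features alone cannot. It is not the algorithm from the paper: the toy graph, the path-based `connection_strength` measure (sum of 1/length over short simple paths), the `alpha` weighting, and all entity names are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher


def add_edge(graph, u, v):
    """Add an undirected relationship edge between two entities."""
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)


def connection_strength(graph, src, dst, max_len=4):
    """Illustrative measure (an assumption, not the paper's definition):
    sum of 1/length over all simple paths from src to dst
    that use at most max_len edges."""
    total = 0.0
    stack = [(src, (src,))]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt in path:
                continue  # keep paths simple (no repeated nodes)
            if nxt == dst:
                total += 1.0 / len(path)  # len(path) == edges on the full path
            elif len(path) < max_len:
                stack.append((nxt, path + (nxt,)))
    return total


def disambiguate(graph, description, context, candidates, alpha=0.5):
    """Pick the candidate entity with the best combined score:
    alpha * feature similarity + (1 - alpha) * connection strength
    between the candidate and the context in which the reference occurs."""
    def score(candidate):
        feature = SequenceMatcher(None, description.lower(), candidate.lower()).ratio()
        relation = connection_strength(graph, context, candidate)
        return alpha * feature + (1 - alpha) * relation
    return max(candidates, key=score)


# Hypothetical toy data: papers P1, P2, P9 and their authors.
graph = {}
add_edge(graph, "P1", "John Smith")
add_edge(graph, "P1", "A. Lee")
add_edge(graph, "P2", "Jane Smith")
add_edge(graph, "P2", "C. Diaz")
add_edge(graph, "P9", "A. Lee")  # P9 also lists an ambiguous author "J. Smith"

# Resolve the reference "J. Smith" that appears in paper P9.
resolved = disambiguate(graph, "J. Smith", "P9", ["Jane Smith", "John Smith"])
print(resolved)
```

In this toy setup the description "J. Smith" matches both candidates about equally well on features, so the choice is driven by the path P9, A. Lee, P1, John Smith: the candidate who already shares a co-author with the context is preferred.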


Categories and Subject Descriptors:

H.2.m [Database Management]: Miscellaneous - Data cleaning;
H.2.8 [Database Management]: Database Applications - Data mining;
H.2.5 [Database Management]: Heterogeneous Databases;
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval


Keywords:

Connection strength, data cleaning, entity resolution, graph analysis, reference disambiguation, relationship analysis, GDF.


Downloadable files:

Paper: TODS06_dvk.pdf
Paper: TODS06_append.pdf (electronic appendix)
Short version: SDM05_dvk
Source Code: Code

BibTeX entry:

@article{TODS06::dvk,
   author    = {Dmitri V.\ Kalashnikov and Sharad Mehrotra},
   title     = {Domain-independent data cleaning via analysis of entity-relationship graph},
   journal   = {{ACM Transactions on Database Systems (ACM TODS)}},
   volume    = 31,
   number    = 2,
   pages     = {716--767},
   month     = jun,
   year      = 2006
}

