Exploiting relationships for domain-independent data cleaning

Appeared in SIAM Data Mining (SDM) 2005 Conference

Dmitri V. Kalashnikov, Sharad Mehrotra, and Zhaoqi Chen
Computer Science Department

Abstract

In this paper we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose and traditional techniques is that our approach analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real datasets and over synthetic datasets show that the analysis of relationships significantly improves the quality of the result.

Categories and Subject Descriptors:
H.2.m [Database Management]: Miscellaneous - Data cleaning;
H.2.8 [Database Management]: Database Applications - Data mining;
H.2.5 [Information Systems]: Heterogeneous Databases;
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

Keywords: GDF, relationship-based data cleaning, reference disambiguation, record linkage, data mining, iterative data cleaning

Downloadable files:
Paper: SDM05_dvk.12page.pdf
Paper: TODS06 (extended version 1)
Paper: SDM05_TR (extended version 2)
Presentation: SDM05_dvk.ppt
Source Code: Code

BibTeX entry:

@inproceedings{SDM05::dvk,
  author    = {Dmitri V. Kalashnikov and Sharad Mehrotra and Zhaoqi Chen},
  title     = {Exploiting relationships for domain-independent data cleaning},
  booktitle = {SIAM International Conference on Data Mining (SIAM SDM)},
  year      = {2005},
  month     = {April 21--23},
  address   = {Newport Beach, CA, USA}
}