Exploiting relationships for object consolidation.

Appeared in ACM IQIS Workshop co-located with ACM SIGMOD 2005.


Zhaoqi Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra

Computer Science Department
University of California, Irvine
GDF project (http://www.ics.uci.edu/~dvk/GDF)

Abstract

Data mining practitioners frequently have to spend a significant portion of their project time on data preprocessing before they can apply their algorithms to real-world datasets. Such preprocessing is required because many real-world datasets are imperfect: they contain missing, erroneous, and duplicate data, along with other data quality problems. It is well established that, in general, if such problems are not corrected, applying data mining algorithms can lead to wrong results. This is known as the "garbage in, garbage out" principle. Given the significance of the problem, numerous data cleaning techniques have been designed to address it.

In this paper we address one of the data cleaning challenges, called object consolidation. This challenge arises because objects in datasets are frequently represented via descriptions (sets of instantiated attributes), which alone might not uniquely identify the object. The goal of object consolidation is to correctly group all the representations of the same object, for each object in the dataset. In contrast to traditional domain-independent data cleaning techniques, our approach analyzes not only object features but also additional semantic information, namely inter-object relationships. The approach views datasets as attributed relational graphs (ARGs) of object representations (nodes) connected via relationships (edges). It then applies graph partitioning techniques to accurately cluster object representations. Our empirical study over real datasets shows that analyzing relationships significantly improves the quality of the result.
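
As a rough illustration of the idea in the abstract (not the algorithm from the paper), the sketch below treats object representations as graph nodes and merges two of them only when their names look alike and they are also linked to a common related node. It swaps in a simple union-find grouping in place of the graph partitioning the paper applies. The function name `consolidate`, the `sim_threshold` parameter, and the sample data are all assumptions made for this example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def consolidate(names, edges, sim_threshold=0.8):
    """Group representation ids that look alike AND share a related node.

    names: dict mapping representation id -> name string (the "features").
    edges: iterable of (node, node) pairs describing relationships.
    This is a hypothetical simplification, not the paper's method.
    """
    # Build an undirected adjacency map over all nodes seen in the edges.
    neighbors = {rid: set() for rid in names}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Union-find over representation ids.
    parent = {rid: rid for rid in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        ratio = SequenceMatcher(None, names[a].lower(), names[b].lower()).ratio()
        similar = ratio >= sim_threshold              # feature evidence
        related = bool(neighbors[a] & neighbors[b])   # relationship evidence
        if similar and related:
            parent[find(a)] = find(b)

    groups = {}
    for rid in names:
        groups.setdefault(find(rid), set()).add(rid)
    return sorted(sorted(group) for group in groups.values())


# Three identical names: features alone cannot tell them apart,
# but only r1 and r2 are connected to the same affiliation node.
names = {"r1": "J. Smith", "r2": "J. Smith", "r3": "J. Smith"}
edges = [("r1", "MIT"), ("r2", "MIT"), ("r3", "UCI")]
print(consolidate(names, edges))  # [['r1', 'r2'], ['r3']]
```

A feature-only approach would merge all three representations here; requiring relationship evidence as well keeps `r3` separate, which is the intuition behind analyzing relationships in addition to features.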


Categories and Subject Descriptors:

H.2.m [Database Management]: Miscellaneous – Data cleaning;
H.2.8 [Database Management]: Database Applications – Data mining;
H.2.5 [Information Systems]: Heterogeneous Databases;
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval


Keywords:

GDF, relationship-based data cleaning, object consolidation, record linkage, data mining


Downloadable files:

Paper: IQIS05_dvk.pdf
Presentation: IQIS05_dvk.ppt

BibTeX entry:

@inproceedings{IQIS05::dvk,
   author    = {Zhaoqi Chen and Dmitri V. Kalashnikov and Sharad Mehrotra},
   title     = {Exploiting relationships for object consolidation},
   booktitle = {Proc. of International ACM SIGMOD Workshop 
                on Information Quality in Information Systems (ACM IQIS 2005)},
   year      = {2005},
   month     = {June 17}, 
   address   = {Baltimore, MD, USA}
}
