"RelDC Engine" Source Code

Last updated: 2/14/2013

Introduction

RelDC is an entity resolution algorithm that leverages relationships for disambiguation. The traditional approach to entity resolution uses "features" associated with a reference/record/object to find references that co-refer, that is, refer to the same object. For some domains, features alone can be sufficient to achieve high quality. For others, features alone are not enough, and much higher disambiguation quality can be achieved by using additional sources of information.

In our larger Project SHERLOCK @ UCI, we have studied which other sources and types of information can be used, in addition to features, to better disambiguate among references. This information can be present in the dataset being cleaned itself (e.g., deeper context, long chains of inter-object relationships) or can be obtained from external data sources, including ontologies, encyclopedias, and the Web.

As part of Project SHERLOCK @ UCI, we have pioneered a novel entity resolution methodology called Relationship-Based Data Cleaning (RelDC). RelDC relies upon the observation that many real-world datasets are relational in nature and contain not only information about entities (and their "features") but also relationships among them, knowledge of which can be used to disambiguate among representations more effectively. RelDC is a principled, domain-independent framework that exploits the entity-relationship graph of the dataset, and specifically relationships, for high-quality entity resolution that is self-tuning and requires minimal intervention by analysts.

The source code below is what we internally call the "RelDC Engine". It is a bare-bones implementation of the iterative and most basic version of RelDC. The main purposes of the code are:

  • To let us build more advanced algorithms on top of it by extending the RelDC Engine code.
  • To show that using relationships only (the code largely ignores "features", except for blocking) can result in a major quality improvement for certain domains.
  • To show that a very efficient implementation of RelDC is possible. The RelDC algorithm contains a part that is difficult to scale: discovering all paths between many pairs of nodes in a large entity-relationship graph. One of the key values of the RelDC Engine code is that it contains algorithms that solve this problem efficiently. That is why we have used it as an "engine" to drive our other approaches.

Hence, the RelDC Engine code can be very useful to those who want to build other entity-resolution techniques that leverage relationships.
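
To illustrate why path discovery is the hard part, here is a minimal sketch of a naive approach: a depth-bounded depth-first search that enumerates all simple paths between two nodes. This is not the algorithm used in the RelDC Engine code. The names Graph, Path, add_edge, and simple_paths, as well as the adjacency-list representation and the max_edges bound, are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical adjacency-list graph: adj[u] lists the neighbors of node u.
using Graph = std::vector<std::vector<int>>;
using Path = std::vector<int>;

// Add an undirected edge (a relationship) between nodes u and v.
void add_edge(Graph& g, int u, int v) {
    g[u].push_back(v);
    g[v].push_back(u);
}

// Depth-first search that extends `current` (a simple path ending at u)
// and records every simple path that reaches `target`.
void dfs_paths(const Graph& g, int u, int target, std::size_t max_edges,
               std::vector<bool>& on_path, Path& current,
               std::vector<Path>& out) {
    if (u == target) {
        out.push_back(current);
        return;
    }
    // current holds (number of edges + 1) nodes; stop at the length bound.
    if (current.size() - 1 >= max_edges) return;
    for (int v : g[u]) {
        if (on_path[v]) continue;  // keep the path simple (no repeated nodes)
        on_path[v] = true;
        current.push_back(v);
        dfs_paths(g, v, target, max_edges, on_path, current, out);
        current.pop_back();
        on_path[v] = false;
    }
}

// Enumerate all simple paths from source to target with at most max_edges edges.
std::vector<Path> simple_paths(const Graph& g, int source, int target,
                               std::size_t max_edges) {
    std::vector<Path> out;
    std::vector<bool> on_path(g.size(), false);
    Path current{source};
    on_path[source] = true;
    dfs_paths(g, source, target, max_edges, on_path, current, out);
    return out;
}
```

Calling simple_paths(g, u, v, L) returns every simple path from u to v that has at most L edges. The number of such paths can grow exponentially with L, and repeating the search for many pairs of nodes multiplies that cost, which is why an efficient solution to this step matters.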

How to Cite

When using our "RelDC Engine" code please cite it as:

  1. Domain-independent data cleaning via analysis of entity-relationship graph.
    Dmitri V. Kalashnikov and Sharad Mehrotra.
    In ACM Transactions on Database Systems (ACM TODS), 31(2):716-767, June 2006
    [Download Paper]

  2. Exploiting relationships for domain-independent data cleaning.
    Dmitri V. Kalashnikov, Sharad Mehrotra, and Zhaoqi Chen.
    In Proc. of SIAM International Conference on Data Mining (SIAM Data Mining), April 21-23, 2005.
    [Download Paper]

The above publications describe the RelDC Engine in detail. BibTeX entries for these publications are:
@article{TODS06::dvk,
   author    = {Dmitri V.\ Kalashnikov and Sharad Mehrotra},
   title     = {Domain-independent data cleaning via 
                analysis of entity-relationship graph},
   journal   = {{ACM Transactions on Database Systems (ACM TODS)}},
   volume    = 31, number = 2, pages = {716--767}, month = jun, year = 2006
}
@inproceedings{SDM05::dvk,
   author    = {Dmitri V. Kalashnikov and Sharad Mehrotra and Zhaoqi Chen},                 
   title     = {Exploiting relationships for domain-independent data cleaning},
   booktitle = {SIAM International Conference on Data Mining (SIAM SDM)},
   year      = {2005}, month = {April 21--23}, address = {Newport Beach, CA, USA}
} 

Downloading Code

  • RelDC code can be downloaded from here: [RelDC_code.zip] [License] [RelDC_data.zip]
  • RelDC is implemented in C++
  • The code is designed for UNIX in general
  • The code has been tested under Solaris, Linux, and Mac OS X
  • GCC 4.7 has been used to compile the code.
  • Code generated by GCC is faster than code generated by the default compiler in Mac OS X. Please use the latest GCC to compile the code.

Compiling Code

  • Unzip the RelDC_code.zip file. The code will be inside the RelDC-Gen folder. The main file is main.cpp.
  • Unzip the sample datasets file (RelDC_data.zip). The resulting folder is datasets.
  • Edit the RelDC-Gen/config/Config.xml file to set the desired parameters for the program.
  • Edit the ./mak script: change the path of GCC's C++ compiler (g++) to its location on your system.
  • To compile, run ./mak inside the RelDC-Gen folder.
  • Compilation will produce an executable file called ./run.

Running Code

    ./run




    Copyright © 2013 Dmitri V. Kalashnikov. All Rights Reserved.