--------------------------------------------------------------------- Efficient Record Linkage The Flamingo Project on Data Cleansing Release v 0.1, December 1st, 2002 Copyright (c) 2002 by Database Group Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 Liang Jin liangj@ics.uci.edu Chen Li chenli@ics.uci.edu Michael Ortega-Binderberger miki@ics.uci.edu University of California, Irvine This software was created by members of the Database Group at UC Irvine and is distributed free of charge. It is placed in the public domain and permission is granted for anyone to use, duplicate, modify and redistribute it provided this notice is attached. There is absolutely NO WARRANTY OF ANY KIND with respect to this software; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY PARTY BE LIABLE TO ANYONE FOR ANY DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES. The package complies to the GNU copyright terms at http://www.gnu.org/copyleft/gpl.html. We include a copy of the terms in gpl.txt in this distribution package. It also complies to the copyright terms of University of California, Irvine. The information in this software is subject to change without notice and should not be construed as a commitment by any employees of the Database Group or any other employee of University of California. --------------------------------------------------------------------- As a courtesy, please send an email to the authors if you use this package. - Overview Please read the following paper to get the technical details of this package on efficient record linkage. "Efficient Record Linkage in Large Data Sets." Liang Jin, Chen Li, and Sharad Mehrotra. UCI ICS Technical Report, July 2002. This program takes in two sets of strings as input, and outputs the string pairs whose edit distance is within a given certain threshold, e.g., 2. - History of the package: o The R-tree code was obtained under the GNU license. The files are: index.cpp, index.h, node.cpp, rect.cpp, split_q.h, split_q.cpp. o Michael Ortega-Binderberger (miki@ics.uci.edu) added the R-tree search and join functions in those files. o Liang Jin (liangj@ics.uci.edu) and Chen Li (chenli@ics.uci.edu) added files fastmap.cpp and fastmap.h to implement the stringmap algorithm to map strings to a Euclidean space and do a join. o The files distance.cpp and distance.h were downloaded from Michael Gilleland (http://www.merriampark.com/ld.htm) to calculate the edit distance between two strings. We include the orignal README file of the R-tree code in our package as README-RTREE.txt . - File Structure: The software contains two major parts: StringMap and an R-tree join. o The files "fastmap.h" and "fastmap.cpp" implement the stringmap algorithm, which maps the strings into objects in a high-dimensional space. o The files "index.h" and "index.cpp" define and construct the R-tree structures for the two sets of mapped objects. The files also implement the join algorithm between the two R-trees. The "main()" function is in the test.cpp file. - Important Parameters: You should set the right parameters for this program in the file parameters.h. NUMDIMS 20 -- dimensionality of the mapped objects in the Euclidean space DELTA 6.5 -- threshold in the mapped Euclidean space REALDELTA 2 -- threshold in the old Edit Distance space SIZE 4000 -- total # of strings in the two datasets STR_SRC "source.txt" -- source file to contain the two datasets RESULTFILE "result.txt" -- file to contain the final result (id pairs) We provide a sample dataset within this distribution package. It consists of 4,000 names in the file "source.txt". If you use the default parameters in the file "parameters.h", then the goal is to join the first half (2,000) strings with the second half (2,000) strings, and find all the string pairs whose edit distance is within 2. Our approach first maps the strings into a 20-d Euclidean space. Two R-trees are constructed from the mapped objects. From earlier experiments we knew that a good threshold in the mapped space is 6.5. Thus we do a join between the two R-trees to find object pairs whose Euclidean distance is 6.5. - Compile and Run the Package: The package can compile and run under a Unix environment using the GNU g++ compiler. It can also run under the CYGWIN in Windows. Use the following commands to compile and run the package: prompt> make prompt> ./RTree_test Each string in the source file is assigned a unique ID starting from 0. As the program is running, the result id pairs will be stored in the file "result.txt". You can read the source file to locate the string pairs. We ran this package on a PC running cygwin under Windows 2000. The PC has a 1.4G AMD CPU, 512M Memory, and a 80G hard disk. In this setting, it took about 75 seconds to run the join for the sample dataset. Feel free to contact the authors if you have questions about this package.