--------------------------------------------------------------------- Supporting Efficient Approximate String Searches The Flamingo Project on Data Cleansing Release v 1.1, April 20, 2004 Copyright (c) 2004 by Database Group Department of Information and Computer Science University of California, Irvine Irvine, CA 92697 Liang Jin liangj AT ics DOT uci DOT edu Chen Li chenli AT ics DOT uci DOT edu Michael Ortega-Binderberger miki AT ics DOT uci DOT edu University of California, Irvine This software was created by members of the Database Group at UC Irvine and is distributed free of charge. It is placed in the public domain and permission is granted for anyone to use, duplicate, modify and redistribute it provided this notice is attached. There is absolutely NO WARRANTY OF ANY KIND with respect to this software; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY PARTY BE LIABLE TO ANYONE FOR ANY DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES. The package complies to the GNU copyright terms at http://www.gnu.org/copyleft/gpl.html. We include a copy of the terms in gpl.txt in this distribution package. It also complies to the copyright terms of University of California, Irvine. The information in this software is subject to change without notice and should not be construed as a commitment by any employees of the Database Group or any other employee of University of California. --------------------------------------------------------------------- As a courtesy, please send an email to the authors if you use this package. - Overview Please read the following paper to get the technical details of this package on efficient record linkage. "Efficient Record Linkage in Large Data Sets." Liang Jin, Chen Li, and Sharad Mehrotra. DASFAA, March 2003. This program takes in two sets of strings as input, and outputs the string pairs whose edit distance is within a given certain threshold, e.g., 2. - History of the package: o The R-tree code was obtained under the GNU license. The files are: index.cpp, index.h, node.cpp, rect.cpp, split_q.h, split_q.cpp, split_l.h, split_l.cpp. o Michael Ortega-Binderberger (miki@ics.uci.edu) added the R-tree search and join functions in those files. o Liang Jin (liangj@ics.uci.edu) and Chen Li (chenli@ics.uci.edu) added files fastmap.cpp and fastmap.h to implement the stringmap algorithm to map strings to a Euclidean space and do a join. o The files distance.cpp and distance.h were downloaded from Michael Gilleland (http://www.merriampark.com/ld.htm) to calculate the edit distance between two strings. We include the orignal README file of the R-tree code in our package as README-RTREE.txt . - Major changes since the release v 0.1 o The new program accepts two source files of strings, and performs a join on those two sources. o We added a sampling module to estimate the new threshold in the mapped Euclidean space, given the orignal edit distance threshold. o We removed an unused queue in the join phase, thus reduced the memory requirement. o The new program supports the case where we want find strings in the second set that are similar to a query string (in the first data set). - File Structure: The software contains two major parts: StringMap and an R-tree join. o The files "fastmap.h" and "fastmap.cpp" implement the stringmap algorithm, which maps the strings into objects in a high-dimensional space. o The files "index.h" and "index.cpp" define and construct the R-tree structures for the two sets of mapped objects. The files also implement the join algorithm between the two R-trees. The "main()" function is in the test.cpp file. - Important Parameters: You should set the right parameters for this program in the file parameters.h. NUMDIMS 20 -- dimensionality of the mapped objects in the Euclidean space MAXLEN 100 -- the maximal length of the input strings REALDELTA 2 -- threshold in the old Edit Distance space SIZE1 2000 -- total # of strings in the first dataset SIZE2 2000 -- total # of strings in the second dataset STR_SRC1 "source1.txt" -- source file to contain dataset 1 STR_SRC2 "source2.txt" -- source file to contain dataset 2 RESULTFILE "result.txt" -- file to contain the final result (id pairs) We provide a sample dataset within this distribution package. It consists of 2,000 names each in the files "source1.txt" and "source2.txt". If you use the default parameters in the file "parameters.h", then the goal is to join the first string dataset with the second one, and find all the string pairs whose edit distance is within 2. Our approach first maps the strings into a 20-d Euclidean space. Two R-trees are constructed from the mapped objects. And the program will sample from the two datasets to get a new threshold to be used in the Euclidean space. Then we do a join between the two R-trees to find object pairs whose Euclidean distance is within the new threshold. If the user wants to run a single k-nn query, instead of a k-nn join, he can put the query string into one of the source files, say "source1.txt", and set the corresponding size, say SIZE1, as 1. - Compile and Run the Package: The package can compile and run under a Unix environment using the GNU g++ compiler. It can also run under the CYGWIN in Windows. Use the following commands to compile and run the package: prompt> make prompt> ./RTree_test Each string in the source file is assigned a unique ID starting from 0. As the program is running, the result pairs will be stored in the file "result.txt". We ran this package on a PC running cygwin under Windows 2000. The PC has a 1.4G AMD CPU, 512M Memory, and a 80G hard disk. In this setting, it took about 75 seconds to run the join for the sample dataset. Feel free to contact the authors if you have questions about this package.