---------------------------------------------------------------------
              Supporting Efficient Approximate String Searches
           The Flamingo Project on Data Cleansing
                Release v 1.1, 
                April 20, 2004

Copyright (c) 2004 by Database Group
Department of Information and Computer Science 
University of California, Irvine
Irvine, CA 92697
  
Liang Jin			liangj AT ics DOT uci DOT edu
Chen Li				chenli AT ics DOT uci DOT edu
Michael Ortega-Binderberger	miki AT ics DOT uci DOT edu
University of California, Irvine

This software was created by members of the Database Group at UC
Irvine and is distributed free of charge. It is placed in the public
domain and permission is granted for anyone to use, duplicate, modify
and redistribute it provided this notice is attached.
  
There is absolutely NO WARRANTY OF ANY KIND with respect to this
software; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY PARTY BE LIABLE
TO ANYONE FOR ANY DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE,
INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR
LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES.

The package complies to the GNU copyright terms at
http://www.gnu.org/copyleft/gpl.html. We include a copy of the terms
in gpl.txt in this distribution package. It also complies to the
copyright terms of University of California, Irvine.
  
The information in this software is subject to change without notice
and should not be construed as a commitment by any employees of the
Database Group or any other employee of University of California.
---------------------------------------------------------------------

As a courtesy, please send an email to the authors if you use this
package.

- Overview

  Please read the following paper to get the technical details of this
  package on efficient record linkage.

    "Efficient Record Linkage in Large Data Sets." Liang Jin, Chen Li,
    and Sharad Mehrotra. DASFAA, March 2003.

  This program takes in two sets of strings as input, and outputs the
  string pairs whose edit distance is within a given certain
  threshold, e.g., 2.

- History of the package:

  o The R-tree code was obtained under the GNU license.  The files
    are: index.cpp, index.h, node.cpp, rect.cpp, split_q.h,
    split_q.cpp, split_l.h, split_l.cpp.

  o Michael Ortega-Binderberger (miki@ics.uci.edu) added the R-tree
    search and join functions in those files.
  
  o Liang Jin (liangj@ics.uci.edu) and Chen Li (chenli@ics.uci.edu)
    added files fastmap.cpp and fastmap.h to implement the stringmap
    algorithm to map strings to a Euclidean space and do a join.

  o The files distance.cpp and distance.h were downloaded from Michael
    Gilleland (http://www.merriampark.com/ld.htm) to calculate the
    edit distance between two strings.

  We include the orignal README file of the R-tree code in our package
  as README-RTREE.txt .
  
- Major changes since the release v 0.1
  
  o The new program accepts two source files of strings, and performs
    a join on those two sources.
    
  o We added a sampling module to estimate the new threshold in the mapped
    Euclidean space, given the orignal edit distance threshold.
    
  o We removed an unused queue in the join phase, thus reduced the memory
    requirement.
    
  o The new program supports the case where we want find strings in the
    second set that are similar to a query string (in the first data set).

- File Structure:

  The software contains two major parts: StringMap and an R-tree join.

   o The files "fastmap.h" and "fastmap.cpp" implement the stringmap
     algorithm, which maps the strings into objects in a
     high-dimensional space.

   o The files "index.h" and "index.cpp" define and construct the
     R-tree structures for the two sets of mapped objects.  The files
     also implement the join algorithm between the two R-trees.

  The "main()" function is in the test.cpp file.
  

- Important Parameters:

  You should set the right parameters for this program in the file
  parameters.h.

  NUMDIMS 20 		   -- dimensionality of the mapped objects in the
                              Euclidean space 
  MAXLEN 100               -- the maximal length of the input strings                         
  REALDELTA 2		   -- threshold in the old Edit Distance space
  SIZE1  2000		   -- total # of strings in the first dataset
  SIZE2  2000              -- total # of strings in the second dataset
  STR_SRC1 "source1.txt"   -- source file to contain dataset 1
  STR_SRC2 "source2.txt"   -- source file to contain dataset 2
  RESULTFILE "result.txt"  -- file to contain the final result (id pairs)

  We provide a sample dataset within this distribution package. It
  consists of 2,000 names each in the files "source1.txt" and "source2.txt". 
  If you use the default parameters in the file "parameters.h", then the 
  goal is to join the first string dataset with the second one, and find 
  all the string pairs whose edit distance is within 2.  Our approach first 
  maps the strings into a 20-d Euclidean space. Two R-trees are constructed 
  from the mapped objects. And the program will sample from the two datasets 
  to get a new threshold to be used in the Euclidean space. Then we do a 
  join between the two R-trees to find object pairs whose Euclidean distance 
  is within the new threshold. If the user wants to run a single k-nn query,
  instead of a k-nn join, he can put the query string into one of the source
  files, say "source1.txt", and set the corresponding size, say SIZE1, as 1.
  

- Compile and Run the Package:

  The package can compile and run under a Unix environment using the
  GNU g++ compiler.  It can also run under the CYGWIN in Windows.  Use
  the following commands to compile and run the package:

     prompt> make
     prompt> ./RTree_test

  Each string in the source file is assigned a unique ID starting from
  0.  As the program is running, the result pairs will be stored in
  the file "result.txt". 

  We ran this package on a PC running cygwin under Windows 2000. The
  PC has a 1.4G AMD CPU, 512M Memory, and a 80G hard disk.  In this
  setting, it took about 75 seconds to run the join for the sample
  dataset.

Feel free to contact the authors if you have questions about this
package.