---------------------------------------------------------------------
        Supporting Efficient Approximate String Searches
           The Flamingo Project on Data Cleansing
                Release v 2.0
                 May, 2006

      Copyright (C) Chen Li (chenli@ics.uci.edu)
                    Liang Jin

              Department of Computer Science 
       University of California, Irvine, CA 92697, USA

      See "copyright.txt" for copyright information.

There is absolutely NO WARRANTY OF ANY KIND with respect to this
software; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY PARTY BE LIABLE
TO ANYONE FOR ANY DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE,
INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR
LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES.

The package complies to the GNU copyright terms at
http://www.gnu.org/copyleft/gpl.html. We include a copy of the terms
in gpl.txt in this distribution package. It also complies to the
copyright terms of University of California, Irvine.
  
The information in this software is subject to change without notice
and should not be construed as a commitment by any employees of the
Database Group or any other employee of University of California.
---------------------------------------------------------------------

Feel free to contact the authors if you have questions about this package.
As a courtesy, please send an email to the authors if you use this
package.

- Overview

  Please read the following paper to get the technical details of this
  package on approximate string search:

     Efficient Record Linkage in Large Data Sets, Liang Jin, Chen Li, and
     Sharad Mehrotra, in the 8th International Conference on Database
     Systems for Advanced Applications (DASFAA 2003), 26 - 28 March, 2003,
     Kyoto, Japan

  This package supports fuzzy string searching, i.e., finding strings
  similar to a given a string or a collection of strings, where similarity
  is defined using edit distance.  It supports both join queries and
  sing-string-search queries.  The approach first maps strings into a
  high-dimensional Euclidean space.  An R-tree is constructed from the
  mapped objects. The program will sample from the strings to get a new
  threshold to be used to do search in the Euclidean space.  A similarity
  string search is converted to a spatial search in the new Euclidean
  space using R-trees.

- To run the package:

  o make clean

  o make

  o date; time ./stringmap_unittest; date

  o After running it once, you can rerun it by using "./stringmap_unittest load"
    to save some time to do the mapping and construct the R-tree for the
    single-string-search case.

  o "stringmap_unittest.cpp" shows how to use StringMap.  Some parameters
    can be tuned, such as the sampling percentages, which should depend
    on the data size.
  o Some R-tree parameters are defined in rtreeparams.h. e.g., the
    dimensionality of the Euclidean space is set to 20.

  This package includes three data files of names:

    - "source1.txt" and "source2.txt" are used for testing join queries

    - "source.txt" is used for testing single-string-search queries, which
      is a combination of source1.txt and source2.txt.

    - "source-big.txt" is used for testing single-string-search queries,
      which has more than 111K full names.

  We ran this package on a Linux platform with a 2.8GHZ Pentium 4 with 1GB
  RAM. The join on source1.txt and source2.txt (2000 x 2000 strings) took
  about 105 seconds (including the time to compute the recall, which was
  using nested loop and expensive), and with a recall of 98.9%.  We ran
  100 single-string-search queries on 100K name strings, and the average
  search time is 0.079 seconds, with the average recall 98.8%.

- Major changes since the release in v1.1.

  o Chen Li restructured the stringmap files to make them more modualized.

  o Chen Li also added functions to support single-string-search queries.

Have fun!