--------------------------------------------------------------------- Supporting Efficient Approximate String Searches The Flamingo Project on Data Cleansing Release v 2.0 May, 2006 Copyright (C) Chen Li (chenli@ics.uci.edu) Liang Jin Department of Computer Science University of California, Irvine, CA 92697, USA See "copyright.txt" for copyright information. There is absolutely NO WARRANTY OF ANY KIND with respect to this software; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY PARTY BE LIABLE TO ANYONE FOR ANY DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES. The package complies to the GNU copyright terms at http://www.gnu.org/copyleft/gpl.html. We include a copy of the terms in gpl.txt in this distribution package. It also complies to the copyright terms of University of California, Irvine. The information in this software is subject to change without notice and should not be construed as a commitment by any employees of the Database Group or any other employee of University of California. --------------------------------------------------------------------- Feel free to contact the authors if you have questions about this package. As a courtesy, please send an email to the authors if you use this package. - Overview Please read the following paper to get the technical details of this package on approximate string search: Efficient Record Linkage in Large Data Sets, Liang Jin, Chen Li, and Sharad Mehrotra, in the 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), 26 - 28 March, 2003, Kyoto, Japan This package supports fuzzy string searching, i.e., finding strings similar to a given a string or a collection of strings, where similarity is defined using edit distance. It supports both join queries and sing-string-search queries. The approach first maps strings into a high-dimensional Euclidean space. An R-tree is constructed from the mapped objects. The program will sample from the strings to get a new threshold to be used to do search in the Euclidean space. A similarity string search is converted to a spatial search in the new Euclidean space using R-trees. - To run the package: o make clean o make o date; time ./stringmap_unittest; date o After running it once, you can rerun it by using "./stringmap_unittest load" to save some time to do the mapping and construct the R-tree for the single-string-search case. o "stringmap_unittest.cpp" shows how to use StringMap. Some parameters can be tuned, such as the sampling percentages, which should depend on the data size. o Some R-tree parameters are defined in rtreeparams.h. e.g., the dimensionality of the Euclidean space is set to 20. This package includes three data files of names: - "source1.txt" and "source2.txt" are used for testing join queries - "source.txt" is used for testing single-string-search queries, which is a combination of source1.txt and source2.txt. - "source-big.txt" is used for testing single-string-search queries, which has more than 111K full names. We ran this package on a Linux platform with a 2.8GHZ Pentium 4 with 1GB RAM. The join on source1.txt and source2.txt (2000 x 2000 strings) took about 105 seconds (including the time to compute the recall, which was using nested loop and expensive), and with a recall of 98.9%. We ran 100 single-string-search queries on 100K name strings, and the average search time is 0.079 seconds, with the average recall 98.8%. - Major changes since the release in v1.1. o Chen Li restructured the stringmap files to make them more modualized. o Chen Li also added functions to support single-string-search queries. Have fun!