--------------------------------------------------------------------- SEPIA: Selectivity Estimation for approximate PredIcAtes The Flamingo Project on Data Cleansing Release v 1.0 September, 2006 Copyright (C) Chen Li (chenli@ics.uci.edu) Rares Vernica Department of Computer Science University of California, Irvine, CA 92697, USA See "copyright.txt" for copyright information. There is absolutely NO WARRANTY OF ANY KIND with respect to this software; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY PARTY BE LIABLE TO ANYONE FOR ANY DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE, INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES. The package complies to the GNU copyright terms at http://www.gnu.org/copyleft/gpl.html. We include a copy of the terms in gpl.txt in this distribution package. It also complies to the copyright terms of University of California, Irvine. The information in this software is subject to change without notice and should not be construed as a commitment by any employees of the Database Group or any other employee of University of California. --------------------------------------------------------------------- Feel free to contact the authors if you have questions about this package. As a courtesy, please send an email to the authors if you use this package. 1. Introduction Please read the following paper to get the technical details of this package on approximate string search: Liang Jin, Chen Li: Selectivity Estimation for Fuzzy String Predicates in Large Data Sets. VLDB 2005: 397-408 The goal of the package is that given an input dataset it estimates the error rate of the selectivity estimation process. The package is composed of multiple modules that build and handle a specific data structure. E.g., clustering module, PPD table module, error correction module. For each type of module there might be different implementations. Various parameters can be customized for each module. Some modules can be turned Off. 2. Requirements We tested the code under the following environment: Linux Gentoo GCC 3.4.2 Make 3.80 3. Structure SEPIA\ Cluster\ *.h *.cpp Makefile FTable\ *.h *.cpp Makefile FreqEst\ *.h *.cpp Makefile PPDTable\ *.h *.cpp Makefile SimFunc\ *.h *.cpp Makefile release\ 0\ in\ actors.lastname.100k.txt out\ clusters\ ppdtable\ query\ test\ *.h *.cpp Makefile Makefile.common Makefile.local readme.txt sepia_journal.pdf vldb06-sepia.pdf 4. Compile In order to compile we need to run "make" in the "SEPIA/" directory. This compiles all the cpp files in the "SEPIA/" directory and all subdirectories ("Cluster", "FTable", "FreqEst", "PPDTable", "SimFunc") and all their object files will be in the "SEPIA/" directory. Additionally, all the object files are linked in the "SEPIA" executable which will also be found in the "SEPIA/" directory. 5. Run In order to run the "SEPIA" executable we need to specific a few running time parameters. On a default compilation, these are: DATA_SIZE : size of the dataset. From the input dataset file, DATA_SIZE number of random elements are selected. E.g., 100000 0 - all the strings from the input file CLUSTER_NO : number of clusters to be build. E.g., 1000 SAMPLE_PERCENT : number of string to be sampled for building the PPD table. E.g., 20 SAMPLE_QUEUE : number of pivots to be chosen for each sample in order to generate PPD entries. E.g., 20 THR_MIN : minimum similarity threshold for the test predicates. E.g., 3 for Edit Distance similarity metric THR_MAX : maximum similarity threshold for the test predicates. E.g., 4 for Edit Distance similarity metric FIRST_TIME : 1 if this is the first time running the application with this settings, else 0. E.g., 1 5. Run Time Steps The default scenario run time is the following: 1. Read input dataset from file 2. Build clusters and write them to file 3. Build PPD table and write it to file 4. Build query set (strings to query for) and write it to file 5. Build predicates set (strings and similarity threshold) and it to file 6. Estimate selectivity for each of the predicates, compute the real selectivity, and compute the estimation error. 6. Compile Time Configuration The compile time configuration in done in the "Makefile" file. The "DEFS" section defines the similarity function, sampling method for PPD table, and the clustering method. Sim_Dist: 1: EditDist 2: JaccDist SampleMethod: 1: ALL_RAND 2: CLOSE_RAND 3: CLOSE_LEX 4: CLOSE_UNIQUE Cluster_Method: 1: Lexic 2: Medoids Ver 1 3: Medoids Ver 7 The "DEFS_SEPIA" section specifies is the data structures should be read, what other modules to activate, and what other information to print. READ_*: 0: build 1: read !First_Time: read only if not the "FIRST_TIME" is 0 DO_*: 0: disable module 1: activate module CLUST_* 0: suppress additional info about clusters 1: print additional info about the clusters