---------------------------------------------------------------------
        SEPIA: Selectivity Estimation for approximate PredIcAtes
           The Flamingo Project on Data Cleansing
                Release v 1.0
                 September, 2006

      Copyright (C) Chen Li (chenli@ics.uci.edu)
                    Rares Vernica

              Department of Computer Science 
       University of California, Irvine, CA 92697, USA

      See "copyright.txt" for copyright information.

There is absolutely NO WARRANTY OF ANY KIND with respect to this
software; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE. IN NO EVENT WILL ANY PARTY BE LIABLE
TO ANYONE FOR ANY DAMAGES ARISING OUT OF THE USE OF THIS SOFTWARE,
INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR
LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES.

The package complies to the GNU copyright terms at
http://www.gnu.org/copyleft/gpl.html. We include a copy of the terms
in gpl.txt in this distribution package. It also complies to the
copyright terms of University of California, Irvine.
  
The information in this software is subject to change without notice
and should not be construed as a commitment by any employees of the
Database Group or any other employee of University of California.
---------------------------------------------------------------------

Feel free to contact the authors if you have questions about this package.
As a courtesy, please send an email to the authors if you use this
package.

1. Introduction

Please read the following paper to get the technical details of this
package on approximate string search:

  Liang Jin, Chen Li: Selectivity Estimation for Fuzzy String
  Predicates in Large Data Sets. VLDB 2005: 397-408

The goal of the package is that given an input dataset it estimates
the error rate of the selectivity estimation process.

The package is composed of multiple modules that build and handle a
specific data structure. E.g., clustering module, PPD table module,
error correction module. For each type of module there might be
different implementations. Various parameters can be customized for
each module. Some modules can be turned Off.

2. Requirements

We tested the code under the following environment:

Linux Gentoo
GCC 3.4.2
Make 3.80

3. Structure

SEPIA\
  Cluster\
    *.h
    *.cpp
    Makefile
  FTable\
    *.h
    *.cpp
    Makefile
  FreqEst\
    *.h
    *.cpp
    Makefile
  PPDTable\
    *.h
    *.cpp
    Makefile
  SimFunc\
    *.h
    *.cpp
    Makefile
  release\
    0\
      in\
        actors.lastname.100k.txt
      out\
        clusters\
        ppdtable\
        query\
        test\
  *.h
  *.cpp
  Makefile
  Makefile.common
  Makefile.local
  readme.txt
  sepia_journal.pdf
  vldb06-sepia.pdf

4. Compile

In order to compile we need to run "make" in the "SEPIA/"
directory. This compiles all the cpp files in the "SEPIA/" directory
and all subdirectories ("Cluster", "FTable", "FreqEst", "PPDTable",
"SimFunc") and all their object files will be in the "SEPIA/"
directory. Additionally, all the object files are linked in the "SEPIA"
executable which will also be found in the "SEPIA/" directory.

5. Run

In order to run the "SEPIA" executable we need to specific a few
running time parameters. On a default compilation, these are:
DATA_SIZE      : size of the dataset. From the input dataset file,
                 DATA_SIZE number of random elements are selected.
                 E.g., 100000
                       0 - all the strings from the input file
CLUSTER_NO     : number of clusters to be build.
                 E.g., 1000
SAMPLE_PERCENT : number of string to be sampled for building the PPD
                 table.
                 E.g., 20
SAMPLE_QUEUE   : number of pivots to be chosen for each sample in
                 order to generate PPD entries.
                 E.g., 20
THR_MIN        : minimum similarity threshold for the test predicates.
                 E.g., 3 for Edit Distance similarity metric
THR_MAX        : maximum similarity threshold for the test predicates.
                 E.g., 4 for Edit Distance similarity metric
FIRST_TIME     : 1 if this is the first time running the application
                 with this settings, else 0.
                 E.g., 1
                 
5. Run Time Steps

The default scenario run time is the following:

  1. Read input dataset from file
  2. Build clusters and write them to file
  3. Build PPD table and write it to file
  4. Build query set (strings to query for) and write it to file
  5. Build predicates set (strings and similarity threshold) and it to
     file
  6. Estimate selectivity for each of the predicates, compute the real
     selectivity, and compute the estimation error.

6. Compile Time Configuration

The compile time configuration in done in the "Makefile" file. The
"DEFS" section defines the similarity function, sampling method for
PPD table, and the clustering method.

Sim_Dist:       1: EditDist
                2: JaccDist

SampleMethod:   1: ALL_RAND
                2: CLOSE_RAND
                3: CLOSE_LEX
                4: CLOSE_UNIQUE

Cluster_Method: 1: Lexic
                2: Medoids Ver 1
                3: Medoids Ver 7

The "DEFS_SEPIA" section specifies is the data structures should be
read, what other modules to activate, and what other information to
print.

READ_*:  0: build
         1: read
         !First_Time: read only if not the "FIRST_TIME" is 0

DO_*:    0: disable module
         1: activate module

CLUST_*  0: suppress additional info about clusters
         1: print additional info about the clusters