Source Code | Super-EGO ε-join

Source Code of Super-EGO ε-join
Last updated: 2/13/2013
Introduction

Super-EGO is a very fast in-memory algorithm that performs a well-studied similarity join operation known as ε-join. Given two d-dimensional datasets A and B and parameter ε, the task of ε-join is to find all pairs of points (a,b), where a ∈ A and b ∈ B, such that the distance between them |a-b| ≤ ε. Most often the Euclidean distance is used to compute the distance |a-b|, or distance is defined as Lp norm. This operation is often employed in data-mining and other areas for finding all pairs of similar objects, where objects are first mapped into their "feature" representation in d-dimensional space. The challenge is to perform ε-join efficiently, and Super-EGO achieves the state of the art results.

How to Cite
When using Super-EGO code, please cite it as:

Dmitri V. Kalashnikov. Super-EGO: Fast multi-dimensional similarity join. VLDB Journal, 2013 [Download Paper]

The above publication describes Super-EGO in detail. A BibTeX entry for this publications is:
@article{VLDBJ13::dvk, author = {Dmitri V.\ Kalashnikov}, title = {Super-{EGO}: Fast Multi-Dimensional Similarity Join}, journal = {VLDB Journal}, year = {2013} }

Downloading Code

Super-EGO code can de downloaded from here: [SuperEGO_code.zip] [License] [SuperEGO_data.zip]

Super-EGO is implemented in C++

The code is designed for UNIX in general

The code has been tested only under Mac OS X Ver 10.8.2

GCC 4.7 has been used to compile the code.

To install GCC 4.7 on a Mac, install macport. Then sudo port install gcc47

Code generated by GCC is faster than that by the default compiler.

Compiling Code

Unzip SuperEGO_code.zip file. The code is inside Super-EGO folder. The main file is test.cpp.

Unzip the sample dataset file. The resulting file is ColorHist.txt.

Edit const.h file to set the desired data dimensionality (NUM_DIM) and the path to the input file (DATA_FILE).

Edit ./mak batch file: change the path of GCC's C++ compiler (g++) to where it is located in your system.

To compile, run ./mak inside Super-EGO folder.

Compilation will produce executable file called index.

Running Code
./index eps A_sz B_sz skew num_thread
Options

eps The value of ε to use in the ε-join.

A_sz Cardinilaity |A| of dataset A, in thousands. E.g., specifying 68 corresponds to |A| = 68,000.

B_sz Cardinilaity |B| of dataset B, in thousands. E.g., specifying 25 corresponds to |B| = 25,000.

skew Parameter skew can be set to 0, 2, or 3 to mean:

0 - uniform: the code will generate uniform A and B d-dimensional datasets.
2 - from file: the code will load d-dimensional data from the file specified in const.h.
3 - uniform self-join: the code will generate uniform A, and then join it with itself.

num_thread The number of parallel threads of execution to use. The best number typically corresponds to the level of parallelism the machine has. It also can be determined experimentally, once per machine, by increasing it starting from 1 and observing for which value the performance is the best.

Examples
1) Performing an ε-join on uniform datasets A and B, for ε = 0.1 and where |A| = 68,000 and |B| = 25,000 and the number of threads is 8. The dimensionality d of the generated data in A and B will be determinied by NUM_DIM varible from const.h file.

./index 0.1 68 25 0 8

2) Performing an self-join of uniform datasets A, for ε = 0.2 and where |A| = 45,000 and the number of threads is 4. The dimensionality d of the generated data in A will be determinied by NUM_DIM varible from const.h file.

./index 0.2 45 45 3 4

3) Performing an self-join of a real dataset ColorHist, for ε = 0.1 whose cardiniality is 68,000 using 8 threads. In const.h, variables NUM_DIM and DATA_FILE should be set to 32 and to the path of ColorHist.txt, respectively and the code should be recompiled.

./index 0.1 68 68 2 8

Back to Kalashnikov's homepage

Copyright © 2013 Dmitri V. Kalashnikov. All Rights Reserved.