Source Code of Super-EGO ε-join

Last updated: 2/13/2013

Introduction

Super-EGO is a very fast in-memory algorithm that performs a well-studied similarity join operation known as ε-join. Given two d-dimensional datasets A and B and a parameter ε, the task of ε-join is to find all pairs of points (a,b), where a ∈ A and b ∈ B, such that the distance between them satisfies |a-b| ≤ ε. The distance |a-b| is most often the Euclidean distance or, more generally, an Lp norm. This operation is often employed in data mining and other areas for finding all pairs of similar objects, where objects are first mapped into their "feature" representation in d-dimensional space. The challenge is to perform ε-join efficiently, and Super-EGO achieves state-of-the-art results.

How to Cite

When using the Super-EGO code, please cite it as:

    Dmitri V. Kalashnikov. Super-EGO: Fast Multi-Dimensional Similarity Join. VLDB Journal, 2013.

The above publication describes Super-EGO in detail. A BibTeX entry for this publication is:

    @article{VLDBJ13::dvk,
      author  = {Dmitri V.\ Kalashnikov},
      title   = {Super-{EGO}: Fast Multi-Dimensional Similarity Join},
      journal = {VLDB Journal},
      year    = {2013}
    }

Downloading Code

Compiling Code

Running Code

    ./index eps A_sz B_sz skew num_thread

Options
Examples

1) Performing an ε-join on uniform datasets A and B, for ε = 0.1, where |A| = 68,000, |B| = 25,000, and the number of threads is 8. The dimensionality d of the generated data in A and B is determined by the NUM_DIM variable in the const.h file.

    ./index 0.1 68 25 0 8

2) Performing a self-join of a uniform dataset A, for ε = 0.2, where |A| = 45,000 and the number of threads is 4. The dimensionality d of the generated data in A is determined by the NUM_DIM variable in the const.h file.

    ./index 0.2 45 45 3 4

3) Performing a self-join of the real dataset ColorHist, whose cardinality is 68,000, for ε = 0.1 using 8 threads. In const.h, the variables NUM_DIM and DATA_FILE should be set to 32 and to the path of ColorHist.txt, respectively, and the code should be recompiled.

    ./index 0.1 68 68 2 8
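
Reference sketch: naive ε-join

To make the ε-join definition from the Introduction concrete, the following is a minimal nested-loop sketch in C++ using the Euclidean distance. It is not the Super-EGO algorithm and is not taken from the Super-EGO source code. The names Point, within_eps, and naive_eps_join are hypothetical and chosen here for illustration only. The sketch assumes ε ≥ 0 and that all points have the same dimensionality.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One d-dimensional point (hypothetical type, not from the Super-EGO code).
using Point = std::vector<double>;

// Returns true if the Euclidean distance |a-b| is at most eps (eps >= 0).
// Squared values are compared to avoid a sqrt, and the loop stops early
// once the partial sum already exceeds eps^2.
bool within_eps(const Point& a, const Point& b, double eps) {
    assert(a.size() == b.size());
    const double eps2 = eps * eps;
    double sum = 0.0;
    for (std::size_t k = 0; k < a.size(); ++k) {
        const double diff = a[k] - b[k];
        sum += diff * diff;
        if (sum > eps2) return false;
    }
    return true;
}

// Naive eps-join: checks every pair, so it costs O(|A| * |B| * d) time.
// Returns index pairs (i, j) such that |A[i] - B[j]| <= eps.
std::vector<std::pair<std::size_t, std::size_t>>
naive_eps_join(const std::vector<Point>& A,
               const std::vector<Point>& B,
               double eps) {
    std::vector<std::pair<std::size_t, std::size_t>> result;
    for (std::size_t i = 0; i < A.size(); ++i) {
        for (std::size_t j = 0; j < B.size(); ++j) {
            if (within_eps(A[i], B[j], eps)) {
                result.emplace_back(i, j);
            }
        }
    }
    return result;
}
```

Because it examines all |A| × |B| pairs, a baseline like this is only practical for small inputs, but it can serve as a correctness check when comparing join results on small datasets. For a self-join, the same dataset can be passed as both A and B; the result then also contains each point paired with itself.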