Fast similarity join for multi-dimensional data.

Information Systems Journal, 32(1), March 2007.


Dmitri V. Kalashnikov and Sunil Prabhakar

Department of Computer Sciences
Purdue University

Abstract

The efficient processing of multi-dimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper, we develop two new in-memory spatial join algorithms, the Grid-join and EGO*-join, and study their performance. Through evaluation, we explore the domain of applicability of each approach and provide recommendations for the choice of a join algorithm depending upon the dimensionality of the data as well as the expected selectivity of the join. We show that the two new proposed join techniques substantially outperform the state of the art join algorithm, the EGO-join.


Keywords:

efficient similarity join, epsilon join, multi-dimensional join, spatial join, high-dimensional join, selectivity, grid join, main memory join, fast join


Remarks:

One of the algorithms covered in the paper, called EGO*-join, is so simple, that it can be implemented from scratch literally in one day. The paper demonstrates that that algorithm also outperforms the state of the art, when applied to high-dimensional data. Thus, EGO*-join is a good choice for an in-memory join algorithm in general, and, due to its simplicity, it is also a good choice for a baseline join algorithm.


Downloadable files:

Paper: IS07_dvk_join.pdf


BibTeX entry:

@article{IS07::dvkJ,
   author    = {Dmitri V.\ Kalashnikov and Sunil Prabhakar},
   title     = {Fast similarity join for multi-dimensional data},
   journal   = IS,
   volume    = 32,
   number    = 1,
   pages     = {160--177},
   month     = mar,    
   year      = 2007
}

Back to Kalashnikov's homepage