Similarity Join for Low- and High-Dimensional Data.

Appeared in DASFAA 2003 Conference.


Dmitri V. Kalashnikov and Sunil Prabhakar

Department of Computer Sciences
Purdue University

Abstract

The efficient processing of similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper we develop two new spatial join algorithms, the Grid-join and EGO*-join, and study their performance in comparison to the state of the art algorithm, EGO-join, and the RSJ algorithm. Through evaluation we explore the domain of applicability of each algorithm and provide recommendations for the choice of join algorithm depending upon the dimensionality of the data as well as the critical epsilon parameter. We also point out the significance of the choice of this parameter for ensuring that the selectivity achieved is reasonable. The proposed EGO*-join algorithm always, often significantly, outperforms the EGO-join. For low-dimensional data the Grid-join outperform both the EGO- and EGO*- joins. An analysis of the cost of the Grid-join is presented and highly accurate cost estimator functions are developed. These are used to choose an appropriate grid size for optimal performance and can also be used by a query optimizer to compute the estimated cost of the Grid-join.


Keywords:

Similarity join, epsilon join, multidimensional join, high-dimensional join, selectivity, grid join, main memory join, EGO* join


Downloadable files:

Paper: DASFAA03_dvk.pdf
Paper (revised extended version): DASFAA03_dvk.ext.pdf


BibTeX entry:

@inproceedings{DASFAA03::dvk,
   author    = {Dmitri V. Kalashnikov and Sunil Prabhakar},
   title     = {Similarity Join for Low- and High- Dimensional Data},
   booktitle = {Proc. of Int'l Conf. on Database Systems for Advanced 
                Applications (DASFAA 2003), {IEEE Computer Society Press}},
   year      = {2003},
   month     = {March 26--28},
   address   = {Kyoto, Japan}
}
 

Back to Kalashnikov's homepage