Theoretically-Efficient and Practical Parallel DBSCAN (1912.06255v4)

Published 12 Dec 2019 in cs.DS, cs.DB, cs.DC, and cs.LG

Abstract: The DBSCAN method for spatial clustering has received significant attention due to its applicability in a variety of data analysis tasks. There are fast sequential algorithms for DBSCAN in Euclidean space that take $O(n\log n)$ work for two dimensions, sub-quadratic work for three or more dimensions, and can be computed approximately in linear work for any constant number of dimensions. However, existing parallel DBSCAN algorithms require quadratic work in the worst case, making them inefficient for large datasets. This paper bridges the gap between theory and practice of parallel DBSCAN by presenting new parallel algorithms for Euclidean exact DBSCAN and approximate DBSCAN that match the work bounds of their sequential counterparts, and are highly parallel (polylogarithmic depth). We present implementations of our algorithms along with optimizations that improve their practical performance. We perform a comprehensive experimental evaluation of our algorithms on a variety of datasets and parameter settings. Our experiments on a 36-core machine with hyper-threading show that we outperform existing parallel DBSCAN implementations by up to several orders of magnitude, and achieve speedups by up to 33x over the best sequential algorithms.

Citations (51)

Summary

The paper introduces new parallel algorithms for exact and approximate DBSCAN that match the work bounds of the best sequential algorithms, have polylogarithmic depth, and run efficiently on multicore systems.
Experiments show the new algorithms achieve up to 33x speedup over sequential methods and outperform prior parallel implementations by orders of magnitude.
These efficient algorithms are suitable for large-scale data analysis in fields like transportation, astronomy, and biology, bridging the gap between theory and practice.

An Analysis of Theoretically-Efficient and Practical Parallel DBSCAN

The paper "Theoretically-Efficient and Practical Parallel DBSCAN" by Yiqiu Wang, Yan Gu, and Julian Shun addresses the inefficiency of existing parallel algorithms for density-based spatial clustering of applications with noise (DBSCAN), a widely adopted method for identifying clusters in spatial data that contains noise. Fast sequential algorithms exist for DBSCAN in Euclidean space, taking $O(n\log n)$ work in two dimensions and sub-quadratic work in three or more dimensions, but existing parallel algorithms require quadratic work in the worst case. This paper presents parallel algorithms for both exact and approximate DBSCAN that match the work bounds of the best sequential algorithms while also achieving polylogarithmic depth, which makes them efficient in theory and fast in practice.

Summary of Contributions

New Parallel Algorithms:
- The paper introduces parallel algorithms for 2D exact DBSCAN and for higher-dimensional exact and approximate DBSCAN. These algorithms match the work bounds of their sequential counterparts. They partition points using grid or box constructions and rely on data structures such as Delaunay triangulations and k-d trees.
Highly-Optimized Implementations:
- Implementations of the proposed algorithms are enhanced through optimizations that aim to improve practical runtimes, especially on modern multicore architectures.
Comprehensive Experimental Evaluation:
- In experiments on a 36-core machine with hyper-threading, the new implementations achieve speedups of up to 33x over the best sequential algorithms and outperform existing parallel implementations by up to several orders of magnitude across a variety of datasets and parameter settings.

Detailed Analysis

The paper analyzes its parallel algorithms in the work-depth model, with two goals. The first is work efficiency: the total number of operations (the work) should match that of the best sequential algorithm. The second is low depth: the longest chain of dependent operations should be polylogarithmic in the input size, so that the computation can make good use of many cores.
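As a point of reference, a standard result for the work-depth model (Brent's bound) shows why both goals matter. The symbols $W$ (work), $D$ (depth), $P$ (number of processors), and $T_P$ (running time on $P$ processors) are introduced here for illustration and are not notation taken from this summary. With a greedy scheduler,

$$T_P \le \frac{W}{P} + D.$$

Under this bound, an algorithm with $W = O(n\log n)$ and $D = O(\log^c n)$ for a constant $c$, as in the 2D exact case, has parallelism $W/D$ that grows nearly linearly in $n$, so the $W/P$ term dominates for any realistic core count. A quadratic-work algorithm, by contrast, pays $n^2/P$ no matter how small its depth is.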

2D Algorithm Innovations

For two-dimensional DBSCAN, the core innovations include:

Grid and Box Methods: Points are partitioned into grid cells or bounding boxes, allowing for efficient parallel processing without quadratic scaling.
Cell Graph Construction: Efficient algorithms are developed using Bichromatic Closest Pair (BCP) searches, Delaunay triangulations, and unit spherical emptiness checking (USEC) to ascertain cluster connectivity without fully quadratic overhead.
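The grid and cell graph steps above can be illustrated with a minimal sequential sketch. It assumes the usual grid-DBSCAN choice of cell side length eps / sqrt(d), so that any two points in the same cell are within eps of each other. The cell graph is built here by brute-force distance checks between nearby cells, which stands in for the BCP, Delaunay triangulation, and USEC methods and does not reproduce their work bounds. The function name `grid_dbscan` and the parameters `eps` and `min_pts` are hypothetical names chosen for this example, and the loops over cells are the parts a parallel version would distribute.

```python
import math
from collections import defaultdict
from itertools import product


def grid_dbscan(points, eps, min_pts):
    """Sequential sketch of grid-based exact DBSCAN; returns one label per point (-1 = noise)."""
    d = len(points[0])
    side = eps / math.sqrt(d)  # cell diagonal equals eps, so same-cell points are within eps
    cells = defaultdict(list)
    cell_of = []
    for i, p in enumerate(points):
        key = tuple(math.floor(x / side) for x in p)
        cells[key].append(i)
        cell_of.append(key)

    reach = math.ceil(math.sqrt(d))  # a point within eps is at most this many cells away per axis
    offsets = list(product(range(-reach, reach + 1), repeat=d))

    def nearby(cell):
        """Yield indices of all points in the cell and its neighboring cells."""
        for off in offsets:
            yield from cells.get(tuple(c + o for c, o in zip(cell, off)), ())

    def close(i, j):
        return math.dist(points[i], points[j]) <= eps

    # Step 1: mark core points. A cell with at least min_pts points is all core.
    core = [False] * len(points)
    for cell, members in cells.items():
        if len(members) >= min_pts:
            for i in members:
                core[i] = True
        else:
            for i in members:
                core[i] = sum(close(i, j) for j in nearby(cell)) >= min_pts

    # Step 2: cell graph. Union two cells if they hold core points within eps
    # (brute force here, in place of BCP / Delaunay / USEC).
    parent = {c: c for c, m in cells.items() if any(core[i] for i in m)}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for cell in list(parent):
        for i in cells[cell]:
            if core[i]:
                for j in nearby(cell):
                    if core[j] and close(i, j):
                        parent[find(cell)] = find(cell_of[j])

    # Step 3: label core points by connected component, then attach border points.
    ids = {}
    labels = [-1] * len(points)
    for i in range(len(points)):
        if core[i]:
            labels[i] = ids.setdefault(find(cell_of[i]), len(ids))
    for i in range(len(points)):
        if not core[i]:
            for j in nearby(cell_of[i]):
                if core[j] and close(i, j):
                    labels[i] = labels[j]  # a border point may touch several clusters; pick one
                    break
    return labels
```

Because every cell holding at least `min_pts` points is marked core without any distance computation, dense regions are cheap to process. Connectivity only has to be resolved between cells rather than between individual points, which is what lets the paper's methods avoid quadratic work.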

Higher Dimensional Adaptations

In higher dimensions, the number of neighboring cells that must be checked for each point grows with the dimension. The techniques used to cope with this include:

Quadtree Data Structures: Used to answer range queries efficiently during core point determination, which helps keep the work sub-quadratic rather than quadratic for a constant number of dimensions.
Parallel RPO Trees: Employed for efficient nearest neighbor queries in higher-dimensional spaces, so that the work remains sub-quadratic.

The proposed algorithms are also built from standard parallel primitives such as prefix sums, filtering, and hashing, each of which has low depth and therefore scales to large datasets.
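To make the role of these primitives concrete, here is a minimal sketch of filtering expressed through a prefix sum, the usual way a parallel filter is structured. It is written sequentially in Python for readability; the function names `exclusive_scan` and `pack` are hypothetical and are not taken from the paper's code. Each of the three phases consists of independent per-element operations or a scan, which is why a parallel version can run in linear work and logarithmic depth.

```python
def exclusive_scan(values):
    """Return (prefix, total), where prefix[i] is the sum of values[:i]."""
    prefix, total = [], 0
    for v in values:
        prefix.append(total)
        total += v
    return prefix, total


def pack(items, keep):
    """Filter items by the predicate keep, preserving order, via flags + scan + scatter."""
    # Phase 1: compute a 0/1 flag per element (independent per element).
    flags = [1 if keep(x) else 0 for x in items]
    # Phase 2: the scan gives each kept element its position in the output.
    offsets, total = exclusive_scan(flags)
    # Phase 3: scatter; kept elements write to distinct slots, so writes never conflict.
    out = [None] * total
    for i, x in enumerate(items):
        if flags[i]:
            out[offsets[i]] = x
    return out
```

In a DBSCAN setting, a call such as `pack(point_ids, lambda i: is_core[i])` (with `point_ids` and `is_core` as hypothetical inputs) would collect the core points into a dense array without any locking.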

Experimental Results

Experiments on a variety of synthetic and real-world datasets validate the practical performance of the proposed parallel DBSCAN algorithms. They consistently outperform prior parallel methods, with substantial speedups across a range of dataset densities and dimensionalities. The authors also process the largest dataset used for exact DBSCAN in the literature and outperform the state-of-the-art distributed RP-DBSCAN by at least a factor of 18 under comparable conditions.

Implications and Future Directions

These algorithms bridge the gap between theory and practice for parallel DBSCAN. By demonstrating work-efficient algorithms with polylogarithmic depth that are also fast in practice, the paper shows that they are suitable for large-scale data analysis in domains such as transportation, astronomy, and biology.

Future developments could focus on parallel adaptations for DBSCAN variants and hierarchical clustering methods, further optimizing these algorithms for large-scale distributed systems. The frameworks described can also inspire development in more complex machine learning and data processing tasks, leveraging advanced parallel computation architectures.
