Scalable Density-based Clustering with Random Projections (2402.15679v2)
Abstract: We present sDBSCAN, a scalable density-based clustering algorithm for high-dimensional data with cosine distance. Utilizing the neighborhood-preserving property of random projections, sDBSCAN can quickly identify core points and their neighborhoods, the primary hurdle of density-based clustering. Theoretically, sDBSCAN outputs a clustering structure similar to DBSCAN's under mild conditions with high probability. To further facilitate sDBSCAN, we present sOPTICS, a scalable OPTICS for interactive exploration of the intrinsic clustering structure. We also extend sDBSCAN and sOPTICS to L2, L1, $\chi^2$, and Jensen-Shannon distances via random kernel features. Empirically, sDBSCAN is significantly faster and more accurate than many other clustering algorithms on real-world million-point data sets. On these data sets, sDBSCAN and sOPTICS run in a few minutes, while their scikit-learn counterparts demand several hours or cannot run at all due to memory constraints.
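To make the core idea concrete, the following is a minimal toy sketch of the approach the abstract describes: random projections propose candidate neighbors under cosine distance (points with extreme projections onto the same random vector tend to be close in angle), the candidates are verified with exact distances, and a DBSCAN-style expansion runs on the resulting approximate neighborhoods. All function and parameter names here are illustrative assumptions, not the authors' implementation, and the theoretical guarantees of the paper are not reproduced by this simplification.

```python
import numpy as np

def sdbscan_sketch(X, eps=0.3, min_pts=5, n_projections=20, top_m=50, seed=0):
    """Illustrative sketch (NOT the authors' code): random projections
    propose neighbor candidates under cosine distance; DBSCAN-style
    expansion then runs on the verified approximate neighborhoods."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Normalize rows so cosine distance becomes 1 - dot product.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Random projection vectors: points with extreme projections onto the
    # same vector tend to be close in angle (neighborhood preservation).
    R = rng.standard_normal((n_projections, d))
    proj = Xn @ R.T                                  # shape (n, n_projections)

    # Candidate neighbors: points sharing a top-m / bottom-m projection bucket.
    cand = [set() for _ in range(n)]
    for j in range(n_projections):
        order = np.argsort(proj[:, j])
        for bucket in (order[:top_m], order[-top_m:]):
            members = bucket.tolist()
            for i in members:
                cand[i].update(members)

    # Verify candidates with exact cosine distance to get eps-neighborhoods.
    neighbors = []
    for i in range(n):
        c = np.fromiter(cand[i], dtype=int)
        dist = 1.0 - Xn[c] @ Xn[i] if len(c) else np.array([])
        neighbors.append(c[dist <= eps] if len(c) else c)

    # Standard DBSCAN labeling on the approximate neighborhoods.
    labels = np.full(n, -1)                          # -1 marks noise
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or len(neighbors[i]) < min_pts:
            continue                                 # not an unvisited core point
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            q = frontier.pop()
            if labels[q] == -1:
                labels[q] = cluster
                if len(neighbors[q]) >= min_pts:     # expand only from cores
                    frontier.extend(neighbors[q])
        cluster += 1
    return labels
```

Because candidates come only from projection buckets instead of an all-pairs scan, neighborhood identification avoids the quadratic distance computations that make exact DBSCAN expensive at million-point scale; the paper's actual algorithm adds the guarantees and engineering that make this idea fast and accurate in practice.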