
FLASC: A Flare-Sensitive Clustering Algorithm (2311.15887v2)

Published 27 Nov 2023 in cs.LG and cs.DB

Abstract: Clustering algorithms are often used to find subpopulations in exploratory data analysis workflows. Not only the clusters themselves, but also their shape can represent meaningful subpopulations. In this paper, we present FLASC, an algorithm that detects branches within clusters to identify such subpopulations. FLASC builds upon HDBSCAN*, a state-of-the-art density-based clustering algorithm, and detects branches in a post-processing step that describes within-cluster connectivity. Two variants of the algorithm are presented, which trade computational cost for noise robustness. We show that both variants scale similarly to HDBSCAN* in terms of computational cost and provide stable outputs using synthetic data sets, resulting in an efficient flare-sensitive clustering algorithm. In addition, we demonstrate the benefit of branch-detection on two real-world data sets.
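To make the idea of a branch (or "flare") within a cluster concrete, here is a minimal toy sketch. It is not the FLASC algorithm: the abstract does not describe the procedure, so everything here is an illustrative assumption, including the hypothetical helper `count_flares`, the use of distance to the centroid as the eccentricity measure, and the fixed `radius` and `min_eccentricity` thresholds. It counts the connected groups of points that lie far from the centre of a single cluster that has already been found.

```python
import math


def count_flares(points, radius, min_eccentricity):
    """Count connected groups of points lying far from the cluster centroid.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n

    # Assumption: eccentricity is the Euclidean distance to the centroid.
    far = [p for p in points if math.dist(p, (cx, cy)) >= min_eccentricity]

    # Union-find over the far points; link pairs no further apart than `radius`.
    parent = list(range(len(far)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(far)):
        for j in range(i + 1, len(far)):
            if math.dist(far[i], far[j]) <= radius:
                parent[find(i)] = find(j)

    # Each remaining root is one connected group, i.e. one candidate flare.
    return len({find(i) for i in range(len(far))})


# A plus-shaped cluster: four arms meeting at the origin.
cross = [(float(i), 0.0) for i in range(-10, 11)]
cross += [(0.0, float(j)) for j in range(-10, 11) if j != 0]

n_flares = count_flares(cross, radius=1.5, min_eccentricity=3.0)
print(n_flares)
```

On this plus-shaped example, the points far from the centre fall into one group per arm, while a density-based clustering of the same points would report a single cluster. A fixed threshold like this is sensitive to noise and to the choice of scale, which is the kind of trade-off the abstract refers to when it mentions noise robustness.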

