Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance Distribution (EDD) (2311.16621v1)

Published 28 Nov 2023 in stat.ML and cs.LG

Abstract: In the evolving landscape of data science, the accurate quantification of clustering in high-dimensional data sets remains a significant challenge, especially in the absence of predefined labels. This paper introduces a novel approach, the Entropy of Distance Distribution (EDD), which represents a paradigm shift in label-free clustering analysis. Traditional methods, reliant on discrete labels, often struggle to discern intricate cluster patterns in unlabeled data. EDD, however, leverages the characteristic differences in pairwise point-to-point distances to discern clustering tendencies, independent of data labeling. Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of distance distributions in a data set. This entropy measure, normalized against its maximum value, effectively distinguishes between strongly clustered data (indicated by pronounced peaks in distance distribution) and more homogeneous, non-clustered data sets. This label-free quantification is resilient against global translations and permutations of data points, and with an additional dimension-wise z-scoring, it becomes invariant to data set scaling. We demonstrate the efficacy of EDD through a series of experiments involving two-dimensional data spaces with Gaussian cluster centers. Our findings reveal a monotonic increase in the EDD value with the widening of cluster widths, moving from well-separated to overlapping clusters. This behavior underscores the method's sensitivity and accuracy in detecting varying degrees of clustering. EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures without reliance on pre-assigned labels.
