Beyond Labels: Advancing Cluster Analysis with the Entropy of Distance Distribution (EDD) (2311.16621v1)
Abstract: In the evolving landscape of data science, the accurate quantification of clustering in high-dimensional data sets remains a significant challenge, especially in the absence of predefined labels. This paper introduces a novel approach, the Entropy of Distance Distribution (EDD), which represents a paradigm shift in label-free clustering analysis. Traditional methods, reliant on discrete labels, often struggle to discern intricate cluster patterns in unlabeled data. EDD, by contrast, leverages characteristic differences in the distribution of pairwise distances to discern clustering tendencies, independent of data labeling. Our method employs the Shannon information entropy to quantify the 'peakedness' or 'flatness' of the distance distribution of a data set. This entropy measure, normalized against its maximum value, effectively distinguishes strongly clustered data (indicated by pronounced peaks in the distance distribution) from more homogeneous, non-clustered data sets. This label-free quantification is invariant to global translations and permutations of the data points, and, with an additional dimension-wise z-scoring, also to scaling of the data set. We demonstrate the efficacy of EDD through a series of experiments on two-dimensional data spaces with Gaussian clusters. Our findings reveal a monotonic increase in the EDD value as cluster widths grow and the clusters move from well separated to overlapping, underscoring the method's sensitivity and accuracy in detecting varying degrees of clustering. EDD's potential extends beyond conventional clustering analysis, offering a robust, scalable tool for unraveling complex data structures without reliance on pre-assigned labels.
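The abstract does not spell out an implementation, but the described pipeline (dimension-wise z-scoring, pairwise distances, histogram, normalized Shannon entropy) is straightforward to sketch. The following Python snippet is a minimal illustration of that pipeline; the function name `edd`, the Euclidean metric, the bin count `n_bins`, and the use of the natural logarithm are illustrative assumptions, not choices fixed by the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist


def edd(points: np.ndarray, n_bins: int = 50) -> float:
    """Entropy of Distance Distribution, normalized to [0, 1].

    Low values indicate a peaked distance distribution (strongly
    clustered data); values near 1 indicate a flat distribution
    (homogeneous, non-clustered data).
    """
    # Dimension-wise z-scoring makes the measure invariant to
    # scaling of the data set.
    z = (points - points.mean(axis=0)) / points.std(axis=0)

    # All pairwise Euclidean distances. The distance distribution is
    # unchanged by global translations and permutations of the points.
    dists = pdist(z)

    # Discretize the distance distribution into a histogram.
    # (The bin count is an illustrative choice, not taken from the paper.)
    counts, _ = np.histogram(dists, bins=n_bins)
    p = counts / counts.sum()
    p = p[p > 0]  # drop empty bins so log(0) never occurs

    # Shannon entropy, normalized against its maximum value log(n_bins).
    return float(-np.sum(p * np.log(p)) / np.log(n_bins))


# Sanity check: two tight Gaussian clusters vs. a homogeneous cloud.
rng = np.random.default_rng(0)
clustered = np.vstack([rng.normal(-3.0, 0.2, (100, 2)),
                       rng.normal(+3.0, 0.2, (100, 2))])
uniform = rng.uniform(-3.0, 3.0, (200, 2))
print(edd(clustered))  # low: peaked distance distribution
print(edd(uniform))    # higher: flatter distance distribution
```

Consistent with the experiments described above, gradually widening the clusters (raising the 0.2 standard deviation toward the inter-cluster spacing) should increase the returned value monotonically as the two distance peaks merge into a flatter distribution.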