A Deterministic Information Bottleneck Method for Clustering Mixed-Type Data (2407.03389v2)

Published 3 Jul 2024 in stat.ME, cs.LG, and stat.ML

Abstract: In this paper, we present an information-theoretic method for clustering mixed-type data, that is, data consisting of both continuous and categorical variables. The proposed approach is built on the deterministic variant of the Information Bottleneck algorithm, designed to optimally compress data while preserving its relevant structural information. We evaluate the performance of our method against four well-established clustering techniques for mixed-type data -- KAMILA, K-Prototypes, Factor Analysis for Mixed Data with K-Means, and Partitioning Around Medoids using Gower's dissimilarity -- using both simulated and real-world datasets. The results highlight that the proposed approach offers a competitive alternative to traditional clustering techniques, particularly under specific conditions where heterogeneity in data poses significant challenges.
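The core of the approach described above is the deterministic variant of the Information Bottleneck: a hard clustering in which each point is assigned to the cluster that best trades off compression against preserved relevant information. As a rough illustration, the generic deterministic IB update (following Strouse and Schwab) assigns each item x to the cluster t maximizing log q(t) − β·KL(p(y|x) ‖ q(y|t)), then refreshes the cluster distributions. The sketch below shows only this generic update on a precomputed relevance distribution p(y|x); it is not the paper's method, and in particular the mixed-type handling (e.g. kernel density estimation of p(y|x) over continuous and categorical variables) is omitted. The function name and the deterministic initialization are illustrative choices, not taken from the paper.

```python
import numpy as np

def dib_cluster(p_y_given_x, p_x, n_clusters, beta, n_iter=50):
    """Generic deterministic Information Bottleneck (hard) clustering sketch.

    p_y_given_x : (n, Y) array, rows are relevance distributions p(y|x).
    p_x         : (n,) array, marginal weights p(x), summing to 1.
    Assigns each x to argmax_t [ log q(t) - beta * KL(p(y|x) || q(y|t)) ].
    """
    n, n_y = p_y_given_x.shape
    eps = 1e-12
    # Deterministic round-robin initialization (illustrative choice).
    assign = np.arange(n) % n_clusters
    for _ in range(n_iter):
        # Update step: cluster marginals q(t) and conditionals q(y|t).
        q_t = np.zeros(n_clusters)
        q_y_t = np.zeros((n_clusters, n_y))
        for t in range(n_clusters):
            mask = assign == t
            q_t[t] = p_x[mask].sum()
            if q_t[t] > 0:
                q_y_t[t] = (p_x[mask, None] * p_y_given_x[mask]).sum(0) / q_t[t]
        # Assignment step: hard-assign each x to its best cluster.
        kl = (p_y_given_x[:, None, :] *
              (np.log(p_y_given_x[:, None, :] + eps)
               - np.log(q_y_t[None, :, :] + eps))).sum(axis=-1)
        new_assign = np.argmax(np.log(q_t + eps)[None, :] - beta * kl, axis=1)
        if np.array_equal(new_assign, assign):
            break
        assign = new_assign
    return assign
```

On a toy input where the relevance distributions fall into two well-separated groups, the assignment converges in a few iterations; larger β favors preserving information about y over compression, as in the soft IB.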

References (18)
  1. The Information Bottleneck Method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, pages 368–377, 1999.
  2. The deterministic information bottleneck. Neural Computation, 29(6):1611–1630, 2017.
  3. Survey of state-of-the-art mixed data clustering algorithms. IEEE Access, 7:31883–31902, 2019.
  4. Distance-based clustering of mixed data. Wiley Interdisciplinary Reviews: Computational Statistics, 11(3):e1456, 2019.
  5. The information bottleneck and geometric clustering. Neural Computation, 31(3):596–612, 2019.
  6. Qi Li and Jeff Racine. Nonparametric estimation of distributions with categorical and continuous data. Journal of Multivariate Analysis, 86(2):266–292, 2003.
  7. Multivariate binary discrimination by the kernel method. Biometrika, 63(3):413–420, 1976.
  8. Cross-validation and the estimation of probability distributions with categorical data. Journal of Nonparametric Statistics, 18(1):69–100, 2006.
  9. A class of smooth estimators for discrete distributions. Biometrika, 68(1):301–309, 1981.
  10. Bernard W. Silverman. Density Estimation for Statistics and Data Analysis. Routledge, 1st edition, 1998.
  11. Benchmarking distance-based partitioning methods for mixed-type data. Advances in Data Analysis and Classification, 17(3):701–724, 2023.
  12. A semiparametric method for clustering mixed data. Machine Learning, 105:419–458, 2016.
  13. Zhexue Huang. Clustering large data sets with mixed numeric and categorical values. In Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pages 21–34. Citeseer, 1997.
  14. Finding Groups in Data: An Introduction to Cluster Analysis, chapter 2, pages 68–125. John Wiley & Sons, 1990.
  15. John C. Gower. A general coefficient of similarity and some of its properties. Biometrics, 27:857–871, 1971.
  16. A white paper on good research practices in benchmarking: The case of cluster analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 13(6):e1511, 2023.
  17. Comparing partitions. Journal of Classification, 2(2):193–218, 1985.
  18. UCI Machine Learning Repository, 2019. University of California, Irvine, School of Information and Computer Sciences.
