Mixture-Models: a one-stop Python Library for Model-based Clustering using various Mixture Models (2402.10229v1)

Published 8 Feb 2024 in stat.CO and cs.LG

Abstract: Mixture-Models is an open-source Python library for fitting Gaussian Mixture Models (GMMs) and their variants, such as Parsimonious GMMs, Mixtures of Factor Analyzers, MClust models, and Mixtures of Student's t-distributions. It streamlines the implementation and analysis of these models using first- and second-order optimization routines, such as Gradient Descent and Newton-CG, built on automatic differentiation (AD) tools. This makes it possible to extend these models to high-dimensional data, a capability that is the first of its kind among Python libraries. The library provides user-friendly model evaluation tools such as BIC, AIC, and log-likelihood estimation. The source code is licensed under the MIT license and is available at https://github.com/kasakh/Mixture-Models. The package is highly extensible, allowing users to incorporate new distributions and optimization techniques with ease. We conduct a large-scale simulation comparing various gradient-based approaches against Expectation Maximization across a wide range of settings and identify the best-suited approach for each setting.
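
The library's central idea, fitting mixture models by direct gradient-based optimization of the log-likelihood through automatic differentiation rather than EM, can be sketched in a few lines. The example below is illustrative only and does not use the Mixture-Models API; it fits a diagonal-covariance GMM by plain gradient ascent with the autograd package (one of the AD tools in this space) and computes BIC from the fitted log-likelihood. All function names and the diagonal-covariance simplification are choices made here for exposition.

```python
# Illustrative sketch only -- not the Mixture-Models API. It shows the idea the
# abstract describes: write the GMM log-likelihood in an AD framework (autograd)
# and maximize it directly with a first-order optimizer instead of running EM.
import numpy
import autograd.numpy as np
from autograd import grad


def unpack(params, K, D):
    """Split one unconstrained vector into mixing logits, means, and log std devs."""
    logits = params[:K]
    means = np.reshape(params[K:K + K * D], (K, D))
    log_sds = np.reshape(params[K + K * D:], (K, D))
    return logits, means, log_sds


def log_likelihood(params, X, K):
    """Log-likelihood of a K-component diagonal-covariance GMM on data X (N x D)."""
    N, D = X.shape
    logits, means, log_sds = unpack(params, K, D)
    log_pi = logits - np.log(np.sum(np.exp(logits)))    # softmax -> log mixing weights
    var = np.exp(2.0 * log_sds)                          # positive variances by construction
    # log N(x_n | mu_k, diag(var_k)) for every point/component pair, shape (N, K)
    sq = np.sum((X[:, None, :] - means) ** 2 / var, axis=2)
    log_comp = -0.5 * (sq + np.sum(2.0 * log_sds, axis=1) + D * np.log(2.0 * np.pi))
    weighted = log_pi + log_comp                         # broadcasts to (N, K)
    m = np.max(weighted, axis=1, keepdims=True)          # log-sum-exp for stability
    return np.sum(m) + np.sum(np.log(np.sum(np.exp(weighted - m), axis=1)))


def fit_diagonal_gmm(X, K, steps=2000, lr=1e-2, seed=0):
    """Plain gradient ascent; the library also offers Newton-CG and other routines."""
    N, D = X.shape
    rng = numpy.random.RandomState(seed)
    params = 0.1 * rng.randn(K + 2 * K * D)              # random unconstrained start
    grad_fn = grad(log_likelihood)                       # AD supplies the exact gradient
    for _ in range(steps):
        params = params + lr * grad_fn(params, X, K)
    ll = log_likelihood(params, X, K)
    n_free = (K - 1) + K * D + K * D                     # weights + means + variances
    bic = n_free * np.log(N) - 2.0 * ll
    return unpack(params, K, D), ll, bic
```

A typical use would call fit_diagonal_gmm(X, K) for several values of K and keep the fit with the lowest BIC, which is the kind of model-evaluation workflow the abstract refers to; the paper's simulations compare such gradient-based routines against EM to determine when each is preferable.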

