
Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification (2405.15132v2)

Published 24 May 2024 in stat.ML, cs.LG, math.ST, stat.CO, stat.ME, and stat.TH

Abstract: The Intrinsic Dimension (ID) is a key concept in unsupervised learning and feature selection, as it is a lower bound on the number of variables necessary to describe a system. However, in almost any real-world dataset the ID depends on the scale at which the data are analysed. Typically, at small scales the ID is very large because the data are affected by measurement errors. At large scales, the ID can also be erroneously large, owing to the curvature and topology of the manifold containing the data. In this work, we introduce an automatic protocol to select the sweet spot, namely the correct range of scales in which the ID is meaningful and useful. The protocol imposes that, for distances smaller than the correct scale, the density of the data is constant. In the presented framework, estimating the density requires knowing the ID, so this condition is imposed self-consistently. We derive theoretical guarantees and illustrate the usefulness and robustness of the procedure with benchmarks on artificial and real-world datasets.
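
The small-scale effect described in the abstract can be illustrated with a minimal sketch. It is not the protocol introduced in the paper: it swaps in a simple two-nearest-neighbour ratio estimator (maximum likelihood on the ratio of second- to first-neighbour distances) and applies it to a toy dataset. The helper names `two_nn_id` and `sample_plane`, the sample size, and the noise level are all assumptions chosen for illustration.

```python
import math
import random


def two_nn_id(points):
    """Estimate the ID from ratios of second- to first-neighbour distances.

    For locally uniform data the ratio mu = r2 / r1 follows a Pareto law
    with exponent equal to the ID, so the maximum-likelihood estimate is
    (number of points) / sum(log mu).
    """
    log_ratios = []
    for i, p in enumerate(points):
        # Two smallest distances from p to any other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)


def sample_plane(n, ambient_dim, noise_sigma, rng):
    """Sample n points on a 2D unit square embedded in ambient_dim
    dimensions, with isotropic Gaussian noise added to every coordinate
    (a toy stand-in for measurement error)."""
    points = []
    for _ in range(n):
        base = [rng.random(), rng.random()] + [0.0] * (ambient_dim - 2)
        points.append([x + rng.gauss(0.0, noise_sigma) for x in base])
    return points


rng = random.Random(0)
clean = sample_plane(500, 5, 0.0, rng)   # exact 2D manifold
noisy = sample_plane(500, 5, 0.1, rng)   # noise comparable to neighbour distances

id_clean = two_nn_id(clean)
id_noisy = two_nn_id(noisy)
print(f"ID estimate, clean data: {id_clean:.2f}")
print(f"ID estimate, noisy data: {id_noisy:.2f}")
```

On the clean sample the estimate should come out close to 2, while on the noisy sample it should be noticeably larger, because at the nearest-neighbour scale the noise makes the data look higher-dimensional. The sketch probes only one scale; choosing the range of scales automatically is what the paper's protocol addresses.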
