Intrinsic Dimension in Data Analysis
- Intrinsic Dimension (ID) is the minimum number of parameters required to represent high-dimensional data lying on a lower-dimensional manifold.
- Various estimation techniques, such as nearest-neighbor, fractal, and spectral methods, reveal the effective complexity and redundancy in data representations.
- Empirical studies demonstrate that lower intrinsic dimensions often correlate with improved model generalization and drive efficient feature selection and compression.
Intrinsic Dimension (ID) is the minimal number of degrees of freedom required to describe a set of high-dimensional data without significant information loss (Kataiwa et al., 4 Mar 2025) (Pope et al., 2021). While the extrinsic dimension (ED) is defined by the ambient coordinate space, ID captures the dimension of the underlying manifold or structure on which the data lie, quantifying the effective complexity and redundancy in data representations.
1. Mathematical Definition and Theoretical Foundations
Let $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^D$ denote a dataset in a high-dimensional ambient space. The intrinsic dimension is the smallest integer $d$ such that, locally, the data can be parametrized by $d$ real variables, i.e., they lie (approximately) on a $d$-dimensional manifold $\mathcal{M}$ embedded in $\mathbb{R}^D$ (Causin et al., 9 Jul 2025). Formally, $\mathrm{ID}(X) = \dim \mathcal{M}$, where each point of $\mathcal{M}$ admits a neighborhood homeomorphic to an open subset of $\mathbb{R}^d$.
Various approaches link ID to statistical and geometric properties:
- Manifold hypothesis: Data concentrate near a smooth manifold of low ID.
- Concentration of measure: High-dimensional point clouds often exhibit sharp regularities (e.g., linear separability, angular concentration) that inform ID estimators (Stubbemann et al., 2022) (Albergante et al., 2019) (Bac et al., 2020).
- Spectral methods (e.g., autoencoder Jacobians): The numerical rank of the Riemannian metric induced by a generative model reveals ID (Causin et al., 9 Jul 2025).
2. Key Estimation Methodologies
Several classes of estimators have emerged, each exploiting different geometric, statistical, or topological aspects of the data:
2.1 Nearest-Neighbor–Based Methods
Maximum-likelihood methods such as the Levina–Bickel MLE estimate the local ID at each data point $x$ using the geometry of its $k$ nearest neighbors:
$$\hat{d}_k(x) = \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_k(x)}{T_j(x)} \right]^{-1},$$
where $T_j(x)$ is the distance from $x$ to its $j$-th nearest neighbor. The global ID is then the harmonic mean of the local estimates over all points (Kataiwa et al., 4 Mar 2025) (Pope et al., 2021). Fast implementations leverage FAISS or KD-trees.
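A minimal sketch of this estimator, assuming Euclidean data in a NumPy array and using scikit-learn for the neighbor search (FAISS or a KD-tree would substitute at larger scale); the function name `mle_id` and the choice k=20 are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_id(X, k=20):
    """Levina-Bickel MLE of the intrinsic dimension (global, via harmonic mean)."""
    # Distances to the k nearest neighbors of every point; column 0 is the point itself.
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    # Local estimate at each point: inverse mean log-ratio of the k-th to the j-th NN distance.
    log_ratios = np.log(dist[:, -1][:, None] / dist[:, 1:-1])   # shape (n, k-1)
    local_id = (k - 1) / log_ratios.sum(axis=1)
    # Global ID as the harmonic mean of the local estimates, as described above.
    return 1.0 / np.mean(1.0 / local_id)

# Sanity check on synthetic data: a 2-D plane embedded in a 50-D ambient space.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 50))
print(mle_id(X))   # close to 2, despite the extrinsic dimension of 50
```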
2.2 Expansion-Rate (Pareto, TwoNN) Methods
The distribution of the ratio of second- to first-nearest-neighbor distances, $\mu_i = r_2(x_i)/r_1(x_i)$, follows a Pareto law under locally uniform sampling:
$$f(\mu) = d\,\mu^{-d-1}, \qquad \mu \ge 1.$$
MLE or Bayesian fits yield global or local ID estimates (Ansuini et al., 2019) (Allegra et al., 2019).
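A compact sketch of the TwoNN idea under the same assumptions (NumPy data, scikit-learn neighbor search); trimming the largest ratios follows the usual practice of discarding the tail most affected by density variation:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X, discard_fraction=0.1):
    """TwoNN estimate of the ID from second-to-first nearest-neighbor distance ratios."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = np.sort(dist[:, 2] / dist[:, 1])                  # r2 / r1 for every point
    mu = mu[: int(len(mu) * (1 - discard_fraction))]       # trim the heaviest tail
    # Under the Pareto law f(mu) = d * mu^(-d-1), the maximum-likelihood estimate is:
    return len(mu) / np.sum(np.log(mu))

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 50))  # same 2-D plane in R^50
print(twonn_id(X))   # again close to 2
```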
2.3 Angle and Concentration Methods
Separability and angle distribution analysis exploit concentration-of-measure effects in high dimensions. The probability that two random unit vectors in $\mathbb{R}^d$ have inner product exceeding a threshold $\alpha$ decays exponentially in the dimension,
$$P\big(\langle x, y\rangle \ge \alpha\big) \le e^{-d\alpha^{2}/2}.$$
Solving for $d$ from the observed fraction of such highly aligned (or inseparable) pairs gives the Fisher-separability–based estimator (Albergante et al., 2019) (Bac et al., 2020).
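The toy sketch below illustrates only the underlying concentration effect, not the full Fisher-separability estimator of (Albergante et al., 2019): for an isotropic d-dimensional cloud, the cosine between two centered, normalized samples has variance roughly 1/d, so the empirical variance of pairwise cosines yields a crude ID proxy.

```python
import numpy as np

def angle_concentration_id(X):
    """Crude ID proxy: pairwise cosines of an isotropic d-dimensional cloud have variance ~ 1/d."""
    U = X - X.mean(axis=0)
    U /= np.linalg.norm(U, axis=1, keepdims=True)          # project the centered cloud onto the sphere
    cosines = (U @ U.T)[np.triu_indices(len(U), k=1)]      # each pair once, diagonal excluded
    return 1.0 / np.var(cosines)

rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(100, 5)))         # orthonormal basis of a 5-D subspace of R^100
X = rng.normal(size=(1000, 5)) @ basis.T
print(angle_concentration_id(X))   # close to 5
```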
2.4 Fractal, Correlation, and Morisita Methods
Classical fractal methods (box-counting, Grassberger–Procaccia correlation dimension) fit the scaling law $C(r) \propto r^{d}$ of the fraction of point pairs within distance $r$. The Morisita estimator employs grid-based counting and multipoint scaling and remains robust under high sparsity (Golay et al., 2016).
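A short Grassberger–Procaccia sketch; the probe radii and percentile cut-offs are heuristic choices that would need tuning per dataset:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, n_scales=20):
    """Grassberger-Procaccia estimate: slope of log C(r) versus log r at small scales."""
    d = pdist(X)                                             # all pairwise distances
    radii = np.logspace(np.log10(np.percentile(d, 1)),
                        np.log10(np.percentile(d, 25)), n_scales)
    corr = np.array([(d < r).mean() for r in radii])         # correlation integral C(r)
    slope, _ = np.polyfit(np.log(radii), np.log(corr), 1)    # C(r) ~ r^d on a log-log plot
    return slope

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=2000)
X = np.c_[np.cos(theta), np.sin(theta)] @ rng.normal(size=(2, 10))   # closed curve in R^10
print(correlation_dimension(X))   # close to 1
```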
2.5 Topological and Connectivity-Based Estimators
Topological methods (eDCF) assess local geometric connectivity (e.g., on a quantized grid), matching observed neighbor counts to combinatorial signatures of various manifold dimensions (Gupta et al., 18 Oct 2025).
2.6 Singular Metric Analysis via Generative Models
Spectral analysis of the pullback metric $G(z) = J(z)^{\top} J(z)$, where $J(z)$ is the Jacobian of the decoder of a β-VAE, leads to ID estimation via the number of eigenvalues above a threshold (“spectral gap”) (Causin et al., 9 Jul 2025).
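A sketch of the spectral-gap computation, assuming a trained decoder is available as a PyTorch callable; the toy decoder, the latent size, and the relative threshold below are placeholders rather than the pipeline of (Causin et al., 9 Jul 2025):

```python
import torch

def spectral_gap_id(decoder, z, rel_threshold=1e-3):
    """Count eigenvalues of the pullback metric G(z) = J(z)^T J(z) above a relative threshold."""
    J = torch.autograd.functional.jacobian(decoder, z)   # decoder Jacobian, shape (D, k)
    G = J.T @ J                                           # pullback (Riemannian) metric at z
    eig = torch.linalg.eigvalsh(G)                        # real, non-negative, ascending
    return int((eig > rel_threshold * eig.max()).sum())

# Toy decoder from an 8-D latent space into R^100 that only uses its first 3 coordinates,
# so the manifold it parametrizes is 3-dimensional.
W = torch.randn(100, 3)
decoder = lambda z: torch.tanh(z[:3] @ W.T)
print(spectral_gap_id(decoder, torch.zeros(8)))   # 3
```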
2.7 Diffusion Model Rank Deficiency
In score-based diffusion, stacking the score vectors evaluated at noisy perturbations of a data point yields a matrix whose rank equals the dimension of the normal space spanned by the scores; the ID is the corresponding rank deficiency, i.e., the ambient dimension minus this rank (Roset et al., 14 Nov 2025).
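A schematic of the rank-deficiency idea, with a hand-built `toy_score` standing in for a trained score network; the perturbation scale, probe count, and gap-based rank rule are illustrative choices, not those of (Roset et al., 14 Nov 2025):

```python
import numpy as np

def score_rank_id(score_fn, x0, n_probes=256, sigma=1e-2):
    """ID = ambient dimension minus the rank of scores evaluated near x0:
    at small noise the scores (approximately) span the normal space of the manifold."""
    D = x0.shape[0]
    rng = np.random.default_rng(0)
    probes = x0 + sigma * rng.normal(size=(n_probes, D))
    S = np.stack([score_fn(x, sigma) for x in probes])        # shape (n_probes, D)
    sv = np.linalg.svd(S, compute_uv=False)
    rank = np.argmax(sv[:-1] / (sv[1:] + 1e-12)) + 1          # largest spectral gap
    return D - rank

# Toy example: data concentrated on the plane {x : x[2:] = 0} in R^5; the Gaussian-smoothed
# score points back along the 3 normal directions, so the estimated ID is 2.
def toy_score(x, sigma):
    s = np.zeros_like(x)
    s[2:] = -x[2:] / sigma**2
    return s

print(score_rank_id(toy_score, np.array([0.3, -1.2, 0.0, 0.0, 0.0])))   # 2
```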
3. Empirical Characterization in Machine Learning
Intrinsic dimension has practical consequences and is measurable in a broad range of domains:
- Natural images: Despite extrinsic pixel dimensions (e.g., 784 for MNIST, 150,528 for ImageNet), measured IDs range from 7–13 (MNIST) to 26–43 (ImageNet), validating the low-dimensional manifold hypothesis (Pope et al., 2021).
- Token embeddings in LLMs: Embedding spaces (ED = 128–5120) often have IDs 13–122, with >90% redundancy for large models—ID remains nearly constant as ED grows, producing “saturated” redundancy ratios (Kataiwa et al., 4 Mar 2025).
- Deep neural networks: Across layers, ID typically “expands” to a peak and then compresses in deeper layers (the “humpback” profile). Final-layer ID is an empirical predictor of test-set accuracy, with lower ID correlating with better generalization (Konz et al., 15 Aug 2024) (Ansuini et al., 2019).
- Domain dependency: Medical imaging models reach their peak ID much earlier in the network than natural-image models, and the peak hidden-layer ID is roughly proportional to the ID of the raw input data, indicating that the complexity of learned representations is driven by the data (Konz et al., 15 Aug 2024).
- Molecular dynamics: Time-, space-, and state-resolved ID analysis of macromolecular simulations reveals distinct manifold complexity in folded and unfolded protein states, often more sharply than traditional geometric order parameters (Cazzaniga et al., 17 Nov 2025).
4. Practical Applications: Compression, Adaptation, and Feature Selection
- Model compression/adaptation: Low-rank adaptation methods such as LoRA benefit from ID-guided rank selection: setting the adapter rank at or above the estimated ID avoids catastrophic performance degradation in LLM fine-tuning, while marginal gains saturate for ranks well above the ID (Kataiwa et al., 4 Mar 2025); see the sketch after this list.
- Feature selection: The Morisita estimator drives greedy filter selection in the MBRM algorithm, identifying minimal feature subsets that span the intrinsic manifold, with empirical reduction of dimensionality by 50–75% and no loss in downstream accuracy (Golay et al., 2016).
- Outlier/anomaly detection: Local ID heterogeneity serves as an unsupervised indicator of structural transitions, e.g., in molecular folding or manifold clustering (Allegra et al., 2019) (Cazzaniga et al., 17 Nov 2025).
- Scalable geometric learning: Axiomatic, concentration-based IDs (Pestov) quantify discriminability and flattening across neighborhood aggregations in large graphs, informing depth and expressiveness in GNNs and guiding model early stopping (Stubbemann et al., 2022).
- Diffusion-based data analysis: Higher ID correlates with samples lying further out of distribution and with richer morphology in astronomical images; classical estimators vastly underestimate ID in high-noise, high-complexity domains (Roset et al., 14 Nov 2025).
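As a sketch of the ID-guided rank selection mentioned in the first bullet above, assuming Hugging Face `peft` and `transformers`; the model name, target modules, and the value of `estimated_id` are placeholders (the ID would come from an estimator such as the MLE sketch in Section 2.1):

```python
import math
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

estimated_id = 32   # placeholder: e.g. the measured ID of the token-embedding matrix

# Set the adapter rank at (or slightly above) the estimated ID: much smaller ranks risk the
# degradation noted above, much larger ranks mostly add memory without accuracy gains.
rank = 2 ** math.ceil(math.log2(estimated_id))

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # placeholder model
lora_cfg = LoraConfig(r=rank, lora_alpha=2 * rank,
                      target_modules=["q_proj", "v_proj"], lora_dropout=0.05)
model = get_peft_model(model, lora_cfg)
```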
5. Scale Dependence and Limitations
ID estimation is inherently scale-dependent (Noia et al., 24 May 2024) (Gupta et al., 18 Oct 2025):
- At very small scales, measurement noise inflates ID.
- At large scales, manifold curvature, folding, or topology induce overestimation.
- Automated protocols like ABIDE select the optimal local neighborhood by maximizing the validity of the constant-density model via likelihood-ratio tests, yielding consistent and asymptotically normal ID estimates.
- Grid-based and connectivity methods adaptively balance scale and noise to robustly recover integer or fractal dimensions in noisy, high-dimensional scenarios (Gupta et al., 18 Oct 2025).
Discrete data (e.g., sequences, categorical data) require adapted estimators (e.g., I3D) leveraging Poisson-binomial models of counts rather than infinitesimal ball scaling, with verification via empirical CDF congruence and plateau analysis (Macocco et al., 2022).
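A simple manual check of this scale dependence sweeps the neighborhood size k of the MLE estimator from Section 2.1 and reads off the plateau; the synthetic 3-D data and noise level here are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_id(X, k):
    """Levina-Bickel MLE, as in the Section 2.1 sketch."""
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    local = (k - 1) / np.log(dist[:, -1][:, None] / dist[:, 1:-1]).sum(axis=1)
    return 1.0 / np.mean(1.0 / local)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 40))   # 3-D subspace of R^40 ...
X += 0.02 * rng.normal(size=X.shape)                        # ... plus small ambient noise
for k in (5, 10, 20, 40, 80, 160):
    # Very small k is dominated by noise, very large k by curvature and finite extent;
    # the usable estimate is the plateau over intermediate k (here, near 3).
    print(k, round(mle_id(X, k=k), 2))
```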
6. Algorithmic and Software Ecosystem
A mature ecosystem exists for both global and local ID estimation:
- Python: scikit-dimension (skdim) implements 19 leading linear and nonlinear ID estimators, covering PCA-based, kNN-based, concentration-based, and fractal methodologies with both global and local scope (Bac et al., 2021).
- R: The “IDmining” package implements the Morisita estimator and MBRM feature selection.
- C++/Python: FAISS for scalable nearest-neighbor computations.
- Open-source diffusion and generative-model ID analysis pipelines exist for domain-specific applications (Roset et al., 14 Nov 2025) (Causin et al., 9 Jul 2025).
Selection of estimator and algorithm depends on dataset scale, noise, expected ID range, and whether local or global structure is of analytical interest.
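A minimal skdim workflow, assuming `pip install scikit-dimension`; the estimator choices and the synthetic 7-D manifold are illustrative, and comparing several estimators echoes the recommendation in the summary below.

```python
import numpy as np
import skdim

# Synthetic benchmark: a mildly curved 7-D manifold embedded in R^50.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5000, 7))
X = np.tanh(Z @ rng.normal(size=(7, 50)) / 5.0)

# Compare a few global estimators rather than trusting any single one.
for name, est in [("lPCA", skdim.id.lPCA()),
                  ("MLE", skdim.id.MLE()),
                  ("TwoNN", skdim.id.TwoNN())]:
    print(name, est.fit(X).dimension_)   # each should land near 7
```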
7. Summary Table: Methodological Landscape
| Estimator Class | Key Formula/Approach | Best-use Scenario / Notes |
|---|---|---|
| Nearest-neighbor MLE | Levina–Bickel, TwoNN | Fast, general-purpose; excellent for manifold data |
| Concentration/Separability | Fisher-S, DANCo, ESS | Robust to noise, local/global ID, fast after PCA |
| Fractal (box-counting, Morisita) | Scaling pairwise counts, grid counts | Non-Euclidean, redundancy analysis |
| Singular metric (generative) | Spectral gap in VAE pullback metric | Latent-manifold analysis, inverse problems |
| Connectivity factor (eDCF) | Local lattice neighbor matches | Parallelizable, robust to scale, integrated into topology |
| Diffusion model rank deficiency | Rank of stacked scores | High-noise, very high-dimensional complex data |
Each method has regime-dependent strengths: practitioners are advised to average or ensemble estimates, verify ID plateaus over scale, and use model selection criteria (e.g., likelihood, evidence) where applicable (Bac et al., 2021).
References:
- (Kataiwa et al., 4 Mar 2025) Measuring Intrinsic Dimension of Token Embeddings
- (Pope et al., 2021) The Intrinsic Dimension of Images and Its Impact on Learning
- (Konz et al., 15 Aug 2024) Pre-processing and Compression: Understanding Hidden Representation Refinement Across Imaging Domains via Intrinsic Dimension
- (Ansuini et al., 2019) Intrinsic dimension of data representations in deep neural networks
- (Golay et al., 2016) Unsupervised Feature Selection Based on the Morisita Estimator of Intrinsic Dimension
- (Cazzaniga et al., 17 Nov 2025) MDIntrinsicDimension: Dimensionality-Based Analysis of Collective Motions in Macromolecules from Molecular Dynamics Trajectories
- (Stubbemann et al., 2022) Intrinsic Dimension for Large-Scale Geometric Learning
- (Causin et al., 9 Jul 2025) Estimating Dataset Dimension via Singular Metrics under the Manifold Hypothesis: Application to Inverse Problems
- (Roset et al., 14 Nov 2025) Intrinsic Dimension Estimation for Radio Galaxy Zoo using Diffusion Models
- (Gupta et al., 18 Oct 2025) eDCF: Estimating Intrinsic Dimension using Local Connectivity
- (Albergante et al., 2019) Estimating the effective dimension of large biological datasets using Fisher separability analysis
- (Macocco et al., 2022) Intrinsic dimension estimation for discrete metrics
- (Bac et al., 2020) Local intrinsic dimensionality estimators based on concentration of measure
- (Noia et al., 24 May 2024) Beyond the noise: intrinsic dimension estimation with optimal neighbourhood identification
- (Bac et al., 2021) Scikit-dimension: a Python package for intrinsic dimension estimation