
Intrinsic Dimension in Data Analysis

Updated 25 November 2025
  • Intrinsic Dimension (ID) is the minimum number of parameters required to represent high-dimensional data lying on a lower-dimensional manifold.
  • Various estimation techniques, such as nearest-neighbor, fractal, and spectral methods, reveal the effective complexity and redundancy in data representations.
  • Empirical studies demonstrate that lower intrinsic dimensions often correlate with improved model generalization and drive efficient feature selection and compression.

Intrinsic Dimension (ID) is the minimal number of degrees of freedom required to describe a set of high-dimensional data without significant information loss (Kataiwa et al., 4 Mar 2025) (Pope et al., 2021). While the extrinsic dimension (ED) is defined by the ambient coordinate space, ID captures the dimension of the underlying manifold or structure on which the data lie, quantifying the effective complexity and redundancy in data representations.

1. Mathematical Definition and Theoretical Foundations

Let $X = \{x_1, \dots, x_N\} \subset \mathbb{R}^D$ denote a dataset in a high-dimensional ambient space. The intrinsic dimension $d$ is the smallest integer such that, locally, the data can be parametrized by $d$ real variables, i.e., they lie (approximately) on a $d$-dimensional manifold $M$ embedded in $\mathbb{R}^D$ (Causin et al., 9 Jul 2025). Formally, $d = \dim(M)$, where each point $p \in M$ admits a homeomorphism $\phi_p: U_p \subset M \to \mathbb{R}^d$.
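
As a concrete (hypothetical) illustration of the gap between ED and ID, the sketch below embeds points drawn from a 2-parameter latent space into $\mathbb{R}^{50}$: the ambient dimension is 50, but any reasonable ID estimator should report a value near 2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, D = 5000, 2, 50

# Intrinsic coordinates: d = 2 latent parameters per point.
Z = rng.uniform(size=(N, d))

# Smooth injective embedding into the ambient space R^D:
# a rank-d linear lift followed by a coordinate-wise nonlinearity.
A = rng.normal(size=(d, D))
X = np.tanh(Z @ A)

print(X.shape)  # (5000, 50): ED = 50, but the data lie on a 2-dimensional manifold
```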

Various approaches link ID to statistical and geometric properties:

  • Manifold hypothesis: Data concentrate near a smooth manifold of low ID.
  • Concentration of measure: High-dimensional point clouds often exhibit sharp regularities (e.g., linear separability, angular concentration) that inform ID estimators (Stubbemann et al., 2022) (Albergante et al., 2019) (Bac et al., 2020).
  • Spectral methods (e.g., autoencoder Jacobians): The numerical rank of the Riemannian metric induced by a generative model reveals ID (Causin et al., 9 Jul 2025).

2. Key Estimation Methodologies

Several classes of estimators have emerged, each exploiting different geometric, statistical, or topological aspects of the data:

2.1 Nearest-Neighbor–Based Methods

Maximum-likelihood methods such as the Levina–Bickel MLE estimate the local ID at each data point $x$ using the geometry of its $k$ nearest neighbors:

$$\widehat{\mathrm{LID}_k(x)} = \Biggl[ \frac{1}{k-1} \sum_{i=1}^{k-1} \ln \frac{d_k(x)}{d_i(x)} \Biggr]^{-1}$$

where $d_i(x)$ is the distance to the $i$th nearest neighbor. The global ID is then the harmonic mean over all points (Kataiwa et al., 4 Mar 2025) (Pope et al., 2021). Fast implementations leverage FAISS or KD-trees.
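
A minimal NumPy/scikit-learn sketch of this estimator (the function name and the choice k = 20 are illustrative, not prescribed by the cited papers):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def levina_bickel_id(X, k=20):
    """Global ID as the harmonic mean of pointwise Levina-Bickel MLE estimates.

    Sketch of the estimator above; the neighborhood size k is a tuning choice.
    """
    # Distances to the k nearest neighbors of every point (column 0 is the point itself).
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    dist = dist[:, 1:]

    # Pointwise MLE: inverse of the mean of ln(d_k / d_i), i = 1, ..., k-1.
    lid = 1.0 / np.log(dist[:, -1:] / dist[:, :-1]).mean(axis=1)

    # Global ID: harmonic mean of the pointwise estimates.
    return len(lid) / np.sum(1.0 / lid)
```

Applied to the toy construction in Section 1, this should return a value close to 2.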

2.2 Expansion-Rate (Pareto, TwoNN) Methods

The distribution of ratios of nearest-neighbor distances, e.g., $\mu_i = r^{(2)}_i / r^{(1)}_i$, follows a Pareto law for uniform sampling:

$$f(\mu \mid d) = d\,\mu^{-(d+1)}, \quad \mu \ge 1$$

MLE or Bayesian fits yield global or local ID estimates (Ansuini et al., 2019) (Allegra et al., 2019).
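
The following sketch implements the closed-form Pareto MLE for this ratio distribution; the discard fraction for the largest ratios is an illustrative choice standing in for the trimming step of the original TwoNN recipe:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X, discard_fraction=0.1):
    """TwoNN-style ID estimate from ratios of second to first neighbor distances.

    Sketch: uses the closed-form Pareto MLE d = N / sum(ln mu_i); discarding the
    largest ratios approximates the trimming step of the original TwoNN recipe.
    """
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]                       # r2 / r1 for every point

    # Drop the largest ratios, which are most affected by boundaries and noise.
    mu = np.sort(mu)[: int(len(mu) * (1.0 - discard_fraction))]

    return len(mu) / np.sum(np.log(mu))                # Pareto shape-parameter MLE
```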

2.3 Angle and Concentration Methods

Separability and angle distribution analysis exploit concentration-of-measure effects in high dimensions. The probability that two normalized random vectors have inner product exceeding threshold α decays exponentially in dimension:

$$p_\alpha(n) = \frac{(1 - \alpha^2)^{(n-1)/2}}{\alpha \sqrt{2\pi n}}$$

Solving for $n$ gives the Fisher-separability–based estimator (Albergante et al., 2019) (Bac et al., 2020).
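
As a sketch of the inversion step only (the full Fisher-separability pipeline also whitens the data and estimates the inseparability probability from pairwise projections), the equation above can be solved for $n$ numerically:

```python
import numpy as np
from scipy.optimize import brentq

def invert_separability_dimension(p_hat, alpha=0.8):
    """Numerically invert p_alpha(n) for n, given an observed probability p_hat.

    Sketch of the inversion step only; p_hat must be smaller than p_alpha(1)
    for the bracketing below to succeed.
    """
    def log_p(n):
        return 0.5 * (n - 1) * np.log(1 - alpha**2) - np.log(alpha * np.sqrt(2 * np.pi * n))

    # p_alpha(n) is strictly decreasing in n, so bracket the root over a wide range.
    return brentq(lambda n: log_p(n) - np.log(p_hat), 1.0, 1e4)

# Example: alpha = 0.8 and an observed probability of 1e-4 give n of roughly 15.
print(invert_separability_dimension(1e-4, alpha=0.8))
```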

2.4 Fractal, Correlation, and Morisita Methods

Classical fractal methods (box-counting, Grassberger–Procaccia correlation dimension) fit the scaling law of pairwise counts within distance $r$. The Morisita estimator employs grid-based counting and multipoint scaling, and remains robust under high sparsity (Golay et al., 2016).
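
A compact sketch of the Grassberger–Procaccia correlation dimension, with the scaling range of radii left as a user choice:

```python
import numpy as np
from scipy.spatial.distance import pdist

def correlation_dimension(X, r_values):
    """Grassberger-Procaccia correlation dimension (sketch).

    C(r) = fraction of point pairs closer than r; the estimate is the slope of
    log C(r) versus log r, fitted over a scaling range supplied by the caller.
    """
    r_values = np.asarray(r_values, dtype=float)
    d = pdist(X)                                      # all pairwise distances
    C = np.array([np.mean(d < r) for r in r_values])  # correlation integral
    mask = C > 0                                      # keep radii with at least one pair
    slope, _ = np.polyfit(np.log(r_values[mask]), np.log(C[mask]), 1)
    return slope
```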

2.5 Topological and Connectivity-Based Estimators

Topological methods (eDCF) assess local geometric connectivity (e.g., on a quantized grid), matching observed neighbor counts to combinatorial signatures of various manifold dimensions (Gupta et al., 18 Oct 2025).

2.6 Singular Metric Analysis via Generative Models

Spectral analysis of the pullback metric $g(z) = J_f(z)^\top J_f(z)$ for the decoder $f$ of a β-VAE leads to ID estimation via the number of eigenvalues above a threshold (the “spectral gap”) (Causin et al., 9 Jul 2025).
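
A PyTorch sketch of this idea, assuming `decoder` is any differentiable map from a single latent vector to data space (e.g., a trained β-VAE decoder); the relative eigenvalue threshold is a simple stand-in for the spectral-gap detection described in the cited work:

```python
import torch

def pullback_metric_rank(decoder, z, rel_threshold=1e-3):
    """Count eigenvalues of the pullback metric g(z) = J_f(z)^T J_f(z) above a
    relative threshold, as a proxy for the ID at the latent point z.

    Sketch: `decoder` is assumed to accept a single latent vector (no batch dim).
    """
    J = torch.autograd.functional.jacobian(decoder, z)  # shape (*data_shape, latent_dim)
    J = J.reshape(-1, z.numel())                         # flatten to (data_dim, latent_dim)
    g = J.T @ J                                          # pullback (Riemannian) metric at z
    eigvals = torch.linalg.eigvalsh(g)                   # nonnegative, ascending
    return int((eigvals > rel_threshold * eigvals.max()).sum())
```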

2.7 Diffusion Model Rank Deficiency

In score-based diffusion, stacking noisy score vectors evaluated around a data point yields a matrix whose rank deficiency directly yields the ID: the score vectors span the normal space of the data manifold, so the ID is the codimension of that space, i.e., the ambient dimension minus the numerical rank of the stacked matrix (Roset et al., 14 Nov 2025).
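
A hedged sketch of this rank-deficiency computation, assuming a learned score function `score_fn(x, t)` is available; the perturbation scale, sample count, and rank tolerance are illustrative choices rather than values from the cited paper:

```python
import torch

def score_rank_id(score_fn, x, t, n_samples=64, noise_scale=1e-3, rel_tol=1e-2):
    """ID as ambient dimension minus the numerical rank of stacked score vectors.

    Sketch: `score_fn(x, t)` is assumed to return the learned score at noise level t.
    """
    D = x.numel()
    scores = []
    for _ in range(n_samples):
        x_noisy = x + noise_scale * torch.randn_like(x)   # small perturbations of x
        scores.append(score_fn(x_noisy, t).reshape(-1))
    S = torch.stack(scores)                                # (n_samples, D)

    # The score vectors (approximately) span the normal space of the data
    # manifold at x, so its dimension is the numerical rank of S.
    sv = torch.linalg.svdvals(S)
    normal_dim = int((sv > rel_tol * sv.max()).sum())
    return D - normal_dim
```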

3. Empirical Characterization in Machine Learning

Intrinsic dimension has practical consequences and is measurable in a broad range of domains:

  • Natural images: Despite extrinsic pixel dimensions (e.g., 784 for MNIST, 150,528 for ImageNet), measured IDs range from 7–13 (MNIST) to 26–43 (ImageNet), validating the low-dimensional manifold hypothesis (Pope et al., 2021).
  • Token embeddings in LLMs: Embedding spaces (ED = 128–5120) often have IDs 13–122, with >90% redundancy for large models—ID remains nearly constant as ED grows, producing “saturated” redundancy ratios (Kataiwa et al., 4 Mar 2025).
  • Deep neural networks: Across layers, ID typically “expands” to a peak and then compresses in deeper layers (the “humpback” profile). Final-layer ID is an empirical predictor of test-set accuracy, with lower ID correlating with better generalization (Konz et al., 15 Aug 2024) (Ansuini et al., 2019).
  • Domain dependency: Medical imaging models peak in ID much earlier than natural image models, and peak hidden-layer ID is roughly proportional to raw data ID (empirically, $\sim 2\times$ the input ID), demonstrating the data-driven capacity of learned representations (Konz et al., 15 Aug 2024).
  • Molecular dynamics: Time-, space-, and state-resolved ID analysis of macromolecular simulations reveals distinct manifold complexity in folded and unfolded protein states, often more sharply than traditional geometric order parameters (Cazzaniga et al., 17 Nov 2025).

4. Practical Applications: Compression, Adaptation, and Feature Selection

  • Model compression/adaptation: Low-rank adaptation methods such as LoRA benefit from ID-guided rank selection: setting the adapter rank $r \gtrsim$ the estimated ID avoids catastrophic performance degradation in LLM fine-tuning, while marginal gains saturate above $r \sim 1.5\times$ the ID (Kataiwa et al., 4 Mar 2025); see the sketch after this list.
  • Feature selection: The Morisita estimator drives greedy filter selection in the MBRM algorithm, identifying minimal feature subsets that span the intrinsic manifold, with empirical reduction of dimensionality by 50–75% and no loss in downstream accuracy (Golay et al., 2016).
  • Outlier/anomaly detection: Local ID heterogeneity serves as an unsupervised indicator of structural transitions, e.g., in molecular folding or manifold clustering (Allegra et al., 2019) (Cazzaniga et al., 17 Nov 2025).
  • Scalable geometric learning: Axiomatic, concentration-based IDs (Pestov) quantify discriminability and flattening across neighborhood aggregations in large graphs, informing depth and expressiveness in GNNs and guiding model early stopping (Stubbemann et al., 2022).
  • Diffusion-based data analysis: Higher ID correlates with greater out-of-distribution-ness and richer morphology in astronomical images; classical estimators vastly underestimate ID in high-noise, high-complexity domains (Roset et al., 14 Nov 2025).
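
A minimal sketch of the ID-guided rank heuristic from the first bullet above, using scikit-dimension's MLE estimator; the `headroom` factor of 1.5 mirrors the reported saturation point, and `embeddings` is any sampled activation or token-embedding matrix from the model being adapted:

```python
import numpy as np
import skdim

def suggest_lora_rank(embeddings, headroom=1.5):
    """Suggest a LoRA adapter rank from the estimated ID of a representation matrix.

    Sketch: `embeddings` is an (n_samples, hidden_dim) array sampled from the model.
    """
    id_est = skdim.id.MLE().fit(embeddings).dimension_   # Levina-Bickel MLE via skdim
    return int(np.ceil(headroom * id_est))
```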

5. Scale Dependence and Limitations

ID estimation is inherently scale-dependent (Noia et al., 24 May 2024) (Gupta et al., 18 Oct 2025); a simple plateau check over neighborhood size is sketched after the list below:

  • At very small scales, measurement noise inflates ID.
  • At large scales, manifold curvature, folding, or topology induce overestimation.
  • Automated protocols like ABIDE select the optimal local neighborhood by maximizing the validity of the constant-density model via likelihood-ratio tests, yielding consistent and asymptotically normal ID estimates.
  • Grid-based and connectivity methods adaptively balance scale and noise to robustly recover integer or fractal dimensions in noisy, high-dimensional scenarios (Gupta et al., 18 Oct 2025).
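
As referenced above, a simple plateau check computes the Levina–Bickel global ID at several neighborhood sizes and trusts the estimate only where it is stable; the specific k grid is an illustrative choice:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def id_vs_scale(X, k_values=(5, 10, 20, 40, 80)):
    """Levina-Bickel global ID at several neighborhood sizes k (plateau check).

    Sketch only: small k is sensitive to noise, large k to curvature, so an
    estimate is most trustworthy over a range of k where it stays flat.
    """
    out = {}
    for k in k_values:
        dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
        dist = dist[:, 1:]
        lid = 1.0 / np.log(dist[:, -1:] / dist[:, :-1]).mean(axis=1)
        out[k] = len(lid) / np.sum(1.0 / lid)   # harmonic mean over points
    return out
```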

Discrete data (e.g., sequences, categorical data) require adapted estimators (e.g., I3D) leveraging Poisson-binomial models of counts rather than infinitesimal ball scaling, with verification via empirical CDF congruence and plateau analysis (Macocco et al., 2022).

6. Algorithmic and Software Ecosystem

A mature ecosystem exists for both global and local ID estimation:

  • Python: scikit-dimension (skdim) implements 19 leading linear and nonlinear ID estimators, covering PCA-based, kNN-based, concentration-based, and fractal methodologies with both global and local scope (Bac et al., 2021); a minimal usage example follows this list.
  • R: The “IDmining” package implements the Morisita estimator and MBRM feature selection.
  • C++/Python: FAISS for scalable nearest-neighbor computations.
  • Open-source diffusion and generative-model ID analysis pipelines exist for domain-specific applications (Roset et al., 14 Nov 2025) (Causin et al., 9 Jul 2025).
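
A minimal skdim usage example (the estimator choices and toy data are illustrative; fitted objects expose the global estimate via `.dimension_` and pointwise estimates via `.dimension_pw_`):

```python
import numpy as np
import skdim

# Toy data: a uniformly sampled 2-ball embedded in 10 ambient dimensions.
X = np.zeros((1000, 10))
X[:, :2] = skdim.datasets.hyperBall(n=1000, d=2, radius=1.0, random_state=0)

# Global ID from two estimator families.
print(skdim.id.TwoNN().fit(X).dimension_)  # expansion-rate (Pareto) estimator
print(skdim.id.MLE().fit(X).dimension_)    # Levina-Bickel nearest-neighbor MLE

# Local ID in k-nearest-neighborhoods around each point.
lpca = skdim.id.lPCA().fit_pw(X, n_neighbors=100)
print(np.mean(lpca.dimension_pw_))
```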

Selection of estimator and algorithm depends on dataset scale, noise, expected ID range, and whether local or global structure is of analytical interest.

7. Summary Table: Methodological Landscape

| Estimator Class | Key Formula / Approach | Best-use Scenario / Notes |
|---|---|---|
| Nearest-neighbor MLE | Levina–Bickel, TwoNN | Fast, general-purpose; excellent for manifold data |
| Concentration / separability | Fisher-S, DANCo, ESS | Robust to noise, local/global ID, fast after PCA |
| Fractal (box-counting, Morisita) | Scaling of pairwise counts, grid counts | Non-Euclidean, redundancy analysis |
| Singular metric (generative) | Spectral gap in VAE pullback metric | Latent-manifold analysis, inverse problems |
| Connectivity factor (eDCF) | Local lattice neighbor matches | Parallelizable, robust to scale, integrated with topology |
| Diffusion model rank deficiency | Rank of stacked score vectors | High-noise, very high-dimensional complex data |

Each method has regime-dependent strengths: practitioners are advised to average or ensemble estimates, verify ID plateaus over scale, and use model selection criteria (e.g., likelihood, evidence) where applicable (Bac et al., 2021).

