
Intrinsic Dimensionality

Updated 21 December 2025
  • Intrinsic dimensionality (ID) is the minimal number of variables required to capture the core structure of high-dimensional data and manifolds.
  • It is estimated using methods such as MLE, TwoNN, and grid-based techniques that analyze local scaling behavior and geometric properties.
  • ID informs practical applications in deep neural networks, molecular dynamics, and graph learning by guiding model complexity and segmentation.

Intrinsic dimensionality (ID) quantifies the minimal number of variables or degrees of freedom required to accurately describe a data set, a manifold, or a representation within an ambient high-dimensional space. For data clouds, ID is formalized as the smallest value d such that at sufficiently small scales, relevant statistics—such as the volume of neighborhoods, distributions of distances, or other geometric properties—behave as if the data were sampled from a d-dimensional structure, regardless of ambient dimension. As such, ID provides a fundamental metric for understanding data complexity, guiding the choice of modeling and dimensionality reduction techniques, and informing algorithmic considerations across manifold learning, representation analysis in deep neural networks, feature selection, and beyond.

1. Formal Definitions and Theoretical Foundations

The dominant mathematical formalization of ID relies on the assumption that points are sampled (possibly with noise or redundancy) from or near a d-dimensional manifold $\mathcal{M} \subset \mathbb{R}^D$, with $d \ll D$. The intrinsic dimension is then $d = \dim(\mathcal{M})$ in the topological or manifold sense. For point clouds, ID is operationalized by the local scaling behavior:

  • For a point $x$, the number of neighbors within radius $r$ grows as $N(r) \propto r^d$ as $r \to 0$.
  • For the cumulative distribution function $F(r)$ of inter-point distances, the local ID at $x$ is $d = r F'(r)/F(r)$ (Amsaleg et al., 2022).
  • Box-counting and correlation dimension approaches relate the covering number $N(\epsilon)$ to scale via $\log N(\epsilon)/\log(1/\epsilon)$ as $\epsilon \to 0$ (Eser et al., 13 Nov 2025); a numerical sketch of this scaling fit follows the list.
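
The neighbor-count scaling can be checked directly on data. The following is a minimal, illustrative sketch (not the implementation of any cited package) that fits the slope of $\log N(r)$ against $\log r$ over small radii for a synthetic 2-D sheet embedded in 10 ambient dimensions; the data set, radii, and sample size are arbitrary choices for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def scaling_dimension(X, fractions=np.linspace(0.05, 0.25, 8)):
    """Rough ID estimate from the scaling N(r) ~ r^d.

    Counts the average number of neighbors within radius r for several
    small radii (chosen as fractions of the median pairwise distance)
    and returns the fitted log-log slope. Illustrative only: O(N^2)
    memory and no handling of noise or curvature.
    """
    D = cdist(X, X)
    radii = fractions * np.median(D[D > 0])
    counts = [(D < r).sum(axis=1).mean() - 1.0 for r in radii]  # exclude self
    slope, _ = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
# 2-D Gaussian sheet embedded linearly in a 10-D ambient space: true ID = 2.
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
print(scaling_dimension(X))  # close to 2
```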

Alternative formalizations, including axiomatic frameworks grounded in the concentration of measure, define ID via discriminability or observable diameter integrals over all features, leading to the notion that ID is also the inverse square of the average observable diameter across levels of feature concentration (Stubbemann et al., 2022).

Redundant data and highly curved or clustered structures motivate data-driven and local definitions: manifold regions of varying complexity can be assigned pointwise or local ID, making the concept robust to heterogeneity (Allegra et al., 2019).

2. Methodologies for Intrinsic Dimension Estimation

A wide spectrum of ID estimators exists, each tailored to different data regimes and assumptions:

  • Nearest-Neighbor–Based Estimators:
    • Maximum Likelihood Estimator (MLE)/Hill Estimator: At each point $x$, the ID is estimated by the inverse of the average log-ratio of the $k$ nearest-neighbor distances:
    • $\hat d_{\mathrm{MLE}}(x) = \left( \frac{1}{k-1} \sum_{j=1}^{k-1} \ln \frac{r_k}{r_j} \right)^{-1}$ (Kataiwa et al., 4 Mar 2025, Amsaleg et al., 2022).
    • TwoNN/Ratio Estimator: Uses only the first and second nearest-neighbor distances, exploiting the Pareto distribution of the ratio $\mu = r_2/r_1$, leading to $d = 1/\mathbb{E}[\ln \mu]$ (Ansuini et al., 2019, Razmjoo et al., 14 Dec 2025). Both estimators are sketched in code after this list.
    • ABIDE: Combines the BIDE estimator—based on binomial statistics for counts inside concentric balls—with an automated scale-selection protocol to avoid noise and curvature regimes, yielding robust, adaptive ID with controlled uncertainty (Noia et al., 24 May 2024).
  • Angle- and Concentration-Based Estimators:
    • Fisher Separability (FisherS): Measures the probability that a point cannot be linearly separated from the rest after whitening and normalization. The observed inseparability is inverted to estimate dimension via a Lambert W function (Bac et al., 2020, Albergante et al., 2019).
    • ABID (Angle-Based Intrinsic Dimensionality): Utilizes the distribution of pairwise angles among neighborhood vectors, exploiting the moment $\mathbb{E}[\cos^2 \theta] = 1/d$ for points on $S^{d-1}$ (Thordsen et al., 2020).
  • Grid and Connectivity-Based Estimators:
    • eDCF (Empirically-Weighted Distributed Connectivity Factor): Projects data onto a discrete grid, computes the "connectivity factor" for each cell, and then interpolates these against theoretical or synthetic reference models to infer ID, robust to scale and noise (Gupta et al., 18 Oct 2025).
  • Manifold and Graph-Based Approaches:
    • Axiomatic/Observable Diameter-Based ID: Defines ID for geometric data sets or graphs via integrals over observable diameters of real-valued feature sets (including k-hop aggregated features), resulting in an ID that reflects both Euclidean and graph-induced complexities and is scalable to massive data (Stubbemann et al., 2022).
  • Specialized Methods:
    • I³D (for Discrete Metrics): Adapts binomial MLE and lattice volume enumeration to Hamming/L1-discrete spaces, crucial for categorical, genomic or sequence data that violate continuous-manifold assumptions (Macocco et al., 2022).
    • Morisita ID (Fractal/Spatial Index): Employs spatial count statistics over a range of grid scales, regressing the log index vs. log scale to infer fractal or self-similar dimension, and can drive feature selection pipelines (Golay et al., 2016).
  • Software and Benchmarking:
    • scikit-dimension: Implements 19 estimators, including many of those above, and provides systematic guidance and large-scale benchmarking on synthetic and real-world datasets, emphasizing method selection according to data size, dimensionality, and application (Bac et al., 2021).
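
As a concrete illustration of the nearest-neighbor estimators listed above, the sketch below implements the Levina-Bickel MLE and the TwoNN ratio estimator with numpy and scikit-learn's neighbor search. It is a minimal reimplementation of the formulas quoted earlier, not code from any of the cited packages; the test manifold and parameter choices are arbitrary.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_id(X, k=10):
    """Levina-Bickel / Hill MLE: average over points of
    [ (1/(k-1)) * sum_{j<k} ln(r_k / r_j) ]^{-1}."""
    # Ask for k+1 neighbors because each point is its own nearest neighbor.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r = dists[:, 1:]                                   # drop zero self-distance
    local = np.log(r[:, -1:] / r[:, :-1]).mean(axis=1) ** -1
    return local.mean()

def twonn_id(X):
    """TwoNN ratio estimator: d = 1 / E[ln(r2 / r1)]."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]                     # second-to-first NN ratio
    return 1.0 / np.log(mu).mean()

rng = np.random.default_rng(0)
# Nonlinear 3-D manifold embedded in 20 ambient dimensions: true ID = 3.
Z = rng.normal(size=(5000, 3))
X = np.hstack([Z, np.tanh(Z), Z @ rng.normal(size=(3, 14))])
print(mle_id(X), twonn_id(X))                          # both should be near 3
```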

3. Practical Considerations and Scale Dependence

Estimation of ID is fundamentally scale-sensitive:

  • Small scales: Noisy measurement artifacts dominate, leading to overestimation.
  • Large scales: Manifold curvature and topology inflate apparent dimension (Noia et al., 24 May 2024).
  • The correct scale—characterized by locally constant density and negligible curvature—must be determined in a data-driven manner; ABIDE and eDCF are explicit in automatically identifying this "sweet spot" (Noia et al., 24 May 2024, Gupta et al., 18 Oct 2025).
  • Grid-based and connectivity methods provide alternative strategies that are less sensitive to metric choices, allowing robust multi-scale inference (Gupta et al., 18 Oct 2025).

Practical parameter choices include the number of neighbors kk, radius selection, or regularization constants. Best-practice guidelines universally recommend sensitivity analysis over these parameters and caution in interpreting ID estimates on sparse or heterogeneous data (Bac et al., 2020, Golay et al., 2016).
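
To make the recommended sensitivity analysis concrete, the sketch below sweeps the neighborhood size $k$ for the MLE form of Section 2 on a synthetic noisy 5-D data set; a plateau of near-constant estimates indicates a trustworthy scale, while drift at small or large $k$ signals noise or curvature/boundary effects. The data set, noise level, and grid of $k$ values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mle_id(X, k):
    """Levina-Bickel MLE of ID averaged over points (see Section 2)."""
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    r = dists[:, 1:]
    return (np.log(r[:, -1:] / r[:, :-1]).mean(axis=1) ** -1).mean()

rng = np.random.default_rng(1)
# 5-D uniform data embedded in 30 dimensions with a little ambient noise.
X = np.pad(rng.uniform(size=(4000, 5)), ((0, 0), (0, 25)))
X += 0.01 * rng.normal(size=X.shape)

# Sweep the neighborhood size: stable estimates mark the reliable scale;
# inflation at small k reflects noise, drift at large k reflects boundaries.
for k in (5, 10, 20, 40, 80):
    print(k, round(mle_id(X, k), 2))
```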

4. Applications Across Scientific Domains

Deep Neural Network Representations:

Layerwise analysis of ID in deep networks reveals a characteristic non-monotonic profile: initial layers rapidly expand ID (decorrelation, partial whitening), middle layers reach a peak that closely tracks twice the ID of the data, and later layers compress representations onto a low-dimensional manifold. Lower ID in the final hidden layer is tightly correlated with generalization performance, regardless of architecture or training domain (Ansuini et al., 2019, Konz et al., 15 Aug 2024). Similar phenomena are observed for token or word embedding spaces, with redundancy quantified as $(\mathrm{ED} - \mathrm{ID})/\mathrm{ED}$ approaching 98% for LLMs (Kataiwa et al., 4 Mar 2025).
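
The redundancy ratio is straightforward to compute once an ID estimate is available. The sketch below applies the TwoNN form from Section 2 to a synthetic stand-in for an embedding table (a random matrix concentrated near a 20-dimensional subspace of a 768-dimensional space); the matrix, noise level, and dimensions are hypothetical choices for illustration only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(E):
    """TwoNN estimate d = 1 / E[ln(r2 / r1)] (see Section 2)."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(E).kneighbors(E)
    return 1.0 / np.log(dists[:, 2] / dists[:, 1]).mean()

rng = np.random.default_rng(0)
# Synthetic stand-in for an embedding table: 4000 rows lying near a
# 20-dimensional subspace of a 768-dimensional ambient space, plus noise.
E = rng.normal(size=(4000, 20)) @ rng.normal(size=(20, 768)) / np.sqrt(20)
E += 0.01 * rng.normal(size=E.shape)

ed, idim = E.shape[1], twonn_id(E)
print(f"ED={ed}  ID~{idim:.1f}  redundancy~{(ed - idim) / ed:.1%}")
```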

Molecular and Biological Data:

In molecular dynamics, ID quantifies the number of independent collective motions of biomolecules, distinguishing conformational phases (e.g., folded vs. unfolded) and localizing flexibility to specific sequence regions. MDIntrinsicDimension provides dedicated estimators for time-resolved and segmental analysis (Cazzaniga et al., 17 Nov 2025). In quantum chemistry, property-specific local ID can be directly estimated by eigenanalysis of the Hessian of the property with respect to all continuous atomic variables, revealing that chemically meaningful accuracy tolerances yield drastic dimension reduction, stabilizing across molecule classes (Banjafar et al., 3 Jul 2025).
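
The tolerance-dependent dimension count can be illustrated with a toy Hessian eigenanalysis: directions whose curvature contribution falls below the accuracy tolerance are discarded. The construction below (random orthogonal frame, hand-picked stiffness spectrum, arbitrary units) is purely hypothetical and only meant to show how a looser tolerance shrinks the effective dimension; it is not the estimator of the cited work.

```python
import numpy as np

def property_local_id(hessian, tol):
    """Count Hessian eigendirections whose magnitude exceeds the
    accuracy tolerance `tol` -- a toy proxy for property-specific local ID."""
    eigvals = np.linalg.eigvalsh(hessian)
    return int((np.abs(eigvals) > tol).sum())

rng = np.random.default_rng(0)
# Toy quadratic "property" over 12 coordinates: a few stiff directions,
# many soft ones (arbitrary units).
Q, _ = np.linalg.qr(rng.normal(size=(12, 12)))
stiffness = np.array([50.0, 20.0, 8.0, 1.0] + [0.05] * 8)
H = Q @ np.diag(stiffness) @ Q.T

for tol in (0.01, 0.5, 5.0):
    print(tol, property_local_id(H, tol))   # looser tolerance -> fewer dims
```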

Manifold Segmentation and Imbalance Quantification:

Local ID enables unsupervised segmentation of data into regions or clusters of differing complexity, outperforming density-based clustering in scenarios like protein folding, neuroimaging, and finance (Allegra et al., 2019). In imbalanced classification problems, classwise ID provides a model-free measure of "geometric difficulty" superior to cardinality, yielding improved sampling and weighting strategies, particularly when rare classes inhabit more complex manifolds (Eser et al., 13 Nov 2025).
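
A minimal sketch of the classwise idea follows, assuming a simple TwoNN estimate per class and a purely illustrative way of turning the estimates into class weights; the synthetic classes, the weighting rule, and all parameters are hypothetical and not the scheme of the cited paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def twonn_id(X):
    """TwoNN estimate d = 1 / E[ln(r2 / r1)] (see Section 2)."""
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    return 1.0 / np.log(dists[:, 2] / dists[:, 1]).mean()

rng = np.random.default_rng(0)
# Majority class: many samples on a simple 2-D sheet in 10-D.
major = np.pad(rng.normal(size=(5000, 2)), ((0, 0), (0, 8)))
# Minority class: few samples on a more complex 6-D structure, shifted away.
minor = np.pad(rng.normal(size=(400, 6)), ((0, 0), (0, 4))) + 5.0

ids = {"major": twonn_id(major), "minor": twonn_id(minor)}
total = sum(ids.values())
weights = {c: d / total for c, d in ids.items()}   # geometry-aware, not count-based
print(ids, weights)
```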

Discrete/Categorical and Graph Domains:

Extensions to categorical and discrete metric data—via I³D—address genomics and survey data where Euclidean assumptions fail, uncovering unexpectedly low-dimensional evolutionary constraints (Macocco et al., 2022). For graph learning, axiomatic ID grounded in the concentration of measure quantifies how neighborhood aggregation compresses or expands representation complexity, directly paralleling and predicting accuracy gains in geometric deep learning (Stubbemann et al., 2022).

5. Limitations, Method Comparison, and Best Practices

ID estimation is subject to several limitations:

  • Finite-Sample Effects: Many estimators are asymptotically unbiased, but require sufficient sample sizes, especially for high true dimensions (typically $N > 10^2 \times d$) (Bac et al., 2020).
  • Scale and Noise Biases: Improper scale or neighborhood choices lead to substantial over- or under-estimation; ABIDE and eDCF are explicit countermeasures (Noia et al., 24 May 2024, Gupta et al., 18 Oct 2025).
  • Manifold Assumptions: Most methods require locally uniform density and approximate Euclideanity; extreme heterogeneity may defeat global ID estimation but can be mitigated by local or mixture models (Allegra et al., 2019).
  • Computational Scaling: Many algorithms are $O(N^2)$ due to pairwise distance computations, with scalable alternatives (e.g., grid/discrete/connectivity approaches or approximations in scikit-dimension) available for large $N$ (Gupta et al., 18 Oct 2025, Bac et al., 2021).
  • Interpretability: Non-integer ID estimates can reflect noise, boundaries, or local mixing of manifold and non-manifold subpopulations (Albergante et al., 2019).

Comparison Across Methods:

Benchmarking on standard synthetic and real-world datasets indicates that no single estimator dominates across data regimes (Bac et al., 2021). Best practices therefore recommend consensus ID estimates via ensemble averaging, careful calibration of scale/neighbor parameters, and analysis of both global and local ID profiles in exploratory data workflows (Bac et al., 2021).

6. Emerging Directions and Theoretical Advances

Recent work extends ID beyond classical manifold analysis:

  • Relative Intrinsic Dimension: Measures the pairwise separability between distributions $P$ and $Q$, defining $D(P,Q)$ via the log-probability that a point from $P$ is linearly separable from a point of $Q$. This directly bounds learnability and generalization in binary classification (Sutton et al., 2023).
  • Fractal/Topological ID: Grid connectivity and box-counting methods estimate fractal dimensions of classifier boundaries and non-manifold supports (Gupta et al., 18 Oct 2025).
  • Adversarial Detection via ID: Variation in the ID of input–loss gradients serves as a geometric fingerprint for adversarial samples in deep learning; adversarial perturbations collapse ID, distinguishable from natural data (Razmjoo et al., 14 Dec 2025).

Open questions include robust estimation under strong inhomogeneity, adaptation to streaming or online settings, theoretical error bounds at finite $N$, and generalization beyond Euclidean metrics or continuous data. Algorithmic advances emphasize scalability (e.g., for $N \sim 10^8$) and integration with geometric and graph-based modeling (Stubbemann et al., 2022, Gupta et al., 18 Oct 2025).


In sum, intrinsic dimensionality provides a foundational, theoretically grounded, and practically actionable measure of data complexity. It is indispensable for understanding, analyzing, and reducing high-dimensional data across scientific, engineering, and data-driven disciplines, with continuing methodological innovation broadening its relevance and applicability (Ansuini et al., 2019, Cazzaniga et al., 17 Nov 2025, Stubbemann et al., 2022, Bac et al., 2021).
