Local Intrinsic Dimensionality (LID)

Updated 2 December 2025
  • LID measures the minimal number of latent dimensions needed to describe data locally around a given point; it can be defined equivalently through volume-growth or derivative-based (tail-index) formulations.
  • Estimation techniques range from neighbor-based Hill estimators to advanced score- and density-based methods that handle high-dimensional data effectively.
  • Practical applications of LID include identifying adversarial perturbations, enhancing clustering, validating generative models, and informing neural architecture search.

Local Intrinsic Dimensionality (LID) quantifies, at the location of a given data point, the minimal number of latent degrees of freedom—or manifold dimension—needed to locally describe the distribution of its neighbors. Formally, for a dataset in high-dimensional ambient space, LID captures the local dimensionality of the underlying data manifold, directly linking geometric locality, data complexity, and statistical estimation. It is a central tool and theoretical construct for understanding submanifold structure in data, analyzing adversarial perturbations, guiding clustering and segmentation, and benchmarking the efficiency of generative models and neural networks.

1. Mathematical Formalism and Foundational Definitions

LID is defined via local measure or volume growth. For a random variable X \in \mathbb{R}^D with distribution p, and for a reference point x, denote by F_x(r) = \mathbb{P}(\|X - x\| \leq r) the local probability mass within radius r of x. There are several equivalent definitions under mild smoothness conditions:

  • Volume-Growth Definition:

\mathrm{LID}(x) = \lim_{r\to 0^{+}} \frac{\ln F_x(r)}{\ln r}

When the data locally populates a d-dimensional manifold, this limit yields d (Allegra et al., 2019, Bac et al., 2020, He et al., 2022, Tempczyk et al., 2022).

  • Derivative-Based (Population) Definition:

For absolutely continuous F_x,

\mathrm{LID}(x) = \lim_{r\to 0^+} \frac{r\,F'_x(r)}{F_x(r)}

This form coincides with the tail index of the generalized Pareto law for neighbor distances (Weerasinghe et al., 2021, Amsaleg et al., 2022).

  • Extension to Manifolds:

For x in a d-dimensional embedded submanifold M of \mathbb{R}^D, with p continuous and positive at x,

\mathrm{LID}_p(x) = d

where d is the topological dimension of the connected component containing x (Leung et al., 25 Jun 2025, Yeats et al., 14 Oct 2025).

  • Gaussian Smoothing Connection:

Let \rho_\delta(x) be the density at x after smoothing with isotropic Gaussian noise of standard deviation e^\delta:

\log \rho_\delta(x) \sim (d-D)\,\delta + O(1)\,,\quad \delta\to -\infty

yielding

d = D + \lim_{\delta \to -\infty} \frac{\partial}{\partial \delta} \log \rho_\delta(x)

This identity is central for diffusion-model-based LID estimators (Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025).
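
As a concrete check of this smoothing identity (an illustration, not drawn from the cited papers), the sketch below uses a toy case where the smoothed density has a closed form: a standard Gaussian supported on a d-dimensional linear subspace of R^D. The function name, the finite-difference step, and the chosen dimensions are illustrative assumptions.

```python
# Toy check of d = D + lim_{delta -> -inf} d/d(delta) log rho_delta(x), assuming the
# data law is N(0, I_d) on a d-dimensional linear subspace of R^D (closed-form density).
import numpy as np

def log_rho_delta_at_origin(d, D, delta):
    """Log smoothed density at x = 0 after adding isotropic Gaussian noise of std exp(delta)."""
    sigma2 = np.exp(2.0 * delta)
    # On-manifold coordinates have variance 1 + sigma^2; normal coordinates have variance sigma^2.
    return (-0.5 * d * np.log(2 * np.pi * (1 + sigma2))
            - 0.5 * (D - d) * np.log(2 * np.pi * sigma2))

d, D = 3, 10
delta, eps = -8.0, 1e-3              # very small noise scale, finite-difference step
slope = (log_rho_delta_at_origin(d, D, delta + eps)
         - log_rho_delta_at_origin(d, D, delta - eps)) / (2 * eps)
print(D + slope)                     # approximately 3.0, recovering the manifold dimension d
```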

2. Estimation Methodologies: Classical and Modern

LID estimation techniques can be broadly classified:

  • Neighbor-Based Maximum Likelihood Estimators:

For k nearest-neighbor distances r_1 \leq \ldots \leq r_k from x,

\widehat{\mathrm{LID}}(x) = -\left(\frac{1}{k}\sum_{i=1}^k \ln \frac{r_i(x)}{r_k(x)}\right)^{-1}

This Hill estimator is justified by extreme-value theory; its bias and variance both decay as O(1/k) (Lu et al., 2018, Amsaleg et al., 2022, He et al., 2022). A minimal implementation sketch follows this list.

  • Pairwise/Nearby-Only Efficient Estimators:

Tight locality estimators (TLE) aggregate all O(k^2) pairwise distances within the k-neighborhood, greatly reducing variance while maintaining consistency (Amsaleg et al., 2022).

  • Concentration-of-Measure and Linear Separability:

Estimators harness high-dimensional phenomena such as linear separability after whitening and normalization, using closed-form formulas involving the observed inseparability probability and the Lambert W function (Bac et al., 2020).

  • Parametric Density Estimation:

LIDL uses highly expressive density estimators (normalizing flows, neural likelihoods) to regress \log \rho_\delta(x) against \log \delta across several noise scales. The slope yields d - D, allowing LID estimation in thousands of dimensions with robustness to boundary noise (Tempczyk et al., 2022).

  • Deep Generative/Score-Based Methods:

The FLIPD estimator employs the learned score of a pretrained diffusion model to compute

\mathrm{FLIPD}(x) = D + \sigma^2(t)\left[\mathrm{Tr}(\nabla_x \hat s(\psi(t)x, t)) + \|\hat s(\psi(t)x, t)\|^2 \right]

at low noise time t, requiring only a few Jacobian trace estimates (Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025, Yeats et al., 14 Oct 2025).

Recently, it was shown that the pointwise denoising score matching (DSM) loss is lower-bounded by LID, enabling highly scalable and memory-efficient estimation via

E(x) = \mathbb{E}_{\xi}\| \xi/\sigma + s_\theta(x+\xi) \|^2 \geq \mathrm{LID}(x)

(Yeats et al., 14 Oct 2025).
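
The sketch referenced in the first item above applies the Hill/MLE formula to a synthetic 2-dimensional surface embedded in R^10. The function and the data-generating setup are illustrative assumptions, not a reference implementation from the cited papers.

```python
# Minimal sketch of the neighbor-based Hill/MLE estimator, applied to a curved 2-D sheet in R^10.
import numpy as np

def lid_mle(x, data, k=100):
    """Hill/MLE estimate of LID at query point x from its k nearest neighbors in data."""
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[dists > 0][:k]                 # drop zero distances (x itself, exact duplicates)
    return -1.0 / np.mean(np.log(r / r[-1]))

rng = np.random.default_rng(0)
u, v = rng.uniform(0.2, 1.0, (2, 5000))                                # two latent coordinates
X = np.zeros((5000, 10))
X[:, 0], X[:, 1], X[:, 2] = u * np.cos(6 * u), v, u * np.sin(6 * u)    # a curved 2-D surface

print(lid_mle(X[0], X, k=100))               # typically close to 2, the intrinsic dimension
```

In practice the choice of k trades bias (large neighborhoods see curvature and boundaries) against variance, consistent with the O(1/k) rates noted above.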

3. Theoretical Guarantees and Decomposition Principles

Several key theoretical results establish LID’s foundational status:

  • Consistency and Normality:

Hill-style neighbor-based estimators (and their variants) are consistent and asymptotically normal under mild conditions as k\to\infty with k/n\to 0 (Amsaleg et al., 2022).

  • Submanifold Correctness:

It is proven that for data concentrated on a d-dimensional smooth submanifold, both the classical neighbor-based definition and the smoothing-based definition via the derivative of the log-marginal density rigorously recover d (Leung et al., 25 Jun 2025).

  • Axis-Aligned Decomposition:

The sum of axis-projected LID contributions equals the total LID at a point:

\mathrm{LID}_F(x) = \sum_{i=1}^m \mathrm{LID}_{F,i}(x)

enabling identification of relevant subspaces for clustering and interpretation (Becker et al., 2019).

  • Mixture Models and Segmentation:

LID can be used within mixture models (e.g., using the TWO-NN estimator under Pareto likelihoods) to segment data into regions of constant local dimensionality, revealing underlying structure and heterogeneity in real data (Allegra et al., 2019). A minimal single-component TWO-NN sketch follows this list.

  • Denoising and Normal Dimension Lower Bounds:

It is formally shown that the DSM loss is lower-bounded by the LID, and that the negative implicit score matching loss is lower-bounded by the “normal dimension” D-d, connecting score-based model objectives and geometry directly (Yeats et al., 14 Oct 2025).
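
As a companion to the mixture-model item above, here is a minimal sketch of a single, pooled TWO-NN estimate. It assumes the standard Pareto approximation for the ratio of second to first neighbor distances and uses its maximum-likelihood form; function names and the synthetic data are illustrative, and the cited mixture-model pipeline is considerably more involved.

```python
# Pooled TWO-NN-style estimate: the ratio mu = r2/r1 of the two nearest-neighbor distances
# is approximately Pareto with exponent d, giving the MLE d_hat = N / sum(log mu_i).
import numpy as np

def two_nn_dimension(data):
    """Pooled TWO-NN MLE of intrinsic dimension from first/second neighbor distance ratios."""
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(dist, np.inf)
    r12 = np.partition(dist, 1, axis=1)[:, :2]                            # r1, r2 for each point
    mu = r12[:, 1] / r12[:, 0]
    return len(data) / np.sum(np.log(mu))

rng = np.random.default_rng(2)
X = np.zeros((2000, 8))
X[:, :3] = rng.uniform(0, 1, (2000, 3))      # 3-dimensional data embedded in R^8
print(two_nn_dimension(X))                   # typically close to 3
```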

4. Practical Applications and Implications

LID and its estimators have notable impact across a spectrum of machine learning and data science tasks:

  • Adversarial Example Detection:

Adversarially perturbed inputs consistently exhibit increased LID relative to genuine data, supporting the use of LID as a signature for attack detection. Theoretical lower bounds on LID grow monotonically with perturbation size, rigorously explaining this phenomenon (Lu et al., 2018, Weerasinghe et al., 2021). A simplified scoring sketch follows this list.

  • Outlier and OOD Detection:

Data points lying off the main manifold—due to rarity, noise, or generative artifacts—display anomalous LID, facilitating unsupervised anomaly detection (Leung et al., 25 Jun 2025, Kamkari et al., 5 Jun 2024).

  • Data Segmentation and Clustering:

Hidalgo and related algorithms segment datasets into regions of differing LID, which correlate with interpretable classes of physical or semantic states (e.g., folded vs. unfolded proteins, firm risk strata) (Allegra et al., 2019, Becker et al., 2019).

  • Neural Architecture Search:

LID profiles of subnet activations yield superior separability and efficiency in NAS, outperforming gradient-matching metrics and dramatically reducing GPU memory requirements (He et al., 2022).

  • Density Estimation and Model Validation:

LID estimation via normalizing flows (LIDL) and diffusion models (FLIPD) provides direct quantitative probes into the complexity of learned latent manifolds and enables benchmarking of generative quality and diversity (Tempczyk et al., 2022, Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025).

  • Generalization and Robustness Analysis:

The LID of neural network representations correlates with model generalization and robustness, supporting its adoption in model selection and regularization regimes (Kamkari et al., 5 Jun 2024).
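
The simplified sketch below illustrates the detection idea from the adversarial-example item above: the LID of a test input is estimated against a batch of clean reference points, and larger values are treated as suspicious. It is a toy illustration under an assumed setup (a 2-D sheet in R^32), not the pipeline of the cited papers, which operate on minibatches of deep representations.

```python
# Toy illustration: off-manifold (e.g., perturbed) points receive a much larger
# LID score against a clean reference set than on-manifold points do.
import numpy as np

def lid_score(x, reference, k=20):
    """Hill/MLE estimate of LID of x with respect to a clean reference batch."""
    r = np.sort(np.linalg.norm(reference - x, axis=1))
    r = r[r > 0][:k]
    return -1.0 / np.mean(np.log(r / r[-1]))

rng = np.random.default_rng(1)
clean = np.zeros((5000, 32))
clean[:, :2] = rng.uniform(0, 1, (5000, 2))          # clean data on a 2-D sheet in R^32
x_clean = clean[0]
x_pert = x_clean + 0.02 * rng.normal(size=32)        # small off-manifold perturbation

print("clean LID score:    ", round(lid_score(x_clean, clean[1:]), 1))   # close to 2
print("perturbed LID score:", round(lid_score(x_pert, clean[1:]), 1))    # substantially larger
```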

5. Algorithmic and Computational Considerations

Distinct computational properties and trade-offs underlie the main LID estimation methodologies:

| Method | Sample/Model Requirements | Dimensionality Scaling |
|---|---|---|
| kNN/Hill (MLE) | k ~ 100 neighbors | Fails for D ≫ 100 |
| TLE (Tight LID) | All pairs in the k-neighborhood | Robust for k ~ 20–50 |
| TWO-NN | Only two NN distance ratios | Fastest among neighbor-based |
| Concentration | PCA + whitening, matrix ops | Moderate (D ~ 100) |
| LIDL | Multiple density fits (flows) | Stable to D ~ 4000 |
| FLIPD (diffusion) | One forward pass, Jacobian trace | Scales to D > 10^4 |
| DSM Loss | Forward passes only | Minimal GPU memory |

Neighbor methods are effective and fast in moderate dimensions, but break down in very high D due to boundary bias and sample-size constraints (Tempczyk et al., 2022, Amsaleg et al., 2022). Parametric/score-based methods (LIDL, FLIPD, DSM loss) leverage advances in generative modelling; LIDL requires multiple density model fits, FLIPD relies on score networks and Jacobian traces, while the DSM loss is both memory-efficient and robust even on massive latent spaces (Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025, Yeats et al., 14 Oct 2025).
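
The Jacobian trace appearing in the FLIPD formula of Section 2 is typically approximated stochastically. Below is a rough sketch of a Hutchinson-style trace estimate given a hypothetical score_fn for a fixed small noise level; the interface, the omission of the ψ(t) scaling, and the sanity check are simplifying assumptions rather than the FLIPD reference implementation.

```python
# Hutchinson-style sketch of D + sigma^2 [Tr(d s/d x) + ||s||^2] for an assumed score_fn.
import torch

def flipd_like_estimate(score_fn, x, sigma_t, n_probes=8):
    """Estimate the FLIPD-style quantity at x using Rademacher probes for the Jacobian trace."""
    x = x.detach().clone().requires_grad_(True)
    s = score_fn(x)
    trace = x.new_zeros(())
    for _ in range(n_probes):
        v = torch.randint_like(x, 0, 2) * 2 - 1                       # Rademacher probe in {-1, +1}
        (vjp,) = torch.autograd.grad(s, x, grad_outputs=v, retain_graph=True)
        trace = trace + (v * vjp).sum() / n_probes                    # E[v^T J v] = Tr(J)
    return float(x.numel() + sigma_t ** 2 * (trace + (s ** 2).sum()))

# Sanity check with a known answer: a point mass at 0 smoothed by noise of std sigma has
# score s(x) = -x / sigma^2, so Tr(J) = -D / sigma^2 and the estimate at x = 0 is 0 (= LID).
D, sigma = 16, 0.1
print(flipd_like_estimate(lambda x: -x / sigma ** 2, torch.zeros(D), sigma_t=sigma))
```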

6. Limitations, Open Questions, and Future Research

  • Manifold Approximation Limits:

Classical and modern LID estimators assume the data populates an embedded manifold with negligible off-manifold noise. Deviations—such as stratified, multi-scale, or tubular geometric structures—can bias estimates unless scale parameters are judiciously chosen (Tempczyk et al., 2022, Allegra et al., 2019).

  • Boundary and Curvature Issues:

High-dimensional boundary effects and local manifold curvature introduce estimator bias and variance that can only partially be mitigated by TLE and smoothing (Amsaleg et al., 2022).

  • Modeling and Computation:

LIDL inherits density model imperfections; FLIPD/DSM may be expensive for very deep models unless approximations (e.g., Hutchinson trace estimators) are used (Kamkari et al., 5 Jun 2024, Yeats et al., 14 Oct 2025).

  • Theoretical Justification:

The correctness of FLIPD is now fully established for general smooth manifolds; analogs for uniform convolution estimators are also rigorously proven, thus unifying multiple strands of LID estimation into a single geometric framework (Leung et al., 25 Jun 2025).

  • Open Directions:

Optimization of hyperparameters (e.g., kk, noise scale), multi-scale and multi-model ensemble estimation, bias-variance balancing for DSM vs. FLIPD, and the design of regularization objectives incorporating LID into supervised learning are promising areas (Yeats et al., 14 Oct 2025).

7. Empirical Benchmarks and Applications in Real Data

Validated across synthetic and real benchmarks, LID estimation delivers meaningful inferences on data structure:

| Dataset/Domain | Method | Key Observations |
|---|---|---|
| Protein folding (32D) | Hidalgo | Folded vs. unfolded states segregate by LID |
| fMRI voxels (202D) | Hidalgo | Active vs. inactive areas show high-LID/low-LID segmentation |
| Corporate balance sheets | Hidalgo | Financial risk correlates inversely with LID |
| MNIST, CIFAR, LAION | FLIPD, LIDL | LID rank closely matches PNG compression/semantic complexity |
| Deep neural subnets | NAS-LID | Layer-wise LID vector offers a high-fidelity architecture similarity metric |

Nonparametric and parametric LID estimators have thus revealed region-, class-, and state-dependent variations in local dimension that correspond with interpretable latent factors and class boundaries. In deep learning, LID has become a central metric for subspace analysis, adversarial vulnerability quantification, and large-scale, pointwise data complexity assessment (Allegra et al., 2019, He et al., 2022, Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025).
