Local Intrinsic Dimensionality (LID)

Updated 2 December 2025
  • LID measures the minimal number of latent dimensions needed to describe data locally around a given point; it can be defined equivalently through volume-growth or derivative-based (tail-index) formulations.
  • Estimation techniques range from neighbor-based Hill estimators to advanced score- and density-based methods that handle high-dimensional data effectively.
  • Practical applications of LID include identifying adversarial perturbations, enhancing clustering, validating generative models, and informing neural architecture search.

Local Intrinsic Dimensionality (LID) quantifies, at the location of a given data point, the minimal number of latent degrees of freedom—or manifold dimension—needed to locally describe the distribution of its neighbors. Formally, for a dataset in high-dimensional ambient space, LID captures the local dimensionality of the underlying data manifold, directly linking geometric locality, data complexity, and statistical estimation. It is a central tool and theoretical construct for understanding submanifold structure in data, analyzing adversarial perturbations, guiding clustering and segmentation, and benchmarking the efficiency of generative models and neural networks.

1. Mathematical Formalism and Foundational Definitions

LID is defined via local measure or volume growth. For a random variable X \in \mathbb{R}^D with distribution p, and for a reference point x, denote by F_x(r) = \mathbb{P}(\|X - x\| \leq r) the local probability mass within radius r of x. There are several equivalent definitions under mild smoothness conditions:

  • Volume-Growth Definition:

\mathrm{LID}(x) = \lim_{r\to 0^{+}} \frac{\ln F_x(r)}{\ln r}

When the data locally populates a d-dimensional manifold, this limit yields d (Allegra et al., 2019, Bac et al., 2020, He et al., 2022, Tempczyk et al., 2022).

  • Derivative-Based (Population) Definition:

For absolutely continuous F_x,

\mathrm{LID}(x) = \lim_{r\to 0^+} \frac{r\,F'_x(r)}{F_x(r)}

This form coincides with the tail index of the generalized Pareto law for neighbor distances (Weerasinghe et al., 2021, Amsaleg et al., 2022).

  • Extension to Manifolds:

For x in a d-dimensional embedded submanifold M of \mathbb{R}^D, with p continuous and positive at x,

\mathrm{LID}_p(x) = d

where d is the topological dimension of the connected component containing x (Leung et al., 25 Jun 2025, Yeats et al., 14 Oct 2025).

  • Gaussian Smoothing Connection:

Let \rho_\delta(x) be the density at x after smoothing with isotropic Gaussian noise of standard deviation e^\delta:

\log \rho_\delta(x) \sim (d-D)\,\delta + O(1)\,,\quad \delta\to -\infty

yielding

d = D + \lim_{\delta \to -\infty} \frac{\partial}{\partial \delta} \log \rho_\delta(x)

This identity is central for diffusion-model-based LID estimators (Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025).
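
As a concrete check of this smoothing identity (an illustration, not drawn from the cited papers), the sketch below uses a toy case where the smoothed density has a closed form: a standard Gaussian supported on a d-dimensional linear subspace of R^D. The function name, the finite-difference step, and the chosen dimensions are illustrative assumptions.

```python
# Toy check of d = D + lim_{delta -> -inf} d/d(delta) log rho_delta(x), assuming the
# data law is N(0, I_d) on a d-dimensional linear subspace of R^D (closed-form density).
import numpy as np

def log_rho_delta_at_origin(d, D, delta):
    """Log smoothed density at x = 0 after adding isotropic Gaussian noise of std exp(delta)."""
    sigma2 = np.exp(2.0 * delta)
    # On-manifold coordinates have variance 1 + sigma^2; normal coordinates have variance sigma^2.
    return (-0.5 * d * np.log(2 * np.pi * (1 + sigma2))
            - 0.5 * (D - d) * np.log(2 * np.pi * sigma2))

d, D = 3, 10
delta, eps = -8.0, 1e-3              # very small noise scale, finite-difference step
slope = (log_rho_delta_at_origin(d, D, delta + eps)
         - log_rho_delta_at_origin(d, D, delta - eps)) / (2 * eps)
print(D + slope)                     # approximately 3.0, recovering the manifold dimension d
```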

2. Estimation Methodologies: Classical and Modern

LID estimation techniques can be broadly classified:

  • Neighbor-Based Maximum Likelihood Estimators:

For k nearest-neighbor distances r_1 \leq \ldots \leq r_k from x,

\widehat{\mathrm{LID}}(x) = -\left(\frac{1}{k}\sum_{i=1}^k \ln \frac{r_i(x)}{r_k(x)}\right)^{-1}

This Hill estimator is justified by extreme-value theory; its bias and variance both decay as O(1/k) (Lu et al., 2018, Amsaleg et al., 2022, He et al., 2022). A minimal implementation sketch follows this list.

  • Pairwise/Nearby-Only Efficient Estimators:

Tight locality estimators (TLE) aggregate all O(k^2) pairwise distances within the k-neighborhood, greatly reducing variance while maintaining consistency (Amsaleg et al., 2022).

  • Concentration-of-Measure and Linear Separability:

Estimators harness high-dimensional phenomena such as linear separability after whitening and normalization, using closed-form formulas involving the observed inseparability probability and the Lambert W function (Bac et al., 2020).

  • Parametric Density Estimation:

LIDL uses highly expressive density estimators (normalizing flows, neural likelihoods) to regress \log \rho_\delta(x) against \log \delta across several noise scales. The slope yields d - D, allowing LID estimation in thousands of dimensions with robustness to boundary noise (Tempczyk et al., 2022).

  • Deep Generative/Score-Based Methods:

The FLIPD estimator employs the learned score of a pretrained diffusion model to compute

\mathrm{FLIPD}(x) = D + \sigma^2(t)\left[\mathrm{Tr}(\nabla_x \hat s(\psi(t)x, t)) + \|\hat s(\psi(t)x, t)\|^2 \right]

at low noise time t, requiring only a few Jacobian trace estimates (Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025, Yeats et al., 14 Oct 2025).

Recently, it was shown that the pointwise denoising score matching (DSM) loss is lower-bounded by LID, enabling highly scalable and memory-efficient estimation via

E(x) = \mathbb{E}_{\xi}\| \xi/\sigma + s_\theta(x+\xi) \|^2 \geq \mathrm{LID}(x)

(Yeats et al., 14 Oct 2025).
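
The sketch referenced in the first item above applies the Hill/MLE formula to a synthetic 2-dimensional surface embedded in R^10. The function and the data-generating setup are illustrative assumptions, not a reference implementation from the cited papers.

```python
# Minimal sketch of the neighbor-based Hill/MLE estimator, applied to a curved 2-D sheet in R^10.
import numpy as np

def lid_mle(x, data, k=100):
    """Hill/MLE estimate of LID at query point x from its k nearest neighbors in data."""
    dists = np.sort(np.linalg.norm(data - x, axis=1))
    r = dists[dists > 0][:k]                 # drop zero distances (x itself, exact duplicates)
    return -1.0 / np.mean(np.log(r / r[-1]))

rng = np.random.default_rng(0)
u, v = rng.uniform(0.2, 1.0, (2, 5000))                                # two latent coordinates
X = np.zeros((5000, 10))
X[:, 0], X[:, 1], X[:, 2] = u * np.cos(6 * u), v, u * np.sin(6 * u)    # a curved 2-D surface

print(lid_mle(X[0], X, k=100))               # typically close to 2, the intrinsic dimension
```

In practice the choice of k trades bias (large neighborhoods see curvature and boundaries) against variance, consistent with the O(1/k) rates noted above.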

3. Theoretical Guarantees and Decomposition Principles

Several key theoretical results establish LID’s foundational status:

  • Consistency and Normality:

Hill-style neighbor-based estimators (and their variants) are consistent and asymptotically normal under mild conditions as k\to\infty with k/n\to 0 (Amsaleg et al., 2022).

  • Submanifold Correctness:

It is proven that for data concentrated on a d-dimensional smooth submanifold, both the classical neighbor-based definition and the smoothing-based definition via the derivative of the log-marginal density rigorously recover d (Leung et al., 25 Jun 2025).

  • Axis-Aligned Decomposition:

The sum of axis-projected LID contributions equals the total LID at a point:

\mathrm{LID}_F(x) = \sum_{i=1}^m \mathrm{LID}_{F,i}(x)

enabling identification of relevant subspaces for clustering and interpretation (Becker et al., 2019).

  • Mixture Models and Segmentation:

LID can be used within mixture models (e.g., using the TWO-NN estimator under Pareto likelihoods) to segment data into regions of constant local dimensionality, revealing underlying structure and heterogeneity in real data (Allegra et al., 2019). A minimal single-component TWO-NN sketch follows this list.

  • Denoising and Normal Dimension Lower Bounds:

It is formally shown that the DSM loss is lower-bounded by the LID, and that the negative implicit score matching loss is lower-bounded by the “normal dimension” D-d, connecting score-based model objectives and geometry directly (Yeats et al., 14 Oct 2025).
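
As a companion to the mixture-model item above, here is a minimal sketch of a single, pooled TWO-NN estimate. It assumes the standard Pareto approximation for the ratio of second to first neighbor distances and uses its maximum-likelihood form; function names and the synthetic data are illustrative, and the cited mixture-model pipeline is considerably more involved.

```python
# Pooled TWO-NN-style estimate: the ratio mu = r2/r1 of the two nearest-neighbor distances
# is approximately Pareto with exponent d, giving the MLE d_hat = N / sum(log mu_i).
import numpy as np

def two_nn_dimension(data):
    """Pooled TWO-NN MLE of intrinsic dimension from first/second neighbor distance ratios."""
    dist = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(dist, np.inf)
    r12 = np.partition(dist, 1, axis=1)[:, :2]                            # r1, r2 for each point
    mu = r12[:, 1] / r12[:, 0]
    return len(data) / np.sum(np.log(mu))

rng = np.random.default_rng(2)
X = np.zeros((2000, 8))
X[:, :3] = rng.uniform(0, 1, (2000, 3))      # 3-dimensional data embedded in R^8
print(two_nn_dimension(X))                   # typically close to 3
```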

4. Practical Applications and Implications

LID and its estimators have notable impact across a spectrum of machine learning and data science tasks:

  • Adversarial Example Detection:

Adversarially perturbed inputs consistently exhibit increased LID relative to genuine data, supporting the use of LID as a signature for attack detection. Theoretical lower bounds on LID grow monotonically with perturbation size, rigorously explaining this phenomenon (Lu et al., 2018, Weerasinghe et al., 2021). A simplified scoring sketch follows this list.

  • Outlier and OOD Detection:

Data points lying off the main manifold—due to rarity, noise, or generative artifacts—display anomalous LID, facilitating unsupervised anomaly detection (Leung et al., 25 Jun 2025, Kamkari et al., 5 Jun 2024).

  • Data Segmentation and Clustering:

Hidalgo and related algorithms segment datasets into regions of differing LID, which correlate with interpretable classes of physical or semantic states (e.g., folded vs. unfolded proteins, firm risk strata) (Allegra et al., 2019, Becker et al., 2019).

  • Neural Architecture Search:

LID profiles of subnet activations yield superior separability and efficiency in NAS, outperforming gradient-matching metrics and dramatically reducing GPU memory requirements (He et al., 2022).

  • Density Estimation and Model Validation:

LID estimation via normalizing flows (LIDL) and diffusion models (FLIPD) provides direct quantitative probes into the complexity of learned latent manifolds and enables benchmarking of generative quality and diversity (Tempczyk et al., 2022, Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025).

  • Generalization and Robustness Analysis:

The LID of neural network representations correlates with model generalization and robustness, supporting its adoption in model selection and regularization regimes (Kamkari et al., 5 Jun 2024).
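
The simplified sketch below illustrates the detection idea from the adversarial-example item above: the LID of a test input is estimated against a batch of clean reference points, and larger values are treated as suspicious. It is a toy illustration under an assumed setup (a 2-D sheet in R^32), not the pipeline of the cited papers, which operate on minibatches of deep representations.

```python
# Toy illustration: off-manifold (e.g., perturbed) points receive a much larger
# LID score against a clean reference set than on-manifold points do.
import numpy as np

def lid_score(x, reference, k=20):
    """Hill/MLE estimate of LID of x with respect to a clean reference batch."""
    r = np.sort(np.linalg.norm(reference - x, axis=1))
    r = r[r > 0][:k]
    return -1.0 / np.mean(np.log(r / r[-1]))

rng = np.random.default_rng(1)
clean = np.zeros((5000, 32))
clean[:, :2] = rng.uniform(0, 1, (5000, 2))          # clean data on a 2-D sheet in R^32
x_clean = clean[0]
x_pert = x_clean + 0.02 * rng.normal(size=32)        # small off-manifold perturbation

print("clean LID score:    ", round(lid_score(x_clean, clean[1:]), 1))   # close to 2
print("perturbed LID score:", round(lid_score(x_pert, clean[1:]), 1))    # substantially larger
```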

5. Algorithmic and Computational Considerations

Distinct computational properties and trade-offs underlie the main LID estimation methodologies:

| Method | Sample/Model Requirements | Dimensionality Scaling |
|---|---|---|
| kNN/Hill (MLE) | k ~ 100 neighbors | Fails for D ≫ 100 |
| TLE (Tight LID) | All pairs in the k-neighborhood | Robust for k ~ 20–50 |
| TWO-NN | Only two NN distance ratios | Fastest among neighbor-based |
| Concentration | PCA + whitening, matrix ops | Moderate (D ~ 100) |
| LIDL | Multiple density fits (flows) | Stable to D ~ 4000 |
| FLIPD (diffusion) | One forward pass, Jacobian trace | Scales to D > 10^4 |
| DSM Loss | Forward passes only | Minimal GPU memory |

Neighbor methods are effective and fast in moderate dimensions, but break down in very high D due to boundary bias and sample-size constraints (Tempczyk et al., 2022, Amsaleg et al., 2022). Parametric/score-based methods (LIDL, FLIPD, DSM loss) leverage advances in generative modelling; LIDL requires multiple density model fits, FLIPD relies on score networks and Jacobian traces, while the DSM loss is both memory-efficient and robust even on massive latent spaces (Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025, Yeats et al., 14 Oct 2025).
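
The Jacobian trace appearing in the FLIPD formula of Section 2 is typically approximated stochastically. Below is a rough sketch of a Hutchinson-style trace estimate given a hypothetical score_fn for a fixed small noise level; the interface, the omission of the ψ(t) scaling, and the sanity check are simplifying assumptions rather than the FLIPD reference implementation.

```python
# Hutchinson-style sketch of D + sigma^2 [Tr(d s/d x) + ||s||^2] for an assumed score_fn.
import torch

def flipd_like_estimate(score_fn, x, sigma_t, n_probes=8):
    """Estimate the FLIPD-style quantity at x using Rademacher probes for the Jacobian trace."""
    x = x.detach().clone().requires_grad_(True)
    s = score_fn(x)
    trace = x.new_zeros(())
    for _ in range(n_probes):
        v = torch.randint_like(x, 0, 2) * 2 - 1                       # Rademacher probe in {-1, +1}
        (vjp,) = torch.autograd.grad(s, x, grad_outputs=v, retain_graph=True)
        trace = trace + (v * vjp).sum() / n_probes                    # E[v^T J v] = Tr(J)
    return float(x.numel() + sigma_t ** 2 * (trace + (s ** 2).sum()))

# Sanity check with a known answer: a point mass at 0 smoothed by noise of std sigma has
# score s(x) = -x / sigma^2, so Tr(J) = -D / sigma^2 and the estimate at x = 0 is 0 (= LID).
D, sigma = 16, 0.1
print(flipd_like_estimate(lambda x: -x / sigma ** 2, torch.zeros(D), sigma_t=sigma))
```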

6. Limitations, Open Questions, and Future Research

  • Manifold Approximation Limits:

Classical and modern LID estimators assume the data populates an embedded manifold with negligible off-manifold noise. Deviations—such as stratified, multi-scale, or tubular geometric structures—can bias estimates unless scale parameters are judiciously chosen (Tempczyk et al., 2022, Allegra et al., 2019).

  • Boundary and Curvature Issues:

High-dimensional boundary effects and local manifold curvature introduce estimator bias and variance that can only partially be mitigated by TLE and smoothing (Amsaleg et al., 2022).

  • Modeling and Computation:

LIDL inherits density model imperfections; FLIPD/DSM may be expensive for very deep models unless approximations (e.g., Hutchinson trace estimators) are used (Kamkari et al., 5 Jun 2024, Yeats et al., 14 Oct 2025).

  • Theoretical Justification:

The correctness of FLIPD is now fully established for general smooth manifolds; analogs for uniform convolution estimators are also rigorously proven, thus unifying multiple strands of LID estimation into a single geometric framework (Leung et al., 25 Jun 2025).

  • Open Directions:

Optimization of hyperparameters (e.g., kk, noise scale), multi-scale and multi-model ensemble estimation, bias-variance balancing for DSM vs. FLIPD, and the design of regularization objectives incorporating LID into supervised learning are promising areas (Yeats et al., 14 Oct 2025).

7. Empirical Benchmarks and Applications in Real Data

Validated across synthetic and real benchmarks, LID estimation delivers meaningful inferences on data structure:

| Dataset/Domain | Method | Key Observations |
|---|---|---|
| Protein folding (32D) | Hidalgo | Folded vs. unfolded states segregate by LID |
| fMRI voxels (202D) | Hidalgo | Active vs. inactive areas show high-LID/low-LID segmentation |
| Corporate balance sheets | Hidalgo | Financial risk correlates inversely with LID |
| MNIST, CIFAR, LAION | FLIPD, LIDL | LID rank closely matches PNG compression/semantic complexity |
| Deep neural subnets | NAS-LID | Layer-wise LID vector offers a high-fidelity architecture similarity metric |

Nonparametric and parametric LID estimators have thus revealed region-, class-, and state-dependent variations in local dimension that correspond with interpretable latent factors and class boundaries. In deep learning, LID has become a central metric for subspace analysis, adversarial vulnerability quantification, and large-scale, pointwise data complexity assessment (Allegra et al., 2019, He et al., 2022, Kamkari et al., 5 Jun 2024, Leung et al., 25 Jun 2025).
