
Intrinsic Dimension of LLM Representations

Updated 14 January 2026
  • Intrinsic Dimension is defined as the minimal number of degrees of freedom required to describe LLM hidden representations, often far lower than the ambient dimension.
  • Estimation methods like MLE, TwoNN, and ABID leverage local geometric properties—such as neighborhood distances and cosine similarities—to efficiently compute the intrinsic dimension.
  • Empirical findings reveal expansion-compression patterns across LLM layers, where ID insights help assess redundancy, guide low-rank adaptations, and signal privacy vulnerabilities.

Intrinsic dimension (ID) quantifies the minimal number of degrees of freedom required to describe the distribution of LLM hidden representations, independent of their extrinsic dimensionality. For modern transformers, ID offers a geometric lens into the internal structure of learned embeddings, characterizing the local or global manifold on which activations lie. Empirical investigations reveal that, while layer hidden states may live in thousands of ambient dimensions, their manifold dimension is consistently much lower (typically O(10)–O(100)), and highly dependent on model architecture, training dynamics, data domain, and cognitive or linguistic task. ID serves as a core diagnostic for probing representational complexity, redundancy, adaptation under fine-tuning or in-context learning, and privacy risk.

1. Formal Definition and Mathematical Estimators

The ID of a dataset $\mathcal{D} \subset \mathbb{R}^D$ is defined as the manifold dimension $d$ such that $\mathcal{D}$ lies (approximately) on a $d$-dimensional submanifold embedded in $\mathbb{R}^D$ (Joshi et al., 25 Nov 2025). Most practical estimators are rooted in local geometry, specifically the scaling of neighborhood distances or angles, under Poisson or isotropy assumptions. Key estimators include:

  • Maximum-Likelihood Estimator (MLE) (Levina & Bickel): For each point $x$ with $k$ nearest neighbors at distances $d_1 \le \cdots \le d_k$,

$$\mathrm{LID}_k(x) = \left[ \frac{1}{k-1} \sum_{i=1}^{k-1} \ln \frac{d_k}{d_i} \right]^{-1},$$

and average over points for the global ID (Kataiwa et al., 4 Mar 2025).

  • TwoNN Estimator (Facco et al.): Compute the ratio $\mu_i = d_2/d_1$ of second- to first-neighbor distances for each point and solve for $d$ by fitting the empirical CDF of $\mu$, which under local uniformity follows $F(\mu) = 1 - \mu^{-d}$. The maximum-likelihood solution is

$$\hat{d} = \frac{N}{\sum_{i=1}^{N} \ln \mu_i}.$$

TwoNN is commonly used due to its minimal hyperparameters (only the two nearest neighbors) and computational efficiency (Arnold, 11 Jun 2025, Pedashenko et al., 19 Nov 2025, Janapati et al., 2024, Baroni et al., 7 Jan 2026).

  • Angle-Based ABID: For each point $x$ with $k$ nearest neighbors, let $s_{ij} = \langle v_i, v_j \rangle$ be the cosine similarities of the normalized displacement vectors $v_i = (x_i - x)/\lVert x_i - x \rVert$. The empirical second moment,

$$\widehat{\mathbb{E}}[s^2] = \frac{1}{k^2} \sum_{i,j=1}^{k} s_{ij}^2,$$

gives the ABID estimator:

$$\mathrm{ABID}_k(x) = \frac{1}{\widehat{\mathbb{E}}[s^2]} = \frac{k^2}{\sum_{i,j} s_{ij}^2}.$$

This estimator directly links second-order cosine moments to intrinsic dimension (ABID ≤ affinely spanned dimension) (Thordsen et al., 2020).

  • Spectral/Ritz–Chebyshev: For large samples, the dimension $d$ required to explain a target variance fraction $\tau$ is estimated by counting leading covariance eigenvalues via Ritz values and Chebyshev projectors, without a full eigendecomposition (Özçoban et al., 12 Mar 2025).

Persistent homology, tight local estimators, and GRIDE $(n_1, n_2)$-neighbor generalizations supplement this toolkit, depending on geometry and sample regime (Pedashenko et al., 19 Nov 2025, Joshi et al., 25 Nov 2025).
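As a concrete reference, the estimators above can be sketched in a few lines of NumPy. These are minimal brute-force implementations for illustration only (function names are ours): production use would swap in FAISS for the neighbor search and, for the spectral variant, Ritz–Chebyshev matrix–vector counting instead of the full eigendecomposition used here.

```python
import numpy as np

def _knn_dists(X, k):
    """Brute-force k-NN distances (ascending), excluding the point itself."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return np.sort(d, axis=1)[:, 1:k + 1]

def mle_id(X, k=20):
    """Levina-Bickel MLE of local ID, arithmetic mean over points."""
    knn = _knn_dists(X, k)
    logs = np.log(knn[:, -1:] / knn[:, :-1])   # ln(d_k / d_i), i < k
    return ((k - 1) / logs.sum(axis=1)).mean()

def twonn_id(X):
    """TwoNN: maximum-likelihood closed form d = N / sum ln(d2/d1)."""
    knn = _knn_dists(X, 2)
    mu = knn[:, 1] / knn[:, 0]
    return len(X) / np.log(mu).sum()

def abid(X, k=50):
    """Angle-based ID: inverse empirical second moment of neighbor cosines."""
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d, axis=1)[:, 1:k + 1]   # k nearest neighbors
    ids = []
    for i in range(len(X)):
        V = X[order[i]] - X[i]                  # displacement vectors
        V /= np.linalg.norm(V, axis=1, keepdims=True)
        S = V @ V.T                             # cosine similarity matrix
        ids.append(k ** 2 / (S ** 2).sum())     # 1 / E-hat[s^2]
    return float(np.mean(ids))

def variance_id(X, tau=0.9):
    """Count leading covariance eigenvalues explaining a fraction tau of variance."""
    Xc = X - X.mean(0)
    eig = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / (len(X) - 1)))[::-1]
    return int(np.searchsorted(np.cumsum(eig) / eig.sum(), tau) + 1)
```

On a toy 2-manifold embedded in $\mathbb{R}^{10}$ (e.g., a uniform planar patch padded with zeros), all four sketches recover an ID near 2, while the ambient dimension is 10.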

2. Practical Computation on LLM Representations

ID is most often estimated on the following objects:

  1. Embedding matrices (token/word type): Rows are token vectors from the learned embedding matrix.
  2. Hidden states (contextualized, per-layer): For fixed-length sequences, extract and aggregate last-token or per-position hidden activations layerwise.

Estimation procedures follow three broad steps:

  • Neighbor search: For each point in the cloud (token or hidden state), find its $k$ nearest neighbors. For high-dimensional LLM embeddings, FAISS or other approximate methods are recommended (Thordsen et al., 2020, Kataiwa et al., 4 Mar 2025). For ABID, points are typically pre-normalized for cosine geometry.
  • Estimator computation: Distances or angles among neighbors are used per the above estimator formulas.
  • Aggregation: Local ID estimates are averaged (arithmetic or harmonic mean) over all samples, or stratified by layer, data domain, or other metadata.
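The aggregation step deserves care. A minimal sketch (helper name ours) contrasting the two common pooling choices:

```python
import numpy as np

def aggregate_lid(local_ids, how="harmonic"):
    """Pool per-point local ID estimates into a global value.

    The harmonic mean is equivalent to averaging the per-point inverse
    estimates before inverting, and is less sensitive to heavy-tailed
    local outliers than the arithmetic mean.
    """
    local_ids = np.asarray(local_ids, dtype=float)
    if how == "harmonic":
        return len(local_ids) / (1.0 / local_ids).sum()
    return local_ids.mean()
```

For example, local estimates of 1 and 4 pool to 1.6 harmonically but 2.5 arithmetically; a few inflated local values therefore distort the arithmetic mean far more.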

Computational complexity is $O(kD)$ to $O(ND)$ per point, where $N$ is the number of samples, $k$ is the neighborhood size, and $D$ is the representation dimension.

For global estimates (entire embedding layer or layerwise hidden states), neighborhood sizes of roughly $k \approx 10$–$100$ are common, with stability checks across $k$ (Joshi et al., 25 Nov 2025, Kataiwa et al., 4 Mar 2025). For robust local ID via ABID, choosing $k$ well above the expected ID is recommended (Thordsen et al., 2020). For spectral (variance) estimators, matrix–vector products scale linearly in the data size and are highly parallelizable (Özçoban et al., 12 Mar 2025).

3. ID in LLM Geometry: Empirical Findings

Redundancy and Compression

Across embedding layers from Word2Vec (ED = 300) to Pythia-12B (ED = 5120), observed ID is dramatically lower: e.g., GloVe and Word2Vec have ID ≈ 25 (≈8.3% of ED), and large LLMs with ED ≈ 2000–5000 exhibit ID ≈ 25–120 (redundancy >97%) (Kataiwa et al., 4 Mar 2025). During training, ID rapidly collapses from the ambient ED toward its final value early on, after which only minor refinements occur (“neural collapse”) (Kataiwa et al., 4 Mar 2025). Low ID means the embedding manifold comprises surprisingly few semantic directions despite the nominal parameter scale.

In MCQA and sequence modeling tasks, all tested LLMs display a universal “ID hunchback”: early layers encode with low ID, mid-layers expand and peak (signaling maximum abstraction/complexity), and final layers sharply compress before output (Joshi et al., 25 Nov 2025, Baroni et al., 7 Jan 2026). Peak ID values are 25–100 versus ambient dimensions of thousands.
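Given per-layer activations stacked into one array, the layerwise profile exhibiting this expansion–compression shape can be computed with any global estimator; a minimal sketch (names ours, any estimator function can be plugged in):

```python
import numpy as np

def layerwise_id_profile(acts, estimator):
    """ID per layer from stacked activations of shape (L, N, D):
    L layers, N samples (e.g., last-token hidden states), D ambient dim."""
    return [float(estimator(acts[l])) for l in range(acts.shape[0])]

def id_peak_layer(profile):
    """Index of the layer where the ID profile peaks (the mid-layer hump)."""
    return int(np.argmax(profile))
```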

Domain and Text-Type Structure

ID robustly stratifies by textual genre and function: scientific/technical corpus (arXiv, PubMed) manifests low hidden state ID (~8), general Wikipedia and news are intermediate (~9), and creative/narrative domains (stories, opinion) have high ID (~10.5) (Pedashenko et al., 19 Nov 2025). These effects persist across models and ID estimators.

Steering experiments with sparse autoencoders show that formally “scientific” feature directions produce minor reductions in ID, while narrative/personalization directions increase ID, confirming causality (Pedashenko et al., 19 Nov 2025).

Task, Linguistic, and Learning Paradigm Dependence

In controlled syntax experiments, ID scales strongly with formal syntactic complexity (e.g., subordinated > coordinated sentences by ΔID ~5–20), aligning with mid-layer abstraction peaks, but less so with functional or semantic contrasts (Baroni et al., 7 Jan 2026). Information-theoretic entropy and ID are complementary: after normalizing for length, they are uncorrelated, as ID measures geometric, not prediction, complexity (Pedashenko et al., 19 Nov 2025).

Supervised fine-tuning (SFT) versus in-context learning (ICL): SFT compresses and regularizes representation manifolds (lowers ID), while increasing demonstration count in ICL raises mid- and late-layer ID, plateauing after a few examples (Janapati et al., 2024). High ID under ICL suggests richer, more entangled representations, whereas SFT aligns representations more tightly to labels.

4. Interpretations and Implications

Model Efficiency and Compression: ID quantifies representational redundancy. Selecting LoRA or other low-rank adaptation ranks near the measured ID yields substantial compression without appreciable loss in perplexity (Kataiwa et al., 4 Mar 2025). A low ID signals that most model capacity is held in “reserve”: only a small subspace is operationally used after training (Kataiwa et al., 4 Mar 2025).
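The rank-selection intuition can be illustrated directly: if a weight (or activation) matrix has effective dimensionality near $r$, its best rank-$r$ approximation loses almost nothing. A minimal SVD sketch (function name ours; this illustrates the Eckart–Young bound, not the LoRA training procedure itself):

```python
import numpy as np

def lowrank_error(W, r):
    """Relative Frobenius error of the best rank-r approximation of W,
    obtained from the truncated SVD (Eckart-Young theorem)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    approx = (U[:, :r] * s[:r]) @ Vt[:r]        # rank-r reconstruction
    return float(np.linalg.norm(W - approx) / np.linalg.norm(W))
```

For a matrix whose effective rank is 3, the error collapses to numerical zero at $r = 3$ but remains substantial at $r = 2$; choosing an adaptation rank near the measured ID exploits exactly this cliff.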

Complexity and Privacy: ID acts as a suppressor of memorization. High-ID sequences in activation space are less likely to be memorized and reproduced by models, especially in low-duplication regimes. Low-ID (boilerplate or stereotypical) sequences are highly at risk for memorization, especially as model scale increases (Arnold, 11 Jun 2025). This positions ID as a diagnostic for privacy leak risk.

Cognitive and Linguistic Probes: The alignment of ID peaks and complexity markers with formal syntactic operations, and the universality of these signatures across architectures and data conditions, reinforces the utility of ID as a domain-general probe of abstraction and structural computation (Joshi et al., 25 Nov 2025, Baroni et al., 7 Jan 2026).

Quality Control and Diagnostics: Extremely low or high ID can signal degenerate (looping) or incoherent generation, respectively, serving as a filter for output audits (Pedashenko et al., 19 Nov 2025).

5. Limitations, Best Practices, and Future Directions

Estimator Assumptions: Distance-based estimators assume locally homogeneous Poisson processes or isotropic geometry, which may break down in highly anisotropic LLM embeddings. Angle-based (ABID) estimators are more robust to ambient noise and cluster boundaries but still presuppose local uniformity (Thordsen et al., 2020). Piecewise manifold geometries require local or clusterwise estimation.

Hyperparameter Sensitivity: The choice of neighborhood size $k$ and neighbor selection affect local and global ID stability. Recommended practice is to verify that ID plateaus as $k$ is varied (Joshi et al., 25 Nov 2025, Thordsen et al., 2020). For persistent homology and spectral methods, sample size and rank thresholds must be set based on data variance and task (Pedashenko et al., 19 Nov 2025, Özçoban et al., 12 Mar 2025).

Application Scope: ID is not a universal “difficulty” metric: its proper interpretation is always relative to domain, text type, and downstream metrics, and it should be supplemented with entropy and anisotropy measures (Pedashenko et al., 19 Nov 2025). For short texts (under roughly 150 tokens), variance in ID estimates is elevated.

Emergent Directions: Active research is extending ID analysis to training checkpoints (studying “when” and “where” low-dimensional manifolds form), model scale (beyond 100B parameters and Mixture-of-Experts), cross-linguistic variation, topological characterizations beyond manifold dimension, and the mechanistic connection to attention and circuit-level organization (Baroni et al., 7 Jan 2026, Joshi et al., 25 Nov 2025, Pedashenko et al., 19 Nov 2025). The relationship between ID plateaus/compression phases and decisiveness under different architectural or learning interventions remains a major focus.

6. Estimator Summary

| Estimator | Input | Core Formula / Principle | Typical ID Range (LLMs) |
| --- | --- | --- | --- |
| MLE (Levina–Bickel) | Embeddings, hidden states | Local log-ratio of neighbor distances; arithmetic/harmonic mean for global ID | 20–120 (embedding), 30–100 (hidden) (Kataiwa et al., 4 Mar 2025, Joshi et al., 25 Nov 2025) |
| TwoNN (Facco) | Embeddings, hidden states | Pareto fit of 2nd/1st NN distance ratio; CDF fit | 8–30 (per-sequence), 20–100 (layerwise) (Arnold, 11 Jun 2025, Janapati et al., 2024) |
| ABID (angle-based) | Embeddings, hidden states | Second moment of k-NN cosine similarities; $\mathrm{ABID} = 1/\widehat{\mathbb{E}}[s^2]$ | 5–50 (typical local ID for LLMs) (Thordsen et al., 2020) |
| Persistent Homology / PHD | Embeddings, hidden states | Scaling of homology count / minimal spanning tree length | 7–13 (document); stratifies genre (Pedashenko et al., 19 Nov 2025) |
| Ritz–Chebyshev spectral | Embeddings, hidden states | Eigenvalue count for target variance fraction via matrix–vector products | O(10)–O(100); high efficiency (Özçoban et al., 12 Mar 2025) |

Empirical trends include: (i) ID far below the extrinsic dimension throughout; (ii) high redundancy in large models; (iii) domain and genre stratification; (iv) expansion–compression ID profiles per layer; (v) higher ID during ICL vs SFT; (vi) low ID as a marker of memorization and privacy risk.

