Token Embedding Concentration
- Token embedding concentration is a phenomenon where large language models’ token vectors cluster in a small, structured subspace, leading to redundancy and underutilized capacity.
- Empirical analyses reveal that models like Llama-8B use less than 10% of their theoretical embedding capacity, exposing significant information bottlenecks.
- Geometric studies and intrinsic dimension evaluations show that embeddings collapse onto low-dimensional manifolds, enabling aggressive compression and targeted model improvements.
Token embedding concentration refers to the geometric, statistical, and information-theoretic phenomena observed in the distribution and utilization of token embedding vectors in LLMs. Concentration here characterizes how embedding vectors—whether used for input representations, memory compression, parameter keys, or output prediction—tend to occupy a disproportionately small, highly structured subset of the available embedding space, often leading to redundancy, singularities, or underutilization of potential capacity. Recent empirical and theoretical studies have characterized this phenomenon through manifold analysis, information capacity, angular geometry, and the emergence of dominant probability directions. Token embedding concentration has direct implications for efficiency, storage, prompt stability, interpretability, and the fundamental limits of LLM architectures.
1. Information Capacity and Compression Limits
The information-theoretic upper limit of what an embedding vector can encode is determined by its dimension ($d$) and numerical precision ($b$ bits per coordinate). In the noise-free setting, the maximal number of bits storable is $d \cdot b$, and a vector could, in principle, represent up to $d \cdot b / \log_2 |V|$ distinct tokens for a vocabulary $V$ (Kuratov et al., 18 Feb 2025). However, even with brute-force per-sample embedding optimization, empirical studies find that state-of-the-art LLMs (e.g., Llama-3.1-8B) can losslessly reconstruct just over 1,500 tokens from a single vector, a reported two-orders-of-magnitude gap between realized and theoretical capacity. This deficit arises from factors including embedding entanglement, frozen model weights, quantization noise, and the model's inherent uncertainty as measured by cross-entropy loss. The observed concentration thus reflects not merely suboptimal encoding algorithms but deep representational bottlenecks in current architectures.
| Model | Embedding Size (b=16) | Max Tokens (Empirical) | Theoretical Limit | Utilized Capacity |
|---|---|---|---|---|
| Llama-8B | 1 vector, 16-bit | 1,568 | | |
| Pythia-160M | 1 vector, 16-bit | 80 | | |
Measured utilization rates remain a small fraction of the ideal, highlighting vast untapped potential in embedding space design (Kuratov et al., 18 Feb 2025).
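The capacity bound described in this section can be made concrete with a few lines of arithmetic. The sketch below writes the embedding dimension as `d`, the precision as `b` bits per coordinate, and the vocabulary size as `vocab_size`; the function names are illustrative, and the numbers are toy values chosen for easy arithmetic, not measurements from any real model.

```python
import math

def max_bits(d: int, b: int) -> int:
    """Noise-free storage limit of one vector: d coordinates of b bits each."""
    return d * b

def max_tokens(d: int, b: int, vocab_size: int) -> float:
    """Upper bound on tokens encodable in one vector: d * b / log2(|V|)."""
    return max_bits(d, b) / math.log2(vocab_size)

def utilization(empirical_tokens: float, d: int, b: int, vocab_size: int) -> float:
    """Fraction of the theoretical token budget that is actually realized."""
    return empirical_tokens / max_tokens(d, b, vocab_size)

# Toy values (assumed, not from any real model):
# 8 dimensions * 16 bits = 128 bits; each token of a 256-word vocabulary costs 8 bits.
toy_limit = max_tokens(d=8, b=16, vocab_size=256)
toy_util = utilization(4, d=8, b=16, vocab_size=256)
print(toy_limit, toy_util)  # prints: 16.0 0.25
```

Substituting a real model's dimension, precision, and vocabulary size gives the corresponding theoretical limit, against which an empirically reconstructed token count can be compared.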
2. Geometric Structure and Manifold Hypotheses
Early models posited that embeddings are distributed on smooth, low-dimensional manifolds within high-dimensional Euclidean space. However, statistical testing demonstrates widespread and significant violations of manifold and even fiber-bundle hypotheses in real LLM embeddings (Robinson et al., 1 Apr 2025). The fiber-bundle null generalizes the manifold assumption by allowing local signal-plus-noise structure but still mandates that the growth rate of embedding-ball volumes is non-increasing with radius. Empirical rejection rates for this hypothesis—7 tokens in GPT2, 1–2 in Llemma7B and Mistral7B, and dozens for manifold criteria—indicate localized geometric singularities such as prefix tokens, whitespace fragments, or polysemous units.
The existence of these singularities implies that:
- Local neighborhoods of certain tokens are highly irregular, exhibiting upward jumps in local dimension.
- Geodesic paths and metric-based interpretability methods are unreliable in such regions.
- Prompt stability and cross-model portability are undermined when prompts intersect singular tokens.
These findings refute the long-standing assumption that embedding spaces are locally uniform and smoothly structured (Robinson et al., 1 Apr 2025).
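The volume-growth criterion can be illustrated with a crude diagnostic: count neighbors inside balls of increasing radius and look at the slope of log-count against log-radius, which acts as a local dimension estimate at each scale. The sketch below is not the statistical test used in the cited study; the function names, the jump tolerance `tol`, and the synthetic integer-grid data are all assumptions made for illustration.

```python
import math

def neighbor_counts(points, center, radii):
    """Count points (excluding the center itself) within each radius of points[center]."""
    c = points[center]
    dists = [math.dist(c, p) for i, p in enumerate(points) if i != center]
    return [sum(1 for d in dists if d <= r) for r in radii]

def log_log_slopes(radii, counts):
    """Slope of log(count) vs log(radius) between consecutive radii."""
    return [
        (math.log(counts[k + 1]) - math.log(counts[k]))
        / (math.log(radii[k + 1]) - math.log(radii[k]))
        for k in range(len(radii) - 1)
    ]

def has_upward_jump(slopes, tol=0.5):
    """True if the growth rate increases by more than tol between adjacent scales."""
    return any(later - earlier > tol for earlier, later in zip(slopes, slopes[1:]))

radii = [2.5, 5.0, 10.0, 20.0, 40.0]
grid = [(i, j) for i in range(-40, 41) for j in range(-40, 41)]

# Regular case: a flat 2D sheet around the origin (slopes stay near 2).
sheet = grid
sheet_slopes = log_log_slopes(radii, neighbor_counts(sheet, sheet.index((0, 0)), radii))

# Irregular case: a 1D thread near the origin that meets a 2D sheet farther out,
# so the apparent local dimension jumps upward as the radius grows.
thread = [(k, 0) for k in range(-10, 11)]
outer = [p for p in grid if math.hypot(*p) > 10]
mixed = thread + outer
mixed_slopes = log_log_slopes(radii, neighbor_counts(mixed, mixed.index((0, 0)), radii))

print(has_upward_jump(sheet_slopes), has_upward_jump(mixed_slopes))
```

On real embeddings, the same idea would be applied per token with radii chosen from the empirical distance distribution, and a proper hypothesis test would replace the fixed tolerance.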
3. Intrinsic Dimension and Redundancy
Intrinsic Dimension (ID) analysis provides a quantitative estimate of the minimal number of coordinates needed to describe the support of token embeddings in a given space. Using nearest-neighbor estimators, it has been shown that the effective ID of embedding sets is substantially lower than their extrinsic dimension, with redundancy rates increasing as model scale grows (Kataiwa et al., 4 Mar 2025). For example, while input or output embeddings have a large nominal (extrinsic) dimension, the measured ID may be an order of magnitude smaller, especially for larger models.
ID estimates reveal:
- Embedding spaces in both word2vec-like and LLM-type models are collapsed onto low-dimensional submanifolds.
- Rapid ID reduction occurs early in training, consistent with fast emergence of statistical structure.
- ID serves as an empirical bound on optimally effective LoRA rank, and a guideline for pruning and compression.
Thus, redundancy and concentration are fundamental and dynamic properties of embedding spaces, not mere artifacts of initialization or architecture.
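To show what a nearest-neighbor ID estimate looks like in practice, the sketch below implements a maximum-likelihood variant of the TwoNN estimator, which uses only the ratio of each point's second to first nearest-neighbor distance. Whether the cited study uses this exact estimator is not stated here, so treat the choice as an assumption; the synthetic data (a 2D plane linearly embedded in 5D) and all names are illustrative.

```python
import math
import random

def two_nn_id(points):
    """TwoNN-style estimate: N / sum(log(r2 / r1)), where r1 and r2 are each
    point's first and second nearest-neighbor distances."""
    log_ratios = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        r1, r2 = dists[0], dists[1]
        if r1 > 0:  # skip exact duplicates to avoid division by zero
            log_ratios.append(math.log(r2 / r1))
    return len(log_ratios) / sum(log_ratios)

def embed(u, v):
    """Linear map from a 2D latent point into 5D (extrinsic dimension 5, intrinsic 2)."""
    return (u, v, u + v, u - v, 2.0 * u)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
points = [embed(rng.random(), rng.random()) for _ in range(300)]
estimate = two_nn_id(points)
print(f"extrinsic dimension: 5, estimated intrinsic dimension: {estimate:.2f}")
```

The estimate should land near 2 rather than 5, mirroring the gap between nominal and intrinsic dimension described above. This brute-force version is quadratic in the number of points; a vocabulary-sized embedding table would call for an approximate nearest-neighbor index.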
4. Emergence of Low-Dimensional Probability Encodings
Output-token probability information, critical for causal language modeling, is concentrated along a sparse set of directions in the output embedding space. For LLM heads, log-probabilities averaged over contexts can be regressed on a single “frequency direction” (the dot product of the output embedding vector with the negative mean hidden-state direction), with goodness of fit reported as adjusted $R^2$ across various models (Cho et al., 2024). Only 10–20% of embedding dimensions are implicated in this encoding; the remainder can be pruned with negligible impact on output entropy or generation quality.
Key properties include:
- Early in pre-training, token frequency structure is already encoded in embeddings well before overall parameter convergence.
- Editing the small set of informative dimensions enables causal steering of token probabilities at runtime.
- Redundancy in non-probability dimensions opens the door to aggressive embedding compression.
This log-linear, directionally concentrated structure also influences model architecture design, motivating the separation or explicit allocation for probability axes in output heads.
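The frequency-direction regression can be sketched as a one-variable least-squares fit: project each output embedding onto the negative mean hidden state, then regress the averaged log-probabilities on that score. The code below is a minimal illustration under assumed toy data; the function names are hypothetical, and the log-probabilities are constructed to be exactly linear in the score, so the perfect fit is by design rather than an empirical result.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def frequency_scores(output_embeddings, mean_hidden):
    """Dot product of each output embedding with the negative mean hidden state."""
    neg = [-h for h in mean_hidden]
    return [dot(e, neg) for e in output_embeddings]

def fit_line(x, y):
    """Ordinary least squares y ~ slope * x + intercept.
    Returns (slope, intercept, r2, adjusted_r2) for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    r2 = 1.0 - ss_res / ss_tot
    adjusted = 1.0 - (1.0 - r2) * (n - 1) / (n - 2)  # one predictor
    return slope, intercept, r2, adjusted

# Toy data (assumed): four tokens with 2D output embeddings.
embeddings = [[0.5, 1.0], [-1.0, 0.2], [2.0, -0.3], [-0.25, 0.7]]
mean_hidden = [1.0, 0.0]
scores = frequency_scores(embeddings, mean_hidden)
avg_log_probs = [2.0 * s - 3.0 for s in scores]  # synthetic, exactly linear

slope, intercept, r2, adj_r2 = fit_line(scores, avg_log_probs)
print(slope, intercept, r2, adj_r2)
```

With real model outputs, `avg_log_probs` would come from averaging the log-probability of each token over many contexts, and the adjusted value would quantify how much of that variation the single direction explains.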
5. Angular Geometry, Mutual Coherence, and Architectural Implications
Token embedding concentration is manifest in angular geometry: embeddings often exhibit high pairwise cosine similarity or align into narrow cones within the space. Standard FFN up-projection matrices (e.g., in SwiGLU layers) produce content keys that are highly concentrated, lying in low-dimensional subspaces and exhibiting substantial mutual coherence (Sadhukhan et al., 15 Jan 2026). This superposition effect clamps down on the effective number of orthogonal “memory slots,” impeding rare token retrieval and limiting parametric storage.
Novel architectures such as STEM (Scaling Transformers with Embedding Modules) directly address this by replacing dense up-projections with token-indexed, unit-normalized embeddings that maximize angular spread. Empirically, STEM reduces mean cosine similarity to near zero (implying near-orthogonality), enhances knowledge storage, and substantially boosts performance on long-context and knowledge-demanding benchmarks; its token-indexed embeddings can be interpreted as “micro-experts.” Crucially, increased angular spread mitigates parameter interference and supports efficient, interpretable knowledge editing (Sadhukhan et al., 15 Jan 2026).
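The two angular quantities discussed in this section, mean pairwise cosine similarity and mutual coherence (the largest absolute cosine between distinct vectors), are simple to compute. The sketch below uses hand-picked toy vectors and illustrative function names; it is a measurement utility only and does not reproduce any part of the STEM architecture.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two nonzero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.hypot(*a) * math.hypot(*b))

def angular_stats(vectors):
    """Return (mean pairwise cosine, mutual coherence) over all distinct pairs."""
    sims = [cosine(a, b) for a, b in combinations(vectors, 2)]
    return sum(sims) / len(sims), max(abs(s) for s in sims)

# Well-spread keys: an orthonormal basis has zero mean cosine and zero coherence.
spread = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Concentrated keys: small perturbations of one direction form a narrow cone.
cone = [[1.0, 0.1, 0.0], [1.0, 0.0, 0.1], [1.0, -0.1, 0.0]]

print(angular_stats(spread))
print(angular_stats(cone))
```

A coherence close to 1 means at least two keys are nearly indistinguishable by angle, which is the interference regime described above; values near 0 indicate nearly orthogonal slots.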
6. Dynamics of Concentration: Emergence and Optimization
Concentration phenomena arise as a consequence of both the optimization objective and the implicit biases of gradient descent. Theoretical analysis of shallow attention architectures reveals that, after the first gradient step, token embeddings align with the regression/output vector in proportion to their signed occurrence frequency in the data (Wu et al., 22 May 2025). Subsequently, attention-selection directions converge to maximize separation (margin) between informative and non-informative tokens, reflecting sparse “hard-attention” effects seen in LLMs. Empirical results on IMDB and Yelp datasets show that cosine alignment with output vectors correlates linearly with the predictiveness of tokens, and that softmax attention mass concentrates on a few key positions after training.
This dynamic process cements the dual concentration along frequency-informed and attention-selection directions, suggesting that the observed phenomena are not incidental but core to the statistical mechanics of deep attention models (Wu et al., 22 May 2025).
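The first-step alignment can be reproduced in a deliberately simplified setting. The sketch below assumes zero-initialized embeddings, uniform averaging over tokens (what softmax attention reduces to when its parameters are zero), a fixed output vector `v`, and logistic loss; these are illustrative assumptions and not the exact setup of the cited analysis. Under them, with prediction f = v · mean(E[tokens]) and E = 0, the loss derivative is dL/df = -y/2, so one gradient step gives E[t] = lr · 0.5 · s_t · v, where s_t is the signed frequency of token t.

```python
from collections import Counter

def signed_frequency(dataset, vocab_size):
    """Mean over examples of label * (relative count of the token in the example)."""
    sf = [0.0] * vocab_size
    for tokens, label in dataset:
        for t, c in Counter(tokens).items():
            sf[t] += label * c / len(tokens)
    return [s / len(dataset) for s in sf]

def first_step_embeddings(dataset, vocab_size, v, lr=1.0):
    """Embeddings after one gradient-descent step from zero initialization,
    using the closed form E[t] = lr * 0.5 * signed_frequency[t] * v."""
    sf = signed_frequency(dataset, vocab_size)
    return [[lr * 0.5 * s * vi for vi in v] for s in sf]

# Toy data (assumed): token 0 appears only with label +1, token 1 only with -1,
# and token 2 appears equally often under both labels.
dataset = [
    ([0, 2], +1),
    ([0, 0, 2], +1),
    ([1, 2], -1),
    ([1, 1, 2], -1),
]
v = [1.0, -2.0]
E = first_step_embeddings(dataset, vocab_size=3, v=v)
for token, emb in enumerate(E):
    print(token, emb)
```

The positively associated token ends up parallel to `v`, the negatively associated token antiparallel, and the uninformative token stays at (numerically) zero, which is the proportionality to signed frequency described above.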
7. Practical Implications and Open Questions
The concentration of token embeddings has wide-ranging consequences:
- Prompt Stability and Portability: Singularities and discretized strata in embedding space lead to heteroscedastic output and non-portable prompts across models and tokenizations (Robinson et al., 1 Apr 2025).
- Compression: Embedding pruning and structured compression are enabled by low intrinsic dimension and sparse probability axes, with empirical evidence supporting safe removal of >30% of embedding dimensions without loss (Cho et al., 2024).
- Interpretability and Editing: Token-indexed and angularly spread embeddings allow for targeted knowledge editing, semantic diagnostics, and modular expansion (e.g., in STEM (Sadhukhan et al., 15 Jan 2026)).
- Model Design: Understanding and mitigating concentration informs choices in embedding dimension, training protocols, up-/down-projection structures, and memory slot allocation.
Open questions remain concerning the optimal trade-offs between spread and redundancy, the role of concentration in emergent reasoning and abstraction, and the further development of architectures or optimization routines that fully harness the theoretical capacity of embedding spaces (Kuratov et al., 18 Feb 2025, Sadhukhan et al., 15 Jan 2026).
Key references:
- Kuratov et al., "Cramming 1568 Tokens into a Single Vector and Back Again" (18 Feb 2025)
- Robinson et al., "Token embeddings violate the manifold hypothesis" (1 Apr 2025)
- Kataiwa et al., "Measuring Intrinsic Dimension of Token Embeddings" (4 Mar 2025)
- Cho et al., "Understanding Token Probability Encoding in Output Embeddings" (2024)
- Sadhukhan et al., "STEM: Scaling Transformers with Embedding Modules" (15 Jan 2026)
- Wu et al., "Attention with Trained Embeddings Provably Selects Important Tokens" (22 May 2025)