Anisotropy in Embedding Representations
- Anisotropy in embedding representations is the phenomenon where high-dimensional vectors collapse into a narrow cone, evident from high cosine similarities and skewed eigenvalue distributions.
- Empirical studies across NLP, speech, vision, and protein models demonstrate that anisotropy affects clustering, retrieval, and classification, with key metrics such as Avg Cosine, IsoScore, and SVD ratios quantifying its extent.
- Mitigation strategies like PCA postprocessing, whitening transforms, and contrastive regularization help adjust the embedding space to balance task-specific structure with improved isotropy.
Anisotropy in embedding representations denotes the empirical phenomenon in which high-dimensional vectors generated by neural models—particularly Transformers—fail to occupy all directions of their ambient space uniformly, instead collapsing into a narrow, low-dimensional cone. This concentrated geometry results in high average pairwise cosine similarity, even for unrelated items, and influences multiple aspects of learning, transfer, and downstream task performance in natural language processing, speech, vision, protein, and code domains.
1. Formal Characterization and Quantification
Anisotropy is operationally defined as a deviation from isotropy, where the ideal is a uniform spread of embeddings across all directions in the vector space. The main diagnostic tools and metrics include:
- Average pairwise cosine similarity: For a set of vectors $\{x_i\}_{i=1}^{N}$,
$$A = \frac{1}{N(N-1)} \sum_{i \neq j} \frac{x_i^\top x_j}{\|x_i\|\,\|x_j\|}.$$
An isotropic space yields $A \approx 0$; anisotropic spaces show $A \gg 0$.
- Eigenvalue or singular value ratio: For an embedding matrix $X$, SVD or PCA yields singular values $\sigma_1 \geq \dots \geq \sigma_d$ or eigenvalues $\lambda_1 \geq \dots \geq \lambda_d$. The anisotropy can be quantified as $\sigma_1 / \sigma_d$ (or, analogously, $\lambda_1 / \lambda_d$ for the covariance matrix), with higher ratios indicating stronger anisotropy.
- IsoScore: A normalized, covariance-based metric proposed to capture the number of utilized dimensions, ranging from 1 (perfect isotropy) to 0 (complete anisotropy) (Rudman et al., 2021).
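The first two diagnostics in the list above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not code from any of the cited papers; the function names `average_pairwise_cosine` and `singular_value_ratio` are hypothetical, and mean-centering before the SVD is an assumption that matches the PCA convention.

```python
import numpy as np

def average_pairwise_cosine(X: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows of X (shape N x d)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    U = X / np.clip(norms, 1e-12, None)       # unit-normalize each row
    S = U @ U.T                                # N x N matrix of cosines
    n = X.shape[0]
    # Exclude the diagonal (self-similarity) from the average.
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

def singular_value_ratio(X: np.ndarray) -> float:
    """Largest over smallest singular value of the mean-centered matrix."""
    Xc = X - X.mean(axis=0, keepdims=True)     # assumption: center first, as in PCA
    s = np.linalg.svd(Xc, compute_uv=False)    # returned in descending order
    return float(s[0] / max(s[-1], 1e-12))

rng = np.random.default_rng(0)
iso = rng.standard_normal((500, 32))           # roughly isotropic Gaussian cloud
shifted = iso + 5.0                            # same cloud plus a common offset
stretched = iso * np.linspace(10.0, 0.1, 32)   # very uneven variance per axis

print("avg cosine, isotropic:", average_pairwise_cosine(iso))
print("avg cosine, shifted:  ", average_pairwise_cosine(shifted))
print("SV ratio, isotropic:  ", singular_value_ratio(iso))
print("SV ratio, stretched:  ", singular_value_ratio(stretched))
```

The two metrics respond to different failure modes: a common offset inflates the average cosine but disappears after centering, while uneven per-axis variance inflates the singular value ratio without necessarily moving the average cosine.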
Anisotropy is also detected using principal component analysis (proportion of variance explained by top PCs), partition-function scores, and persistent entropy computed from Vietoris–Rips filtrations (Kudriashov et al., 9 Jan 2025).
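For the IsoScore metric listed above, a minimal sketch is given below. It assumes the six-step procedure described by Rudman et al. (PCA reorientation, per-axis variances, normalization to length $\sqrt{n}$, isotropy defect, fraction of utilized dimensions, linear rescaling to $[0, 1]$); the function name `isoscore` and the use of covariance eigenvalues as the per-axis variances are choices made for this illustration and should be checked against the authors' reference implementation.

```python
import numpy as np

def isoscore(X: np.ndarray) -> float:
    """IsoScore sketch for a point cloud stored as rows of X (shape N x n, n >= 2)."""
    n = X.shape[1]
    # Steps 1-2: after PCA reorientation the covariance is diagonal, and its
    # diagonal equals the eigenvalues of the original covariance matrix.
    cov = np.cov(X, rowvar=False)
    var = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    # Step 3: rescale the variance vector to have Euclidean norm sqrt(n).
    var_hat = np.sqrt(n) * var / np.linalg.norm(var)
    # Step 4: isotropy defect, the normalized distance to the all-ones vector.
    delta = np.linalg.norm(var_hat - 1.0) / np.sqrt(2.0 * (n - np.sqrt(n)))
    # Step 5: fraction of dimensions that are uniformly utilized.
    phi = (n - delta**2 * (n - np.sqrt(n))) ** 2 / n**2
    # Step 6: rescale so the score lies in [0, 1].
    return float((n * phi - 1.0) / (n - 1.0))

rng = np.random.default_rng(0)
gaussian_cloud = rng.standard_normal((2000, 16))                   # all 16 axes used
collapsed_line = np.outer(rng.standard_normal(2000), np.ones(16))  # one direction only

print("IsoScore, Gaussian cloud:", isoscore(gaussian_cloud))
print("IsoScore, collapsed line:", isoscore(collapsed_line))
```

Under these assumptions, the product `n * phi` can be read as an estimate of the number of effectively utilized dimensions.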
2. Empirical Findings Across Architectures and Modalities
Anisotropy is exhibited pervasively across pre-trained transformer models, regardless of modality:
- NLP (token/subword/byte models): BERT, RoBERTa, GPT-2, and T5 show average cosine values rising from about 0.1 in early layers to about 0.7 in deeper layers (Godey et al., 2023, Godey et al., 2024). Character-level (CANINE-c, MANTa, ByT5) and byte-level models also display high anisotropy, discrediting the hypothesis that it arises solely from rare-token drift due to cross-entropy on long-tailed distributions.
- Speech: Pretrained models such as wav2vec2 and HuBERT present average layerwise cosine similarities up to 0.9–1.0 in top layers, with global anisotropy corresponding to an average cosine of about 0.54. Nevertheless, high anisotropy does not preclude fine phonetic discrimination in tasks such as keyword spotting, where Dynamic Time Warping (DTW) methods exploit subtle variations in similarity for robust query retrieval (Wisniewski et al., 6 Jun 2025).
- Vision and Proteins: Vision Transformers (ViT, BEiT) exhibit mid-layer anisotropy; convolutional nets remain mostly isotropic except for early ResNet blocks. In protein LMs, ProtBERT and ProtXLNet use only 2–3 effective global dimensions out of 1024, confirmed by IsoScore (≈ 0.0015). Multi-modal protein LMs (ProteinBERT), incorporating sequence and ontology, use about 120 effective dimensions (IsoScore ≈ 0.23) (Hakim et al., 12 Oct 2025).
- Transformers vs. CNNs: Anisotropy is far more pronounced in transformer-based models than in convolutional architectures, which generally remain isotropic, except in rare cases and early blocks (Godey et al., 2023).
3. Underlying Causes: Self-Attention and Training Dynamics
Research demonstrates that anisotropy is not merely a side effect of objective function or token distribution but is inherent in the geometry induced by transformer self-attention (Godey et al., 2023, Godey et al., 2024).
- Mechanism: Self-attention layers amplify any shared mean in their queries and keys. If the mean query and key vectors are large in norm, the query-key dot product is nearly constant across tokens, leading the attention softmax to become sharp (low-entropy) and almost identical across tokens, which collapses output embeddings towards a single direction.
- Empirical validation: Adding a fixed bias to an untrained transformer block causes output norms and mean cosines to stabilize at a high value (fixed point), confirming the amplification of bias by random self-attention (Godey et al., 2023). This effect persists across vision and speech transformers, where neither token frequency nor mean drift fully explains the observed anisotropy.
- Training dynamics: Encoder anisotropy remains flat across layers; decoders display a bell-shaped anisotropy curve, peaking in the middle and declining towards the output. During training, embedding intrinsic dimension first expands, then sharply compresses, coinciding with increasing anisotropy (Razzhigaev et al., 2023).
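The bias-amplification mechanism described above can be illustrated with a toy experiment. The sketch below is not a reproduction of the cited experiments: it uses a single attention head with random weights, no residual connection, and no layer normalization, and every name in it (`self_attention`, `mean_cosine`, the bias magnitude of 10) is an assumption chosen for illustration.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Row-wise softmax with max subtraction for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the rows of X."""
    d = X.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(d))   # attention weights, one row per query token
    return A @ V

def mean_cosine(X: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    n = X.shape[0]
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

rng = np.random.default_rng(0)
n_tokens, d = 64, 32
# Untrained weights, scaled so projections roughly preserve vector norms.
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
X = rng.standard_normal((n_tokens, d))   # zero-mean token representations
bias = 10.0 * np.ones(d)                 # fixed offset shared by every token

cos_plain = mean_cosine(self_attention(X, Wq, Wk, Wv))
cos_biased = mean_cosine(self_attention(X + bias, Wq, Wk, Wv))
print("mean cosine of outputs, no bias:    ", cos_plain)
print("mean cosine of outputs, shared bias:", cos_biased)
```

In this toy setting the shared offset acts in two ways: it adds a common component to every value vector, and it makes the attention rows nearly identical, so all outputs end up close to one direction.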
4. Impact on Information Structure, Clustering, and Tasks
Anisotropy interacts fundamentally with clustering, linear classification, and retrieval:
- Clustering vs. isotropy: Theoretical analysis confirms that isotropy (uniform pairwise distances) is at odds with the formation of compact clusters necessary for classification. Silhouette scores (clustering objective) increase as isotropy (measured by IsoScore) decreases, with a strong negative Spearman correlation across multiple tasks (Mickus et al., 2024). Optimal cluster separation (necessary for sharp decision boundaries) requires breaking isotropy.
- Task-specific effects:
- Similarity and retrieval tasks: Graded similarity judgments and cross-lingual retrieval benefit from isotropy; clusters would distort continuous similarity scales.
- Classification and clustering: Anisotropy is favorable, as clusters and their prototypes require variance concentration in certain directions.
- Sense disambiguation: Highly anisotropic spaces inadequately separate different senses of a given lexeme. Removal of principal components (as in LASeR) and sense retrofitting enhance both isotropy and sense-cluster discrimination (Bihani et al., 2021).
- Fine-tuning effects: Fine-tuning does not necessarily increase isotropy. In fact, the principal, elongated directions of the embedding space absorb critical task knowledge after fine-tuning. Isotropy enhancement (e.g., top-PC removal), beneficial during pre-training, becomes counterproductive, erasing task-specific structure (Rajaee et al., 2021).
- Speech: Even at extreme anisotropy (cosine similarities close to 1), fine phonetic nuances are encoded in small residual variations, allowing effective zero-shot keyword spotting (Wisniewski et al., 6 Jun 2025).
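The claim that fine distinctions survive extreme anisotropy can be made concrete with synthetic vectors. The sketch below is a hedged illustration only: it assumes embeddings of the form "large shared offset plus small class-specific residual plus noise", which is a simplification and not the setup of the cited speech experiments, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, d = 10, 64

offset = 20.0 * np.ones(d)                         # dominant shared direction (cone axis)
prototypes = rng.standard_normal((n_classes, d))   # small class-specific residuals
keys = offset + prototypes + 0.1 * rng.standard_normal((n_classes, d))
queries = offset + prototypes + 0.1 * rng.standard_normal((n_classes, d))

def unit(X: np.ndarray) -> np.ndarray:
    """Scale every row to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

sims = unit(queries) @ unit(keys).T    # cosine of every query against every key
min_cosine = float(sims.min())         # even mismatched pairs are almost parallel
# Query i should retrieve key i, since both share prototype i.
accuracy = float((sims.argmax(axis=1) == np.arange(n_classes)).mean())

print("smallest cosine between any query and key:", min_cosine)
print("nearest-neighbour retrieval accuracy:     ", accuracy)
```

All similarities sit in a very narrow band near 1, yet the ranking inside that band still identifies the matching item, which is the kind of residual signal that similarity-based retrieval can exploit.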
5. Mitigation Strategies: Postprocessing, Regularization, and Architecture
The mitigation of anisotropy is approached through a variety of postprocessing and regularization schemes:
- Global centering and PCA projection: Subtracting the mean vector and removing leading principal components ("all-but-the-top") achieve near-isotropy in many static and contextual models, improving sentence similarity and retrieval scores (Rajaee et al., 2021, Bihani et al., 2021).
- Cluster-based whitening: K-means clustering followed by local removal of dominant PCs further enhances isotropy, particularly for multilingual and polysemous data, and yields consistent performance improvements on STS and cross-lingual tasks (Rajaee et al., 2021, Rajaee et al., 2021, Hämmerl et al., 2023).
- Whitening transforms: ZCA and Soft-ZCA whitening (with a spectral stabilizer) are computationally efficient, model-agnostic procedures that balance variance preservation and isotropy. Tuning the whitening parameter $\epsilon$ allows fine control over the degree of isotropy, resulting in strong gains for semantic code search and retrieval (Diera et al., 2024).
- Persistent entropy regularization: Maximizing the persistent entropy of Vietoris–Rips filtrations in mini-batched embeddings flattens the principal component spectrum, increasing isotropy and improving downstream classification accuracy, without increasing inference cost (Kudriashov et al., 9 Jan 2025).
- Iterative normalization: Alternating mean-centering and length-projection (iterative normalization) efficiently enforces isotropy, critical for cross-lingual mapping and dependency parsing by normalizing cone orientations between independent spaces (Xu et al., 2021, Xu et al., 2021).
- Contrastive and spectrum-control losses: In pretraining, contrastive losses and explicit penalization of top eigenvalues (spectrum control) can regularize embedding variance but often degrade downstream task performance if enforced indiscriminately (Ding et al., 2021, Godey et al., 2024).
- Local isotropy focus: There is substantial evidence that transformer embedding spaces are already locally isotropic within clusters (e.g., for the same word type or POS), and that global anisotropy is an artifact of a few dominant directions (Ding et al., 2021). Interventions should preserve or exploit local structure instead of enforcing global uniformity.
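Three of the postprocessing schemes listed above are simple enough to sketch directly. The code below is a minimal illustration under stated assumptions, not the implementation from any cited paper: the function names, the default of two removed components, the default `eps`, and the fixed iteration count are all hypothetical choices, and the Soft-ZCA variant assumes the stabilizer is added to the covariance eigenvalues before the inverse square root.

```python
import numpy as np

def mean_cosine(X: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    n = X.shape[0]
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

def all_but_the_top(X: np.ndarray, n_remove: int = 2) -> np.ndarray:
    """Subtract the mean, then project out the n_remove leading principal directions."""
    Xc = X - X.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:n_remove]                  # orthonormal rows, shape (n_remove, d)
    return Xc - (Xc @ top.T) @ top

def soft_zca_whiten(X: np.ndarray, eps: float = 1e-2) -> np.ndarray:
    """ZCA whitening with eps added to the eigenvalues (eps near 0 gives plain ZCA)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative values
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

def iterative_normalization(X: np.ndarray, n_iter: int = 5) -> np.ndarray:
    """Alternate projection to unit length and mean-centering for n_iter rounds."""
    Y = X.copy()
    for _ in range(n_iter):
        Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
        Y = Y - Y.mean(axis=0, keepdims=True)
    return Y

rng = np.random.default_rng(0)
# Synthetic anisotropic cloud: uneven per-axis variance plus a common offset.
X = rng.standard_normal((400, 32)) * np.linspace(3.0, 0.5, 32) + 5.0

for name, Y in [("raw", X),
                ("all-but-the-top", all_but_the_top(X)),
                ("soft-ZCA", soft_zca_whiten(X)),
                ("iterative normalization", iterative_normalization(X))]:
    print(f"{name:>24}: mean cosine = {mean_cosine(Y):.4f}")
```

A larger `eps` keeps more of the original variance structure, while a smaller one pushes the covariance closer to the identity; this is the trade-off that the whitening parameter is meant to control.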
6. Interpretive Insights and Practical Guidelines
The literature points to a nuanced relationship between anisotropy, task objectives, and model design:
- Isotropy and utility: No single geometry is optimal across all tasks. For semantic similarity and retrieval, maximizing isotropy neutralizes hubness and enhances score interpretability. For classification and tasks demanding crisp boundaries, fostering useful anisotropy is recommended.
- Role of fine-tuning: During task adaptation, models re-purpose dominant PCs to encode task-relevant information. Isotropy regularization is most effective before this repurposing occurs, and potentially harmful post hoc (Rajaee et al., 2021).
- Managing outlier/rogue dimensions: A small number of outlier axes, with mean activation far from zero, often account for the bulk of anisotropy; their detection and zeroing is a lightweight but partial remedy (Hämmerl et al., 2023, Rajaee et al., 2021).
- Cross-lingual and multi-modal alignment: Anisotropy—especially conical collapse and language-wise "patches" on the sphere—obstructs orthogonal mapping between independently trained embedding spaces. Normalization post-processing (e.g., iterative normalization) restores compatibility and alignment (Xu et al., 2021).
- Model development: Future transformer architectures or objectives may require explicit mechanisms to decouple sharp attention patterns from global geometric collapse—potentially through learned scaling, offsetting, or architectural changes in attention computation (Godey et al., 2023, Godey et al., 2024).
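The outlier-dimension remedy mentioned above can be sketched as follows. The detection rule used here (flagging axes whose absolute mean activation lies more than a few standard deviations above the average across axes) is an assumption made for this illustration, as are the names `find_rogue_dimensions` and `zero_dimensions` and the threshold of 3; the cited works may use different criteria.

```python
import numpy as np

def mean_cosine(X: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    n = X.shape[0]
    return float((S.sum() - np.trace(S)) / (n * (n - 1)))

def find_rogue_dimensions(X: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """Indices of axes whose absolute mean activation is an outlier across axes."""
    abs_mean = np.abs(X.mean(axis=0))
    z = (abs_mean - abs_mean.mean()) / abs_mean.std()
    return np.flatnonzero(z > z_thresh)

def zero_dimensions(X: np.ndarray, dims: np.ndarray) -> np.ndarray:
    """Return a copy of X with the given coordinates set to zero."""
    Y = X.copy()
    Y[:, dims] = 0.0
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))
X[:, 7] += 30.0     # synthetic outlier axis with a large positive mean
X[:, 41] -= 25.0    # synthetic outlier axis with a large negative mean

rogue = find_rogue_dimensions(X)
cos_before = mean_cosine(X)
cos_after = mean_cosine(zero_dimensions(X, rogue))
print("flagged dimensions:", rogue.tolist())
print("mean cosine before zeroing:", cos_before)
print("mean cosine after zeroing: ", cos_after)
```

Zeroing discards whatever information those axes carried, which is one reason this remedy is described as lightweight but partial.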
7. Summary Table: Anisotropy Metrics and Empirical Ranges
| Metric | Isotropic Value | Typical Transformer Value | References |
|---|---|---|---|
| Avg Cosine | ≈ 0 | 0.4–0.8 (mid/deep layer) | (Godey et al., 2023, Godey et al., 2024) |
| IsoScore | ≈ 1 | 0.001–0.18 | (Rudman et al., 2021, Hakim et al., 12 Oct 2025) |
| SVD Ratio | ≈ 1 | 10–1000 (mid layers) | (Razzhigaev et al., 2023, Rajaee et al., 2021) |
| Effective Dims | full dimensionality | 2–14 (protein LMs) | (Hakim et al., 12 Oct 2025) |
| Bell-shaped Layer Pattern | – | True for decoders, flat for encoders | (Razzhigaev et al., 2023) |
These empirical ranges show how far high-dimensional embedding spaces collapse into narrow, task-driven manifolds. The available interventions, from global and local whitening to spectral and entropy-based regularization, require precise, application-aware tuning.
Key references: (Godey et al., 2023, Godey et al., 2024, Rudman et al., 2021, Hakim et al., 12 Oct 2025, Razzhigaev et al., 2023, Rajaee et al., 2021, Rajaee et al., 2021, Xu et al., 2021, Mickus et al., 2024, Kudriashov et al., 9 Jan 2025, Diera et al., 2024, Rajaee et al., 2021, Wisniewski et al., 6 Jun 2025, Hämmerl et al., 2023, Bihani et al., 2021).