Collapsed [CLS] Embeddings
- Collapsed [CLS] embeddings are defined as the convergence of [CLS] token outputs into a narrow, near-identical subspace, limiting semantic discrimination across diverse inputs.
- Empirical diagnostics using pairwise cosine similarity, PCA, and effective rank reveal that models like BERT, RoBERTa, and XLM-RoBERTa exhibit near-total variance concentration and redundancy in [CLS] representations.
- Mitigation strategies such as JEPA alignment, LMK pooling, and dimensionality reduction aim to restore representational diversity and enhance downstream task performance.
Collapsed [CLS] embeddings refer to a well-documented phenomenon in Transformer-based LLMs whereby the special [CLS] token’s output vector, after fine-tuning or in standard BERT-like architectures, occupies a narrow, highly redundant subspace—sometimes essentially a single principal direction or very few dimensions—regardless of the semantic content of the underlying input. This collapse undermines both the expressiveness and the downstream transferability of the [CLS] representation, particularly across languages or in long-context settings, and is consistently observed in empirical studies, theoretical work, and practical deployment analyses (Gillin et al., 1 Jan 2026, Matton et al., 2019, Doshi et al., 29 Jan 2026, Wu et al., 22 May 2025).
1. Characterization and Diagnostics of Collapsed [CLS] Embeddings
Collapse is rigorously defined as the situation in which the [CLS] vectors for diverse input sequences—semantically unrelated or across disparate languages—exhibit exceedingly high mutual similarity and reside within a low-dimensional manifold. This is quantifiable using several diagnostic metrics:
- Pairwise Cosine Similarity: In collapsed spaces, cosine similarity between [CLS] vectors stays close to 1 (roughly 0.99 or higher), even for completely unrelated sentence pairs (Gillin et al., 1 Jan 2026).
- Principal Component Analysis (PCA): Collapsed embeddings exhibit a spectrum dominated by a single principal component, which explains roughly 47–78% of the total variance: 47% in RoBERTa-Base and 78% in XLM-RoBERTa-Base (Gillin et al., 1 Jan 2026). Fine-tuned BERT [CLS] vectors show a similar concentration of variance in a few top components (Matton et al., 2019).
- Effective Rank (e.g., RankMe): Spaces with collapse have effective rank ≈1–2, indicating only 1–2 directions carry nearly all informational content (Gillin et al., 1 Jan 2026).
- t-SNE and Isotropy Diagnostics: Embedding visualizations cluster tightly according to language, with little or no overlap between translation equivalents (Gillin et al., 1 Jan 2026).
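The first three diagnostics above can be computed directly from a matrix of [CLS] vectors. The following is a minimal sketch, not the evaluation code of any cited paper: the helper name `collapse_diagnostics` is hypothetical, the effective rank is assumed to follow the RankMe recipe (exponential of the entropy of the normalized singular values), and the data are synthetic stand-ins for real model outputs.

```python
import numpy as np

def collapse_diagnostics(cls_vectors):
    """Return (mean pairwise cosine, PC1 variance share, effective rank).

    cls_vectors: array of shape (n_sentences, hidden_dim), one [CLS] vector per row.
    """
    X = np.asarray(cls_vectors, dtype=np.float64)
    n = X.shape[0]

    # Mean off-diagonal cosine similarity on the raw (uncentered) vectors.
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = unit @ unit.T
    mean_cos = (cos.sum() - np.trace(cos)) / (n * (n - 1))

    # PCA via SVD of the centered matrix: squared singular values are
    # proportional to the variance along each principal component.
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    var = s ** 2
    pc1_share = var[0] / var.sum()

    # RankMe-style effective rank (assumed definition): exp of the Shannon
    # entropy of the normalized singular values of the uncentered matrix.
    sv = np.linalg.svd(X, compute_uv=False)
    p = sv / sv.sum() + 1e-12
    eff_rank = np.exp(-(p * np.log(p)).sum())

    return float(mean_cos), float(pc1_share), float(eff_rank)

# Synthetic data (an assumption, not real model output): a "collapsed" set
# that shares a large offset and varies along one direction, and a "spread"
# set drawn from an isotropic Gaussian.
rng = np.random.default_rng(0)
dim = 64
offset = rng.normal(size=dim)
offset *= 10.0 / np.linalg.norm(offset)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
t = 0.5 * rng.normal(size=(200, 1))
collapsed = offset + t * direction + 0.01 * rng.normal(size=(200, dim))
spread = rng.normal(size=(200, dim))

cos_c, pc1_c, rank_c = collapse_diagnostics(collapsed)
cos_s, pc1_s, rank_s = collapse_diagnostics(spread)
print("collapsed:", cos_c, pc1_c, rank_c)
print("spread:   ", cos_s, pc1_s, rank_s)
```

On real models, `cls_vectors` would be the stacked [CLS] outputs for a varied sentence sample. Whether a given study centers the data before PCA or before computing effective rank is not stated in this article, so the two choices made here are assumptions.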
2. Theoretical Explanations for [CLS] Collapse
The root mechanisms driving collapse differ depending on architecture and training strategy:
- Softmax-Attention Models: In a one-layer softmax-attention model with a trainable [CLS] vector and a word embedding table, gradient descent with logistic loss rapidly aligns the word embeddings with the output head vector in proportion to token-class correlations, and gradient flow on the [CLS] vector maximizes the selection margin for important features. Training converges to a “winner-takes-all” regime in which [CLS] attends almost exclusively to highly predictive tokens, saturating the representation and forcing collapse (Wu et al., 22 May 2025).
- Transformer Pooling and RoPE: In BERT-like architectures using rotary positional embeddings (RoPE), the positional bias strongly attenuates attention weights as distance from [CLS] increases, restricting information aggregation to initial tokens. As a result, [CLS] output concentrates on the initial document span, ignoring distributed evidence and collapsing the representation for long or information-dispersed sequences (Doshi et al., 29 Jan 2026).
- Fine-tuning for Classification: Fine-tuned models often segregate class decision information into a very low-dimensional subspace. PCA and single-dimension “salient neuron” analyses show that after fine-tuning, the [CLS] output nearly always shrinks onto a subspace whose dimensionality is close to the number of classes, amplifying collapse (Matton et al., 2019).
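The “winner-takes-all” behaviour described for softmax attention can be illustrated with a toy calculation. This sketch shows only the saturation of the softmax as the [CLS] query grows in scale, not the gradient dynamics analysed by Wu et al.; the alignment scores and the scale values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical alignment scores between the [CLS] query and four tokens;
# the third token stands in for the most class-predictive one.
base_scores = [0.2, 0.1, 1.0, 0.3]

def attention_at_scale(scale):
    """Attention weights when every score is multiplied by `scale`."""
    return softmax([scale * s for s in base_scores])

for scale in (1.0, 5.0, 25.0):
    weights = attention_at_scale(scale)
    print(scale, [round(w, 3) for w in weights])
```

As the scale grows, almost all attention mass moves to the highest-scoring token, so the [CLS] output approaches a copy of that single token's value vector regardless of the rest of the input.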
3. Empirical Manifestations and Impact
Empirical studies demonstrate pervasive [CLS] collapse across standard and multilingual BERT-style models:
- Quantitative Benchmarks:
  - RoBERTa/XLM-RoBERTa: Cosine similarity $0.991$–$0.998$ between [CLS] vectors of unrelated and translated sentence pairs; first PCA component explains 47–78% of variance (Gillin et al., 1 Jan 2026).
  - Pairwise cosine trends remain high for positive pairs and only decline for negatives after applying corrective objectives (Gillin et al., 1 Jan 2026).
- Long-Context and Multilingual Performance:
  - For long documents, [CLS] attention weights are heavily local, causing under-representation of distributed evidence; retrieval and classification metrics degrade as input length increases (Doshi et al., 29 Jan 2026).
  - In multilingual settings, collapsed [CLS] embeddings result in language-specific clusters with poor cross-lingual alignment, impeding zero-shot transfer (Gillin et al., 1 Jan 2026).
| Diagnostic | Typical Value (Collapsed) | Source |
|---|---|---|
| Cosine (unrelated) | 0.995–0.998 | (Gillin et al., 1 Jan 2026) |
| PCA PC1 var. | 47% (RoBERTa), 78% (XLM-RoBERTa) | (Gillin et al., 1 Jan 2026) |
| Effective Rank | 1–2 | (Gillin et al., 1 Jan 2026) |
4. Mitigation Strategies and Remedies
Several strategies have been proposed to counteract [CLS] collapse, each targeting different underlying mechanisms:
- Joint Embedding Predictive Architectures (JEPA): In BERT-JEPA (BEPA), a JEPA-style InfoNCE alignment loss is added during fine-tuning. This decorrelates [CLS] vectors for unrelated pairs and incentivizes language-invariant, semantically meaningful structure. Empirically, this reduces same/different-language unrelated cosine similarities from ≈0.99 to ≈0.50–0.60 for negatives, lowers the variance explained by PC1 from up to 78% to ≈34%, and raises embedding effective rank from ≈1–2 to ≈5–10 (Gillin et al., 1 Jan 2026).
- Landmark (LMK) Pooling: By partitioning the input sequence into chunks and pooling over multiple landmark ([SEP]) tokens, LMK sidesteps the positional bias inherent in RoPE and distributes representation capacity across the document. This method yields improved retrieval performance, especially for long input sequences, and recovers global coverage without sacrificing local salient evidence (Doshi et al., 29 Jan 2026).
- Compression via PCA and Salient Neuron Pruning: For models where class information is collapsed into a low-dimensional manifold, projecting [CLS] vectors onto the top $k$ principal components (with $k$ on the order of the number of classes), or even selecting the top $k$ most salient coordinates, retains near-optimal downstream accuracy while removing vast redundancy (a loss of about 0.5 pp for $k$ up to $25$) (Matton et al., 2019).
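To make the alignment objective in the first item concrete, here is a minimal sketch of a generic InfoNCE loss with in-batch negatives. It is not the BEPA implementation: the article does not specify the predictor, the batching, or the temperature, so the function name `info_nce`, the temperature of 0.05, and the synthetic inputs are all assumptions.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """InfoNCE with in-batch negatives.

    Row i of `anchors` should match row i of `positives`; every other row
    in the batch acts as a negative.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # (batch, batch) scaled cosines
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
batch, dim = 8, 32

# Collapsed case: every vector is nearly the same, so positives cannot be
# told apart from negatives and the loss sits near log(batch).
shared = rng.normal(size=dim)
anchors_c = shared + 1e-3 * rng.normal(size=(batch, dim))
positives_c = shared + 1e-3 * rng.normal(size=(batch, dim))
loss_collapsed = info_nce(anchors_c, positives_c)

# Well-separated case: each positive is a lightly perturbed copy of its anchor.
anchors_d = rng.normal(size=(batch, dim))
positives_d = anchors_d + 0.01 * rng.normal(size=(batch, dim))
loss_distinct = info_nce(anchors_d, positives_d)

print(loss_collapsed, np.log(batch), loss_distinct)
```

The comparison shows why such an objective pushes against collapse: a collapsed space cannot do better than chance on the in-batch matching task, so lowering the loss requires unrelated pairs to move apart.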
5. Quantitative Results and Comparative Analyses
Results quantifying the efficacy of different strategies and the severity of collapse include:
- JEPA/BERT-JEPA:
  - XNLI zero-shot transfer: Baseline XLM-RoBERTa 72.8% → BEPA-Bilingual 74.4% accuracy (+1.6 pp).
  - MLQA F1: Baseline 62.1 → BEPA-Bilingual 63.2.
  - Effective Rank (RankMe): Baseline ≈1–2 → BEPA ≈5–10 (Gillin et al., 1 Jan 2026).
- LMK Pooling:
  - MLDR NDCG@10: CLS 24.9, LMK 35.0 (Δ+10.1).
  - Robust improvement on out-of-domain multilingual retrieval (MLDR: CLS 24.9, LMK 37.2) (Doshi et al., 29 Jan 2026).
- Dimensionality Reduction:
  - For IMDB sentiment analysis, projecting onto a small number of principal components matches the accuracy of the full 768-dimensional representation (93.7% in both cases).
  - The top-1 “salient neuron” recovers nearly all of the accuracy on 2-class datasets; a small set of top-ranked neurons recovers all of the accuracy on multi-class tasks (Matton et al., 2019).
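The dimensionality-reduction result can be mimicked on synthetic data. The sketch below is an illustration under stated assumptions, not a reproduction of Matton et al.: the two-class data are generated so that the class signal lies along a single direction, the classifier is a simple nearest-centroid rule fitted and scored on the same points, and the helper names are hypothetical.

```python
import numpy as np

def pca_project(X, k):
    """Project rows of X onto the top-k principal components."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return (X - mean) @ vt[:k].T

def nearest_centroid_accuracy(X, y):
    """Accuracy of a nearest-class-centroid rule, fitted and scored on X."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(classes[dists.argmin(axis=1)] == y))

# Synthetic stand-in for fine-tuned [CLS] vectors: the label moves each
# point along one direction, and small isotropic noise fills the rest.
rng = np.random.default_rng(0)
n, dim = 400, 64
y = rng.integers(0, 2, size=n)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
X = (2 * y - 1)[:, None] * 3.0 * direction + 0.3 * rng.normal(size=(n, dim))

X_small = pca_project(X, k=1)
acc_full = nearest_centroid_accuracy(X, y)
acc_top1 = nearest_centroid_accuracy(X_small, y)
print(X.shape, X_small.shape, acc_full, acc_top1)
```

When the class information really does occupy one direction, a single principal component preserves the decision boundary, which is the situation the reported figures describe.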
6. Broader Implications and Recommendations
The collapse of [CLS] embeddings poses challenges for semantic discrimination, cross-lingual generalization, and scalability to long contexts or multi-task settings. The following recommendations follow from the empirical and theoretical literature:
- For Multilingual and Zero-Shot Tasks: Use alignment objectives (e.g., InfoNCE) to explicitly decorrelate [CLS] representations across unrelated samples; train with both monolingual and bilingual packaging for best cross-lingual generalization (Gillin et al., 1 Jan 2026).
- For Long Sequence Embeddings: Prefer distributed pooling schemes (e.g., LMK pooling) to avoid biased aggregation of information at the document head (Doshi et al., 29 Jan 2026).
- For Storage and Compute Efficiency: Compress output representations via PCA/SVD or single-dimension pruning, reducing memory and compute with negligible loss in classification accuracy (Matton et al., 2019).
- Model Design: Question and adapt the exclusive reliance on a single [CLS] vector, especially for tasks requiring wide semantic coverage or robust retrieval beyond short contexts (Doshi et al., 29 Jan 2026).
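The difference between single-vector [CLS] pooling and a distributed scheme can be sketched as follows. This is a simplified illustration rather than the LMK method itself: it assumes one landmark slot at the end of each fixed-size chunk and mean-pools the hidden states found there, whereas the actual method inserts landmark tokens before encoding. The function names, chunk size, and toy hidden states are hypothetical.

```python
def landmark_positions(seq_len, chunk_size):
    """Index of the last token of each chunk (the assumed landmark slot)."""
    return [min(start + chunk_size, seq_len) - 1
            for start in range(0, seq_len, chunk_size)]

def landmark_mean_pool(hidden_states, chunk_size):
    """Average the hidden states found at the landmark positions."""
    positions = landmark_positions(len(hidden_states), chunk_size)
    dim = len(hidden_states[0])
    return [sum(hidden_states[p][d] for p in positions) / len(positions)
            for d in range(dim)]

# Toy hidden states for a 10-token sequence: the first coordinate encodes
# the token position, so it reveals which parts of the input were pooled.
hidden_states = [[float(i), 1.0] for i in range(10)]

cls_vector = hidden_states[0]                        # single-vector pooling
pooled = landmark_mean_pool(hidden_states, chunk_size=4)

print(landmark_positions(len(hidden_states), 4))
print(cls_vector, pooled)
```

The pooled vector draws on positions spread across the whole sequence, while the single-vector variant depends entirely on what the encoder has routed into position 0.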
7. Open Questions and Future Directions
Despite advances in remedies, several open problems remain:
- Theoretical analysis of collapse in deep versus shallow attention models beyond single- or two-layer softmax attention (Wu et al., 22 May 2025).
- Generalization of landmark pooling to multi-modal or multi-vector architectures (Doshi et al., 29 Jan 2026).
- Optimal balancing between information bottlenecking (to facilitate compression) and retention of discriminative capacity for highly semantic or reasoning-intensive tasks (Matton et al., 2019).
- Systematic study of JEPA-style alignment objectives with flexible predictors for various downstream generalization and transfer scenarios (Gillin et al., 1 Jan 2026).
A plausible implication is that future embedding architectures may adopt hybrid pooling and alignment strategies to both preserve contextual richness and support efficient, analytically tractable representations for cross-lingual and long-context NLP tasks.