BERT-JEPA: Joint Embedding for Cross-Lingual Transfer
- The paper introduces BEPA, which integrates a JEPA objective with masked language modeling to transform collapsed [CLS] embeddings into a semantically meaningful thought space.
- The methodology employs bilingual paired inputs and an InfoNCE contrastive loss to align translation pairs and improve cross-lingual transfer without degrading English performance.
- The results show measurable gains on zero-shot tasks such as XNLI and MLQA, confirming BEPA’s effectiveness in producing language-invariant representations.
BERT-JEPA (BEPA) is a training paradigm that augments BERT-style transformer models with a Joint Embedding Predictive Architecture (JEPA) objective to reorganize the [CLS] embedding space. By superimposing a contrastive JEPA objective onto the canonical masked language modeling (MLM) loss, BEPA converts collapsed [CLS] embeddings into a semantically meaningful, language-invariant representation space—often referred to as a “thought space.” This yields consistent improvements in multilingual transfer tasks without degrading performance on English benchmarks (Gillin et al., 1 Jan 2026).
1. Architectural Modifications to Standard BERT
BEPA introduces architectural changes that preserve the underlying transformer parameters while modifying the input packaging strategy and output layer configuration. The base model is typically an off-the-shelf BERT-derived architecture such as xlm-roberta-base, retaining its original weights and embedding/projection layers. BEPA utilizes dual–[CLS] packaging by feeding paired sentences—often from different languages—into the encoder in the format:
```
[CLS₁] tokens_of_sentence₁ [SEP] [CLS₂] tokens_of_sentence₂ [SEP] [PAD...]
```
Segment embeddings (A for sentence₁, B for sentence₂) mark token assignments. On top of the final [CLS] embedding, BEPA attaches a lightweight predictor network g; in the referenced implementation, g is the identity function (no new parameters), but the codebase supports swapping in an MLP or other architectures.
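The dual-[CLS] packaging can be sketched as a small helper. This is an illustrative reconstruction, not the authors' code: the function name, token strings, and padding convention are our own assumptions.

```python
# Hypothetical sketch of BEPA-style dual-[CLS] input packaging.
# Token strings and helper names are illustrative, not from the paper's code.
def pack_pair(tokens_a, tokens_b, max_len=256, pad="[PAD]"):
    """Pack two tokenized sentences into one sequence with two [CLS] slots."""
    seq = ["[CLS]"] + tokens_a + ["[SEP]"] + ["[CLS]"] + tokens_b + ["[SEP]"]
    # Segment ids: A (0) covers the first sentence span, B (1) the second.
    seg = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 2)
    # Truncate, then right-pad to max_len.
    seq, seg = seq[:max_len], seg[:max_len]
    pad_n = max_len - len(seq)
    return seq + [pad] * pad_n, seg + [0] * pad_n

tokens, segments = pack_pair(["the", "cat"], ["le", "chat"], max_len=12)
```

In practice the token strings would be subword ids from the XLM-RoBERTa tokenizer; the structure of the packed sequence is the point here.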
2. Joint Embedding Predictive Architecture (JEPA) Objective
The BEPA training objective combines the standard MLM objective with a contrastive-style JEPA alignment loss. For a translation pair (same semantic content, possibly different languages), the procedure is:
- Pass A: mask all tokens of sentence₂, keep sentence₁ unmasked; extract z₁ from [CLS₁]
- Pass B: mask all tokens of sentence₁, keep sentence₂ unmasked; extract z₂ from [CLS₂]
- Compute the prediction ẑ₂ = g(z₁); in practice, g is the identity
- The InfoNCE alignment loss over a batch of N pairs is:

$$\mathcal{L}_{\text{align}} = -\log \frac{\exp(\mathrm{sim}(\hat{z}_2, z_2)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(\hat{z}_2, z_2^{(k)})/\tau)}$$

where sim(·,·) is cosine similarity and τ is the temperature.
The total BEPA loss is a linear combination:

$$\mathcal{L}_{\text{BEPA}} = \mathcal{L}_{\text{MLM}} + \lambda\,\mathcal{L}_{\text{align}}$$

where $\mathcal{L}_{\text{MLM}}$ is the canonical cross-entropy loss over 15% randomly masked tokens, and λ weights the alignment term.
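The InfoNCE alignment term with in-batch negatives can be sketched as follows. This is a minimal NumPy reconstruction under the assumptions stated in this section (identity predictor, cosine similarity); the function name is ours.

```python
import numpy as np

# Minimal in-batch InfoNCE sketch of the alignment term described above.
# Assumes the identity predictor (g(z) = z) and cosine similarity.
def infonce_align(z1, z2, tau=0.07):
    """z1, z2: (N, d) [CLS] embeddings from the two passes over N pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                  # (N, N) scaled cosine similarities
    # Row i's positive is column i; the other columns are in-batch negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Note that the temperature value here (0.07, a common contrastive-learning default) is an assumption; the paper's τ is not restated in this summary.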
3. Collapsed [CLS] Embeddings and Emergence of Language-Invariant Structure
Standard BERT architectures exhibit a collapsed [CLS] pathology: their embedding space is dominated by a single tight cluster (cosine similarity ≈ 0.99) regardless of semantic similarity, reducing utility for sentence-level semantics. BEPA's JEPA alignment explicitly pulls translation pairs (positive pairs) closer (cosine ≈ 0.97) and repels unrelated pairs (in-batch negatives, cosine ≈ 0.45–0.60), yielding a semantically meaningful manifold.
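The collapse diagnostic is easy to reproduce on synthetic data. The sketch below (our own toy illustration, not the paper's analysis) shows that embeddings drawn from a tight cluster have near-1.0 mean pairwise cosine similarity, while a structured space does not.

```python
import numpy as np

# Toy illustration of the collapsed-[CLS] diagnostic: a tight cluster yields
# near-1.0 pairwise cosine similarity regardless of content.
def mean_pairwise_cosine(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    n = len(E)
    return (sims.sum() - n) / (n * (n - 1))   # mean of off-diagonal entries

rng = np.random.default_rng(0)
center = rng.normal(size=768)
collapsed = center + 0.01 * rng.normal(size=(32, 768))   # tight cluster
spread = rng.normal(size=(32, 768))                      # structured space
```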
Diagnostic analyses reveal:
- Principal Component Analysis: base XLM-RoBERTa concentrates 78% of the variance in the first PC; BEPA spreads variance across 10+ PCs (first PC ≈ 34%).
- t-SNE: baseline produces tight clusters per language; BEPA generates interleaved, cross-lingual clusters.
This restructuring reorganizes the [CLS] “thought space,” supporting cross-lingual analogues.
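The PCA diagnostic above can be sketched as a variance-ratio computation via SVD. This is an illustrative reconstruction on synthetic data; the function name and the toy inputs are our own.

```python
import numpy as np

# Sketch of the PCA diagnostic: fraction of total variance captured by the
# first principal component, computed from the SVD of centered embeddings.
def first_pc_variance_ratio(E):
    E = E - E.mean(axis=0)
    s = np.linalg.svd(E, compute_uv=False)   # singular values
    var = s ** 2                             # variance per principal component
    return var[0] / var.sum()

rng = np.random.default_rng(0)
direction = rng.normal(size=64)
# Collapse-like embeddings: variance dominated by a single direction.
dominated = rng.normal(size=(200, 1)) * direction + 0.1 * rng.normal(size=(200, 64))
isotropic = rng.normal(size=(200, 64))       # variance spread across many PCs
```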
4. Training Protocol, Data, and Hyperparameters
BEPA training employs bilingual parallel corpora: 10k pairs from OPUS-100 plus an English–Swahili bootstrapped dataset. Ablations use Flores-101 (2k pairs) and a small OPUS subset. Benchmarks include XNLI, MLQA (SQuAD-v1.1 zero-shot), and GLUE.
Preprocessing utilizes XLM-RoBERTa tokenization, input length capped at 256, with appropriate [CLS], [SEP], [PAD] tokens. Training hyperparameters are:
- AdamW optimizer, weight decay 0.01, 500-step warmup
- Batch size 16, 10 epochs on finetuning corpus
- Mask probability 15%; JEPA alignment loss weighted by λ
- Single NVIDIA RTX A5500 GPU
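The reported hyperparameters can be gathered into a single config object. The field names are our own, and values not restated in this summary (the learning rate and λ) are left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

# Hyperparameters as reported in this section; field names are illustrative.
# learning_rate and jepa_weight are not restated here, so they default to None.
@dataclass
class BepaConfig:
    max_seq_len: int = 256
    mask_prob: float = 0.15
    weight_decay: float = 0.01
    warmup_steps: int = 500
    batch_size: int = 16
    epochs: int = 10
    learning_rate: Optional[float] = None  # not given in this summary
    jepa_weight: Optional[float] = None    # λ in the total loss

cfg = BepaConfig()
```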
5. Evaluation: Benchmarks and Experimental Results
The BEPA framework is evaluated across several tasks:
GLUE (English only):
- XLM-RoBERTa baseline: average 88.07%
- BEPA-Mono: 88.12%
- BEPA-Bi: 88.64%

No penalty on English tasks; small increases.
XNLI (zero-shot cross-lingual transfer, 14 languages):
- XLM-RoBERTa baseline: avg accuracy 0.728
- BEPA-Mono: 0.732 (+0.4)
- BEPA-Bi: 0.744 (+1.6)

Ablations (2k Flores / small OPUS / 10k OPUS) confirm the Bi > Mono > Base ordering.
MLQA (zero-shot extractive QA):
- XLM-RoBERTa baseline: cross-lingual F1 typically <30; monolingual ≈80 F1
- BEPA-Bi: monolingual ≈81 F1; cross-lingual off-diagonals increase by 10–15 points
Ablation highlights:
- Bilingual packaging yields superior results versus monolingual.
- InfoNCE alignment loss outperforms MSE or cosine alternatives.
- The chosen λ appears optimal in the reported sweeps; further λ sweeps or SigREG are suggested for future studies.
| Benchmark | XLM-RoBERTa Base | BEPA-Mono | BEPA-Bi |
|---|---|---|---|
| GLUE (avg, %) | 88.07 | 88.12 | 88.64 |
| XNLI (acc) | 0.728 | 0.732 | 0.744 |
| MLQA F1 (mono) | ~80 | — | ~81 |
| MLQA F1 (cross) | <30 | — | +10–15 |
6. Mechanistic Insights and Significance
BEPA finetuning directly reorganizes sentence-level [CLS] embeddings into a high-rank, semantically structured, language-invariant “thought space.” Semantic analogues from disparate languages occupy proximate regions in this manifold. Downstream impacts include systematic gains in zero-shot cross-lingual tasks (XNLI, MLQA) and retention of English benchmark performance (GLUE, SQuAD v1.1).
The mechanism aligns latent representations to semantic rather than language-specific features, facilitating improved linguistic generalization. This suggests that BEPA can be readily adopted on any BERT-style architecture with minimal parameter changes, and may become a standard protocol for cross-lingual adaptation and transfer learning tasks.
7. Limitations and Future Directions
Ablations indicate that bilingual input packaging and InfoNCE are critical for optimal alignment. The current approach uses a fixed , but further sweeps and alternative regularizers (e.g., SigREG) are proposed. A plausible implication is that extending JEPA objectives beyond sentence-level pairs or incorporating richer predictors could further enhance manifold structuring and downstream performance (Gillin et al., 1 Jan 2026).