BERT-JEPA: Joint Embedding for Cross-Lingual Transfer
- The paper introduces BEPA, which integrates a JEPA objective with masked language modeling to transform collapsed [CLS] embeddings into a semantically meaningful thought space.
- The methodology employs bilingual paired inputs and an InfoNCE contrastive loss to align translation pairs and improve cross-lingual transfer without degrading English performance.
- The results show measurable gains on zero-shot tasks such as XNLI and MLQA, confirming BEPA’s effectiveness in producing language-invariant representations.
BERT-JEPA (BEPA) is a training paradigm that augments BERT-style transformer models with a Joint Embedding Predictive Architecture (JEPA) objective to reorganize the [CLS] embedding space. By superimposing a contrastive JEPA objective onto the canonical masked language modeling (MLM) loss, BEPA converts collapsed [CLS] embeddings into a semantically meaningful, language-invariant representation space—often referred to as a “thought space.” This yields consistent improvements in multilingual transfer tasks without degrading performance on English benchmarks (Gillin et al., 1 Jan 2026).
1. Architectural Modifications to Standard BERT
BEPA introduces architectural changes that preserve the underlying transformer parameters while modifying the input packaging strategy and output layer configuration. The base model is typically an off-the-shelf BERT-derived architecture such as xlm-roberta-base, retaining its original weights and embedding/projection layers. BEPA utilizes dual–[CLS] packaging by feeding paired sentences—often from different languages—into the encoder in the format:
```
[CLS₁] tokens_of_sentence₁ [SEP] [CLS₂] tokens_of_sentence₂ [SEP] [PAD...]
```
Segment embeddings (A for sentence₁, B for sentence₂) mark token assignments. On top of the final [CLS] embedding, BEPA attaches a lightweight predictor network g; in the referenced implementation, g is the identity function (no new parameters), but the codebase supports swapping in an MLP or other architectures.
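The dual-[CLS] packaging can be sketched as a small helper. This is an illustrative reconstruction, not the authors' code: the function name, token strings, and padding convention are our own assumptions.

```python
# Hypothetical sketch of BEPA-style dual-[CLS] input packaging.
# Token strings and helper names are illustrative, not from the paper's code.
def pack_pair(tokens_a, tokens_b, max_len=256, pad="[PAD]"):
    """Pack two tokenized sentences into one sequence with two [CLS] slots."""
    seq = ["[CLS]"] + tokens_a + ["[SEP]"] + ["[CLS]"] + tokens_b + ["[SEP]"]
    # Segment ids: A (0) covers the first sentence span, B (1) the second.
    seg = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 2)
    # Truncate, then right-pad to max_len.
    seq, seg = seq[:max_len], seg[:max_len]
    pad_n = max_len - len(seq)
    return seq + [pad] * pad_n, seg + [0] * pad_n

tokens, segments = pack_pair(["the", "cat"], ["le", "chat"], max_len=12)
```

In practice the token strings would be subword ids from the XLM-RoBERTa tokenizer; the structure of the packed sequence is the point here.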
2. Joint Embedding Predictive Architecture (JEPA) Objective
The BEPA training objective combines the standard MLM objective with a contrastive-style JEPA alignment loss. For a translation pair (same semantic content, possibly different languages), the procedure is:
- Pass A: mask all tokens of sentence₂, keep sentence₁ unmasked; extract z₁ from [CLS₁]
- Pass B: mask all tokens of sentence₁, keep sentence₂ unmasked; extract z₂ from [CLS₂]
- Compute the prediction ẑ₂ = g(z₁); in practice, g is the identity
- The InfoNCE alignment loss over a batch of N pairs is:

$$\mathcal{L}_{\text{align}} = -\log \frac{\exp(\mathrm{sim}(\hat{z}_2, z_2)/\tau)}{\sum_{k=1}^{N} \exp(\mathrm{sim}(\hat{z}_2, z_2^{(k)})/\tau)}$$

where sim(·,·) is cosine similarity and τ is the temperature.
The total BEPA loss is a linear combination:

$$\mathcal{L}_{\text{BEPA}} = \mathcal{L}_{\text{MLM}} + \lambda\,\mathcal{L}_{\text{align}}$$

where $\mathcal{L}_{\text{MLM}}$ is the canonical cross-entropy loss over 15% randomly masked tokens, and λ weights the alignment term.
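The InfoNCE alignment term with in-batch negatives can be sketched as follows. This is a minimal NumPy reconstruction under the assumptions stated in this section (identity predictor, cosine similarity); the function name is ours.

```python
import numpy as np

# Minimal in-batch InfoNCE sketch of the alignment term described above.
# Assumes the identity predictor (g(z) = z) and cosine similarity.
def infonce_align(z1, z2, tau=0.07):
    """z1, z2: (N, d) [CLS] embeddings from the two passes over N pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                  # (N, N) scaled cosine similarities
    # Row i's positive is column i; the other columns are in-batch negatives.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

Note that the temperature value here (0.07, a common contrastive-learning default) is an assumption; the paper's τ is not restated in this summary.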
3. Collapsed [CLS] Embeddings and Emergence of Language-Invariant Structure
Standard BERT architectures exhibit a collapsed [CLS] pathology: their embedding space is dominated by a single tight cluster (cosine similarity ≈ 0.99) regardless of semantic similarity, reducing utility for sentence-level semantics. BEPA's JEPA alignment explicitly pulls translation pairs (positive pairs) closer (cosine ≈ 0.97) and repels unrelated pairs (in-batch negatives, cosine ≈ 0.45–0.60), yielding a semantically meaningful manifold.
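The collapse diagnostic is easy to reproduce on synthetic data. The sketch below (our own toy illustration, not the paper's analysis) shows that embeddings drawn from a tight cluster have near-1.0 mean pairwise cosine similarity, while a structured space does not.

```python
import numpy as np

# Toy illustration of the collapsed-[CLS] diagnostic: a tight cluster yields
# near-1.0 pairwise cosine similarity regardless of content.
def mean_pairwise_cosine(E):
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    n = len(E)
    return (sims.sum() - n) / (n * (n - 1))   # mean of off-diagonal entries

rng = np.random.default_rng(0)
center = rng.normal(size=768)
collapsed = center + 0.01 * rng.normal(size=(32, 768))   # tight cluster
spread = rng.normal(size=(32, 768))                      # structured space
```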
Diagnostic analyses reveal:
- Principal Component Analysis: base XLM-RoBERTa concentrates 78% of the variance in the first PC; BEPA spreads variance across 10+ PCs (first PC ≈ 34%).
- t-SNE: baseline produces tight clusters per language; BEPA generates interleaved, cross-lingual clusters.
This restructuring reorganizes the [CLS] “thought space,” supporting cross-lingual analogues.
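The PCA diagnostic above can be sketched as a variance-ratio computation via SVD. This is an illustrative reconstruction on synthetic data; the function name and the toy inputs are our own.

```python
import numpy as np

# Sketch of the PCA diagnostic: fraction of total variance captured by the
# first principal component, computed from the SVD of centered embeddings.
def first_pc_variance_ratio(E):
    E = E - E.mean(axis=0)
    s = np.linalg.svd(E, compute_uv=False)   # singular values
    var = s ** 2                             # variance per principal component
    return var[0] / var.sum()

rng = np.random.default_rng(0)
direction = rng.normal(size=64)
# Collapse-like embeddings: variance dominated by a single direction.
dominated = rng.normal(size=(200, 1)) * direction + 0.1 * rng.normal(size=(200, 64))
isotropic = rng.normal(size=(200, 64))       # variance spread across many PCs
```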
4. Training Protocol, Data, and Hyperparameters
BEPA training employs bilingual parallel corpora: 10k pairs from OPUS-100 plus an English–Swahili bootstrapped dataset. Ablations use Flores-101 (2k pairs) and a small OPUS subset. Benchmarks include XNLI, MLQA (SQuAD-v1.1 zero-shot), and GLUE.
Preprocessing utilizes XLM-RoBERTa tokenization, input length capped at 256, with appropriate [CLS], [SEP], [PAD] tokens. Training hyperparameters are:
- AdamW optimizer, weight decay 0.01, 500-step warmup
- Batch size 16, 10 epochs on finetuning corpus
- Mask probability 15%; JEPA alignment loss weighted by λ
- Single NVIDIA RTX A5500 GPU
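The reported hyperparameters can be gathered into a single config object. The field names are our own, and values not restated in this summary (the learning rate and λ) are left unset rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

# Hyperparameters as reported in this section; field names are illustrative.
# learning_rate and jepa_weight are not restated here, so they default to None.
@dataclass
class BepaConfig:
    max_seq_len: int = 256
    mask_prob: float = 0.15
    weight_decay: float = 0.01
    warmup_steps: int = 500
    batch_size: int = 16
    epochs: int = 10
    learning_rate: Optional[float] = None  # not given in this summary
    jepa_weight: Optional[float] = None    # λ in the total loss

cfg = BepaConfig()
```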
5. Evaluation: Benchmarks and Experimental Results
The BEPA framework is evaluated across several tasks:
GLUE (English only):
- XLM-RoBERTa baseline: average 88.07%
- BEPA-Mono: 88.12%
- BEPA-Bi: 88.64%

No penalty on English tasks; small increases.
XNLI (zero-shot cross-lingual transfer, 14 languages):
- XLM-RoBERTa baseline: avg accuracy 0.728
- BEPA-Mono: 0.732 (+0.4)
- BEPA-Bi: 0.744 (+1.6)

Ablations (2k Flores / small OPUS / 10k OPUS) confirm the Bi > Mono > Base ordering.
MLQA (zero-shot extractive QA):
- XLM-RoBERTa baseline: cross-lingual F1 typically <30; monolingual ≈80 F1
- BEPA-Bi: monolingual ≈81 F1; cross-lingual off-diagonals increase by 10–15 points
Ablation highlights:
- Bilingual packaging yields superior results versus monolingual.
- InfoNCE alignment loss outperforms MSE or cosine alternatives.
- The chosen λ appears optimal in the reported sweeps; further λ sweeps or SigREG are suggested for future studies.
| Benchmark | XLM-RoBERTa Base | BEPA-Mono | BEPA-Bi |
|---|---|---|---|
| GLUE (avg, %) | 88.07 | 88.12 | 88.64 |
| XNLI (acc) | 0.728 | 0.732 | 0.744 |
| MLQA F1 (mono) | ~80 | — | ~81 |
| MLQA F1 (cross) | <30 | — | +10–15 |
6. Mechanistic Insights and Significance
BEPA finetuning directly reorganizes sentence-level [CLS] embeddings into a high-rank, semantically structured, language-invariant “thought space.” Semantic analogues from disparate languages occupy proximate regions in this manifold. Downstream impacts include systematic gains in zero-shot cross-lingual tasks (XNLI, MLQA) and retention of English benchmark performance (GLUE, SQuAD v1.1).
The mechanism aligns latent representations to semantic rather than language-specific features, facilitating improved linguistic generalization. This suggests that BEPA can be readily adopted on any BERT-style architecture with minimal parameter changes, and may become a standard protocol for cross-lingual adaptation and transfer learning tasks.
7. Limitations and Future Directions
Ablations indicate that bilingual input packaging and InfoNCE are critical for optimal alignment. The current approach uses a fixed , but further sweeps and alternative regularizers (e.g., SigREG) are proposed. A plausible implication is that extending JEPA objectives beyond sentence-level pairs or incorporating richer predictors could further enhance manifold structuring and downstream performance (Gillin et al., 1 Jan 2026).