SENSIA: Symmetric Interlingual Alignment
- SENSIA is a multilingual framework that explicitly aligns latent sense representations using symmetric contrastive objectives.
- It enforces local and global semantic isomorphism, yielding state-of-the-art data efficiency with significantly reduced bitext.
- The model employs a three-phase curriculum on a GPT-2 based Backpack architecture to preserve fluency and robust semantic geometry.
SENse-based Symmetric Interlingual Alignment (SENSIA) conceptualizes multilingual language modeling as the explicit adaptation and alignment of latent sense-level representations rather than indirect parameter or embedding sharing. SENSIA enforces both local and global semantic isomorphism across languages by aligning the internal sense mixtures and contextual representations in parallel data through symmetric contrastive objectives, while preserving fluency in the target language via language modeling. This paradigm yields state-of-the-art data efficiency and robust cross-lingual transfer, outperforming traditional multilingual adaptation methods and competing closely with much larger monolingual and multilingual models (Cruz et al., 15 Jan 2026).
1. Motivation and Problem Formalization
Naïve multilingual LLMs often presume that a single shared parameter or embedding space can adequately capture word senses across typologically diverse languages. This unit-of-meaning mismatch frequently results in suboptimal transfer—word senses may cluster according to orthography or frequency rather than meaning, undermining cross-lingual isomorphism. SENSIA addresses this by explicitly coordinating latent sense mixtures and contextual vectors between paired sentences in parallel corpora. This is achieved through symmetric objectives that encourage linguistically motivated sense distinctions, data-efficient mapping (requiring 2–4× less bitext than from-scratch baselines), and stable alignment for downstream generalization (Cruz et al., 15 Jan 2026).
2. Model Architecture and Adaptation Mechanism
SENSIA builds on the Backpack language model, itself a GPT-2 variant augmented with per-token sense modules:
- Backpack Base:
- Uses a shared GPT-2 BPE vocabulary $\mathcal{V}$.
- Transformer backbone: GPT-2 small (124M), medium (345M), or large (762M) parameters.
- Sense module for each token $x \in \mathcal{V}$: maintains $k$ sense bases $s_1(x), \dots, s_k(x) \in \mathbb{R}^d$.
- Given the token embedding $e_x$, compute the latent senses $s_\ell(x)$, $\ell = 1, \dots, k$.
- Mixture weights: $\alpha = \mathrm{softmax}(z/\tau)$ (controlled by pool-temperature $\tau$).
- Final representation: $o_i = \sum_{\ell=1}^{k} \alpha_\ell \, s_\ell(x_i)$, replacing the transformer input at position $i$.
- A causal LM head computes $p_\theta(x_{t+1} \mid x_{\le t})$.
- Cross-lingual Adaptation:
- Alignment phase: sense/context alignment with high contrastive weights, no LM loss.
- Joint phase: mix of contrastive (sense/context) and LM objectives.
- Polish phase: prioritize target-language LM, freeze sense bases and weighting to prevent drift from English geometry.
- All curriculum weights and temperatures follow a smooth cosine schedule.
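The sense-mixing step can be sketched in NumPy; the shapes, the temperature default, and the toy data below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sense_mixture(sense_bases: np.ndarray, logits: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Mix k latent sense vectors into one token representation.

    sense_bases: (k, d) sense vectors for one token (hypothetical shapes).
    logits:      (k,) unnormalized mixture scores.
    tau:         pool temperature; lower tau -> sharper, more single-sense mixing.
    """
    z = logits / tau
    z = z - z.max()                        # numerical stability
    weights = np.exp(z) / np.exp(z).sum()  # softmax mixture weights
    return weights @ sense_bases           # (d,) convex combination of senses

# Toy example: 4 senses in a 3-dimensional space.
rng = np.random.default_rng(0)
bases = rng.normal(size=(4, 3))
scores = np.array([2.0, 0.5, -1.0, 0.0])

soft = sense_mixture(bases, scores, tau=1.0)  # soft mixture (used at inference)
hard = bases[scores.argmax()]                 # Top-1 ablation, cf. Section 6
```

As the temperature approaches zero, the soft mixture collapses onto the Top-1 sense, which is the contrast the inference ablation in Section 6 measures.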
This architecture emphasizes explicit modeling and mixing of multiple latent senses and aligns these sense mixtures between aligned sentence pairs, in contrast to previous sense-agnostic or single-vector approaches (Liu et al., 2021).
3. Symmetric Contrastive Objectives and Loss Structure
SENSIA's training objective systematically aligns interlingual sense and context representations while ensuring target-language linguistic adequacy:
- Sense-Mixture Alignment: For each parallel sentence pair, compute L2-normalized mean-pooled sense-mixed vectors $(\bar{s}^{\text{en}}, \bar{s}^{\text{tgt}})$.
- Context Alignment: Use L2-normalized transformer hidden states at the last non-pad token $(h^{\text{en}}, h^{\text{tgt}})$.
Alignment is enforced by a symmetric InfoNCE loss
$$
\mathcal{L}_{\text{NCE}}(u, v) = -\frac{1}{2N} \sum_{i=1}^{N} \left[ \log \frac{\exp(\sigma_{ii}/\tau)}{\sum_{j} \exp(\sigma_{ij}/\tau)} + \log \frac{\exp(\sigma_{ii}/\tau)}{\sum_{j} \exp(\sigma_{ji}/\tau)} \right],
$$
where the $\sigma_{ij} = u_i \cdot v_j$ are appropriately normalized cross-batch similarities. The total loss
$$
\mathcal{L} = \lambda_{\text{sense}}\, \mathcal{L}_{\text{NCE}}(\bar{s}^{\text{en}}, \bar{s}^{\text{tgt}}) + \lambda_{\text{ctx}}\, \mathcal{L}_{\text{NCE}}(h^{\text{en}}, h^{\text{tgt}}) + \lambda_{\text{LM}}\, \mathcal{L}_{\text{LM}}
$$
combines sense alignment, context alignment, and label-smoothed language modeling with phase-tuned weights.
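A minimal NumPy sketch of a symmetric InfoNCE of this shape; the batch size, temperature value, and toy vectors are assumptions for illustration:

```python
import numpy as np

def symmetric_info_nce(a: np.ndarray, b: np.ndarray, tau: float = 0.07) -> float:
    """Symmetric InfoNCE over a batch of paired representations.

    a, b: (N, d) L2-normalized source/target vectors (e.g. pooled sense
    mixtures or last-token hidden states). Matched rows are positives;
    all other cross-batch rows act as negatives. tau is an assumed default.
    """
    sims = (a @ b.T) / tau  # (N, N) cross-batch similarity matrix

    def ce_diag(logits: np.ndarray) -> float:
        # cross-entropy with the diagonal (the matched pair) as the label
        logits = logits - logits.max(axis=1, keepdims=True)
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(logp)))

    # average the source->target and target->source directions
    return 0.5 * (ce_diag(sims) + ce_diag(sims.T))

# Perfectly aligned pairs give a near-zero loss; mismatched pairs a large one.
x = np.eye(4)  # 4 orthonormal toy "sense" vectors
loss_aligned = symmetric_info_nce(x, x)
loss_shuffled = symmetric_info_nce(x, x[::-1])
```

Symmetry matters here: averaging both directions keeps neither language's representation space privileged as the anchor.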
Training Algorithm:
A curriculum alternates alignment, joint, and polish phases, annealing the weight of each loss on a cosine schedule. During the polish phase, sense parameters are frozen to preserve the established interlingual geometry (Cruz et al., 15 Jan 2026).
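A cosine-annealed weight schedule of this kind can be sketched as follows; the endpoint values and phase length are hypothetical, not taken from the paper:

```python
import math

def cosine_anneal(step: int, total: int, start: float, end: float) -> float:
    """Smoothly interpolate a loss weight from `start` to `end` over `total` steps."""
    t = min(max(step / total, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * t))

# Illustrative phase schedule (endpoint weights are assumptions):
# the LM weight ramps up from 0 while the contrastive weight anneals down.
lm_w  = [cosine_anneal(s, 100, 0.0, 1.0) for s in range(0, 101, 25)]
ctr_w = [cosine_anneal(s, 100, 1.0, 0.2) for s in range(0, 101, 25)]
```

The cosine shape avoids the abrupt objective switches that a step schedule would introduce at phase boundaries.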
4. Empirical Evaluation and Quantitative Results
SENSIA is evaluated across four typologically diverse languages (Estonian, Indonesian, Swahili, Turkish) and standardized English-to-X transfer tasks:
- Intrinsic Evaluation: FLORES-200 recall@1, target-side sense entropy, dev/test perplexity.
- Downstream (XCOPA, XStoryCloze, Belebele): Candidates scored by conditional likelihood, with optional PMI correction and a length penalty.
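This scoring rule can be sketched as below; the penalty exponent, the way the prior is obtained, and the toy log-probabilities are assumptions for illustration:

```python
def score_candidate(logp_cond: float, logp_prior: float, n_tokens: int,
                    use_pmi: bool = True, length_alpha: float = 1.0) -> float:
    """Score one answer candidate for multiple-choice evaluation.

    logp_cond:  log p(answer | context), summed over answer tokens
    logp_prior: log p(answer) under a context-free prompt (PMI correction)
    n_tokens:   answer length in tokens, used for the length penalty
    """
    s = logp_cond - (logp_prior if use_pmi else 0.0)
    return s / (n_tokens ** length_alpha)  # length-normalize

# Pick the better of two hypothetical candidates: (cond, prior, length).
cands = [(-12.0, -10.0, 4), (-9.0, -3.0, 3)]
best = max(range(len(cands)), key=lambda i: score_candidate(*cands[i]))
```

The PMI correction discounts answers that are likely under the LM regardless of context; the length penalty keeps short and long candidates comparable.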
Benchmarks:
| Model | Downstream Avg (%) | Tokens (M) | Data vs. Goldfish |
|---|---|---|---|
| GPT-2+FT | ~49.8 | varies | - |
| MCL | ~51.0 | varies | - |
| SENSIA | ~53.3 | 83–149 | 2–4× less |
| Goldfish | ~52.7 | ~600 | baseline (1×) |
| XGLM/BLOOMZ | 55–67 | >1B | - |
SENSIA matches or surpasses strong monolingual baselines (Goldfish, 1GB data) while using only 246–509MB of parallel bitext (83–149M tokens) per language. In seven of ten configurations SENSIA is within ±1pp of Goldfish, outperforming it in four. It also narrows the accuracy gap to 7B+ parameter models (e.g., XGLM-7B5, BLOOMZ-7B1), coming within 0.6pp of XGLM on XCOPA-Turkish (Cruz et al., 15 Jan 2026).
5. Analysis of Learned Sense Geometry
SENSIA’s sense alignment preserves both local and global semantic structure across languages:
- Local Geometry (Sense Topology): Gram matrix correlation of sense vectors shows improved transfer:
| Language | SENSIA | Control | Δ |
|---|---|---|---|
| Estonian | 0.27 | 0.17 | +0.10 |
| Indonesian | 0.30 | 0.21 | +0.09 |
| Swahili | 0.26 | 0.17 | +0.09 |
| Turkish | 0.25 | 0.16 | +0.09 |
- Global Geometry: Procrustes analysis on 10k sense mixtures per word yields cosine similarities:
| Language | SENSIA | Control | Δ cos |
|---|---|---|---|
| Estonian | 0.36 | 0.29 | +0.08 |
| Indonesian | 0.44 | 0.33 | +0.11 |
| Swahili | 0.42 | 0.31 | +0.11 |
| Turkish | 0.35 | 0.26 | +0.09 |
Interpretation: Up to a near-orthogonal rotation, the sense manifold geometry of English is preserved in the target language (Cruz et al., 15 Jan 2026).
This suggests that SENSIA achieves robust isomorphism at both the micro (sense topology) and macro (sense-manifold geometry) levels, confirming the effectiveness of explicit sense-alignment.
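Both geometry checks can be sketched on synthetic data; the sample sizes, dimensions, and noise level are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def gram_correlation(A: np.ndarray, B: np.ndarray) -> float:
    """Local check: Pearson correlation between the pairwise-similarity
    (Gram) matrices of matched sense vectors in two languages."""
    iu = np.triu_indices(len(A), k=1)  # off-diagonal entries only
    return float(np.corrcoef((A @ A.T)[iu], (B @ B.T)[iu])[0, 1])

def procrustes_cosine(src: np.ndarray, tgt: np.ndarray) -> float:
    """Global check: mean cosine similarity after the optimal orthogonal
    (Procrustes) map src -> tgt, R = U V^T from the SVD of src^T tgt."""
    U, _, Vt = np.linalg.svd(src.T @ tgt)
    mapped = src @ (U @ Vt)
    cos = (mapped * tgt).sum(axis=1) / (
        np.linalg.norm(mapped, axis=1) * np.linalg.norm(tgt, axis=1))
    return float(cos.mean())

rng = np.random.default_rng(0)
en = rng.normal(size=(100, 16))                    # toy "English" sense mixtures
rot, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # hidden orthogonal map
tgt = en @ rot + 0.1 * rng.normal(size=(100, 16))  # rotated copy plus noise
rand = rng.normal(size=(100, 16))                  # unrelated control

local_aligned, local_ctrl = gram_correlation(en, tgt), gram_correlation(en, rand)
global_aligned, global_ctrl = procrustes_cosine(en, tgt), procrustes_cosine(en, rand)
```

On the synthetic target, both metrics are near their maxima because a rotation preserves all inner products, which is exactly the "preserved up to rotation" property the tables above report for real sense mixtures.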
6. Ablation Studies and Scalability
Extensive ablation validates design choices:
- Sense-mixture modeling: Inference ablations (full mixture vs. Top-1 vs. uniform) show that fluent prediction demands the soft mixture; e.g., Swahili cross-entropy: Full (2.98) ≪ Top-1 (8.77) ≪ Uniform (10.10).
- Loss Component/Phase Removal: LM loss is indispensable (otherwise degenerate perplexity). All curriculum phases contribute: joint phase (contrastive+LM) is most critical, sense loss anchors semantics, context loss sharpens decisions, polish phase refines fluency.
- Scaling: Larger Backpack models monotonically improve accuracy (+2.2pp avg) and reduce perplexity.
- Low-resource adaptation: Using only 100k pairs (versus 2M), accuracy degrades gracefully by ∼2.5–3.3pp.
- Script Sensitivity: English BPE tokenization on Chinese results in a 3–6pp deficit to monolingual baselines, indicating BPE’s Latin-script bias and the need for script-compatible tokenization (Cruz et al., 15 Jan 2026).
7. Design Recommendations and Comparison to Related Work
Design insights include:
- Cosine-annealed curriculum with a three-phase schedule yields stable and effective alignment.
- Freezing sense bases and weights during the polish phase is key to preserving semantic geometry.
- Shared BPE suffices for Latin scripts; non-Latin scripts require alternate tokenization.
- Approximately 2 million high-quality parallel pairs are sufficient for strong cross-lingual alignment.
SENSIA is directly preceded by sense-aware and symmetric cross-lingual objectives, such as Bi-SaELMo and SaBERT, which use sense-specific softmax heads and dictionary-augmented sense selection for explicit alignment. Both frameworks report up to ~2% absolute improvement over strong baselines in word sense disambiguation and zero-shot transfer tasks, demonstrating the benefit of sense-level modeling for cross-lingual alignment (Liu et al., 2021).
A plausible implication is that future large-scale multilingual models may further benefit from combining explicit sense-mixing modules, symmetric loss structures, and curriculum-driven alignment, particularly when resource constraints limit parallel data availability or coverage. The extension of such models to 7B+ parameter backbones and instruction-tuned, document-level supervision remains an open direction for further research (Cruz et al., 15 Jan 2026).
Primary references:
- (Cruz et al., 15 Jan 2026) "Multilinguality as Sense Adaptation"
- (Liu et al., 2021) "Towards Multi-Sense Cross-Lingual Alignment of Contextual Embeddings"