mSimCSE: Multilingual Sentence Embeddings
- mSimCSE is a multilingual extension of SimCSE that employs contrastive learning on English data to create a shared, language-agnostic embedding space without parallel corpora.
- The method utilizes unsupervised, English NLI, and cross-lingual supervision regimes to fine-tune multilingual Transformers, significantly boosting performance on retrieval and semantic similarity tasks.
- mSimCSE reduces reliance on massive parallel data, achieving robust low-resource transfer and performance approaching that of fully supervised models across diverse languages.
mSimCSE is a multilingual extension of SimCSE, a contrastive learning framework for generating universal cross-lingual sentence embeddings. By finetuning pre-trained multilingual Transformers solely on English data, mSimCSE can induce a shared, language-agnostic sentence representation space without requiring parallel corpora, enabling strong cross-lingual transfer for retrieval, semantic textual similarity (STS), and zero-shot classification tasks across high- and low-resource languages (Wang et al., 2022).
1. Motivation and Background
Standard pre-trained multilingual encoders such as mBERT and XLM-RoBERTa (“XLM-R”) yield sentence embeddings that cluster separately for each language, complicating cross-lingual retrieval and semantic similarity tasks. Previous approaches, including LASER and LaBSE, bring sentence representations into alignment by fine-tuning on massive parallel corpora—upwards of billions of sentence pairs—an approach that is costly, especially for low-resource languages. SimCSE introduced an effective contrastive learning paradigm for English sentence embeddings by drawing together semantically equivalent sentence representations via dropout or NLI supervision. mSimCSE adapts this method to the multilingual domain, showing that contrastive fine-tuning with only English sentences (without parallel or cross-lingual data) suffices to yield well-aligned universal embeddings (Wang et al., 2022).
2. Model Architecture and Training Objectives
mSimCSE uses a pre-trained multilingual Transformer encoder (XLM-RoBERTa-large). Each input sentence is encoded, and the [CLS] token from the final layer (or an analogous pooling) is extracted as its embedding. For a batch of $N$ positive pairs $(x_i, x_i^+)$, embeddings are denoted $h_i$ and $h_i^+$. The training employs a contrastive InfoNCE objective:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i,\, h_i^+)/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i,\, h_j^+)/\tau}}$$

where $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity and $\tau$ is a temperature hyperparameter ($0.05$).
Supervised variants augment the denominator with hard negatives (e.g., contradiction pairs from NLI). No additional regularization is imposed beyond standard model dropout.
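As a sketch (not the authors' implementation), the InfoNCE objective with in-batch positives and optional hard negatives can be written in plain Python; `cosine` and `info_nce_loss` are hypothetical helper names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce_loss(anchors, positives, negatives=None, tau=0.05):
    """Batch InfoNCE: each anchor is pulled toward its own positive and
    pushed from all other in-batch positives; supervised variants also
    append hard negatives (e.g., NLI contradictions) to the denominator."""
    n = len(anchors)
    losses = []
    for i in range(n):
        logits = [cosine(anchors[i], positives[j]) / tau for j in range(n)]
        if negatives is not None:
            logits += [cosine(anchors[i], negatives[j]) / tau for j in range(n)]
        log_denom = math.log(sum(math.exp(z) for z in logits))
        losses.append(-(cosine(anchors[i], positives[i]) / tau) + log_denom)
    return sum(losses) / n
```

With perfectly aligned pairs the loss approaches zero; adding hard negatives to the denominator strictly increases it, which is what drives the representations apart.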
3. Training Regimes
mSimCSE can be trained in several settings—unsupervised, with English NLI supervision only, or with cross-lingual supervision:
- Unsupervised: Each positive pair comprises a single English sentence encoded twice with different dropout masks. Neither parallel data nor NLI supervision is required.
- English NLI Supervised: Pairs of entailment sentences from English NLI datasets (MNLI, SNLI) serve as positives; contradictions furnish hard negatives. Training is exclusively on English–English data.
- Cross-lingual NLI Supervised: NLI triplets are sampled from XNLI; for each, the language of the premise and hypothesis is selected at random among the available 15 languages, thereby constructing cross-lingual entailment and contradiction pairs.
- Fully Supervised: English–X parallel pairs from resources such as ParaCrawl (100K pairs per language) fill the positive set, optionally mixed with English NLI examples.
Notably, even models trained with only English sentences (“mSimCSE”), without any cross-lingual data, produce well-aligned multilingual representations.
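The pair construction for the three NLI-free and NLI-based regimes above can be sketched as follows; the `data` dict layout is a hypothetical stand-in for the actual corpora, not the paper's data loader:

```python
import random

def build_training_pairs(regime, data, seed=0):
    """Assemble (anchor, positive, hard_negative) triples per training regime.
    Hard negatives are None in the unsupervised setting."""
    rng = random.Random(seed)
    if regime == "unsupervised":
        # The same sentence serves as its own positive; distinct dropout
        # masks at encoding time differentiate the two views.
        return [(s, s, None) for s in data["english_sentences"]]
    if regime == "english_nli":
        # Entailment hypothesis = positive, contradiction = hard negative.
        return [(prem, ent, con) for prem, ent, con in data["nli_triples"]]
    if regime == "cross_lingual_nli":
        # Premise and hypothesis languages are sampled independently,
        # yielding cross-lingual positives and hard negatives.
        triples = []
        for per_lang in data["xnli_triples"]:  # {lang: (prem, ent, con)}
            langs = list(per_lang)
            lp, lh = rng.choice(langs), rng.choice(langs)
            prem = per_lang[lp][0]
            ent, con = per_lang[lh][1], per_lang[lh][2]
            triples.append((prem, ent, con))
        return triples
    raise ValueError(f"unknown regime: {regime}")
```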
4. Evaluation Methodologies and Datasets
Performance is assessed across a suite of cross-lingual and multilingual benchmarks:
- Cross-lingual Sentence Retrieval: Tatoeba (1,000 parallel pairs per language) using top-1 accuracy; BUCC for bitext mining in four high-resource language pairs (reported via F1).
- Multilingual Semantic Textual Similarity (STS): SemEval-2017, with both monolingual and cross-lingual pairs, reporting Pearson and Spearman correlations of embedding cosine similarity with human annotations.
- Zero-shot Paraphrase Identification: PAWS-X, evaluating cross-lingual transfer by substituting the classifier with nearest-neighbor cosine search, reporting average accuracy over seven languages.
- Unsupervised Clustering: TNews Chinese dataset, evaluating the cluster purity of $k$-means on sentence embeddings.
5. Quantitative Results
mSimCSE demonstrates significant gains over baselines and parity with much more data-intensive approaches, particularly in unsupervised and weakly supervised regimes. Key findings are summarized below.
Table 1. Cross-lingual Sentence Retrieval (BUCC F1 / Tatoeba-14 / Tatoeba-36):
| Setting | Model | BUCC | Tat-14 | Tat-36 |
|---|---|---|---|---|
| Unsupervised | XLM-R (baseline) | 66.0 | 57.6 | 53.4 |
| | mSimCSE | 87.5 | 82.0 | 78.0 |
| English NLI supervised | mSimCSE | 93.6 | 89.9 | 87.7 |
| Cross-lingual NLI supervised | mSimCSE | 94.2 | 90.8 | 88.8 |
| | mSimCSE | 95.2 | 93.2 | 91.4 |
| Fully supervised | LASER | 92.9 | 95.3 | 84.4 |
| | LaBSE | 93.5 | 95.3 | 95.0 |
Unsupervised mSimCSE outperforms prior multilingual embedding baselines (INFOXLM, DuEAM, HiCTL) by substantial margins. English NLI supervision closes most of the gap to fully supervised models such as LASER and LaBSE.
Table 2. Tatoeba (Selected High- and Low-Resource Languages, Accuracy):
| Model / Setting | hi | fr | de | ga | am | sw |
|---|---|---|---|---|---|---|
| mSimCSE (unsup.) | 86.9 | 87.2 | 94.1 | 39.2 | 48.8 | 29.4 |
| mSimCSE (Eng NLI) | 94.4 | 93.9 | 98.6 | 54.8 | 79.5 | 42.1 |
| mSimCSE (XNLI sup.) | 96.2 | 94.8 | 98.8 | 65.1 | 82.4 | 67.8 |
In low-resource languages, unsupervised mSimCSE maintains roughly 30–50% accuracy, and English NLI supervision lifts this further (Table 2), outperforming fully supervised approaches trained on parallel data.
Table 3. Multilingual STS (Spearman's $\rho$, SemEval-2017):
| Model / Setting | ar–ar | ar–en | es–es | es–en | tr–en |
|---|---|---|---|---|---|
| mSimCSE (unsup.) | 72.3 | 48.4 | 83.7 | 57.6 | 53.4 |
| mSimCSE (Eng NLI) | 81.6 | 71.5 | 87.5 | 79.6 | 71.1 |
| mSimCSE | 79.4 | 72.1 | 85.3 | 77.8 | 74.2 |
Table 4. Zero-shot PAWS-X (Average Accuracy):
| Model / Setting | Avg. Accuracy |
|---|---|
| mSimCSE (unsup.) | 88.1 |
| mSimCSE (Eng NLI) | 88.2 |
Table 5. Unsupervised Clustering on TNews (Purity):
| Model / Setting | Purity (%) |
|---|---|
| mSimCSE (unsup.) | 30.3 |
| mSimCSE (Eng NLI) | 40.3 |
| mSimCSE | 41.6 |
6. Analysis: Low-Resource Transfer and Cross-Lingual Supervision
In low-resource languages—such as Irish, Amharic, and Georgian—mSimCSE, even when trained with only English data, achieves non-trivial retrieval accuracies (35–50%), often outperforming fully supervised English–X models trained solely on parallel data. Supplementing English NLI supervision with translated NLI for a specific low-resource target (e.g., Swahili) can yield dramatic accuracy improvements (e.g., from 42% to 75% Tatoeba accuracy with 100K additional sentence pairs).
Adding large quantities of English–French parallel pairs provides only modest additional benefit for BUCC (e.g., +4 points with 5M pairs) and no observable generalization gain to other languages on Tatoeba. This suggests diminishing returns in relying purely on parallel bitext for universal alignment, particularly for unseen or low-resource languages, whereas NLI supervision (especially in the target language) is much more effective for broad, language-agnostic alignment (Wang et al., 2022).
7. Implementation Details and Practical Recommendations
mSimCSE fine-tuning is conducted on XLM-RoBERTa-large with batch size 128, the AdamW optimizer, 0.1 dropout, and a temperature $\tau$ of 0.05. Training requires few epochs (0.5–2 are stable). Data includes 1M English Wikipedia sentences (unsupervised), 500K English NLI tuples (Eng NLI), ~249K cross-lingual XNLI samples, and optionally 100K parallel pairs per language from ParaCrawl.
For users lacking any parallel data, running unsupervised mSimCSE on English Wikipedia yields effective cross-lingual embeddings. Incorporating English NLI (a few hundred thousand pairs) bridges most of the performance gap to fully supervised models. For improved target-language alignment, translated NLI (or modest monolingual entailment data) for the target can be mixed in. mSimCSE's training is robust to hyperparameter variation: batch size (128–256) and epochs (1–3), with similar tolerance for the learning rate. The model can serve as an initializer for further bilingual mining or be deployed directly for cosine retrieval and related downstream tasks (Wang et al., 2022).
8. Significance and Research Implications
mSimCSE demonstrates that contrastive learning on English data alone suffices to produce high-quality universal cross-lingual sentence embeddings, obviating the need for large-scale parallel data and unlocking scalable, low-cost multilingual representation learning. The approach is effective across a wide range of languages and resource levels, is simple to deploy, and is insensitive to many hyperparameter variations. A plausible implication is that cross-lingual transfer is substantially enabled by the inherent language-agnostic capacity of multilingual Transformers, with contrastive objectives serving to unlock shared representations even in the absence of explicit cross-lingual supervision. The codebase is available at https://github.com/yaushian/mSimCSE (Wang et al., 2022).