
Cross-Lingual NLI: Benchmarks & Methods

Updated 22 February 2026
  • Cross-Lingual NLI is a benchmark for evaluating model inference across diverse languages through parallel datasets and professional translation.
  • The methodology compares translate-train, translate-test, and zero-shot transfer, leveraging multilingual encoders and advanced pretraining techniques.
  • Data augmentation, prompt-based learning, and meta-learning strategies address translation artifacts and improve performance on low-resource languages.

Cross-lingual Natural Language Inference (XNLI) is a central benchmark and research challenge for evaluating multilingual language understanding—specifically, a model’s ability to generalize the semantics of natural language inference across typologically diverse languages, including low-resource scenarios. XNLI combines large-scale parallel NLI datasets, diverse transfer methodologies, and interpretable measurement protocols to probe the structure and limitations of cross-lingual generalization, with strong links to advances in multilingual pretraining, machine translation, data curation, and prompt-based transfer.

1. XNLI Dataset Design, Construction, and Variants

The core XNLI dataset was introduced by Conneau et al. as an extension of the Multi-Genre Natural Language Inference (MultiNLI) corpus from English into 14 additional languages (high-resource: fr, es, de, el, bg, ru, tr, ar, vi, th, zh, hi; low-resource: sw, ur) via professional translation. Each NLI example consists of a premise–hypothesis pair labeled as entailment, contradiction, or neutral. For each of 15 languages, the development and test sets contain 7,500 examples, totaling 112,500 labeled pairs. Dev and test are strictly parallel—each example is translated sentence-by-sentence, and gold labels are projected from English under the assumption that the semantic relation is preserved (Conneau et al., 2018).
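The parallel construction and label-projection step described above can be sketched as follows. This is a minimal illustration with field names chosen for clarity, not the official XNLI schema: each English example is translated into every language, and the English gold label is projected onto all translations.

```python
# Minimal sketch of XNLI's parallel dev/test layout (field names are
# illustrative, not the official schema).

LABELS = ["entailment", "contradiction", "neutral"]
LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr",
         "ar", "vi", "th", "zh", "hi", "sw", "ur"]  # 15 languages

def project_labels(english_examples, translations):
    """Build per-language test sets by pairing each translated
    premise/hypothesis with the English gold label (label projection)."""
    datasets = {lang: [] for lang in LANGS}
    for ex in english_examples:
        for lang in LANGS:
            premise, hypothesis = translations[(ex["id"], lang)]
            datasets[lang].append(
                {"premise": premise, "hypothesis": hypothesis,
                 "label": ex["label"]}  # label assumed preserved by translation
            )
    return datasets
```

The assumption encoded in the last comment is exactly the one the analyses below probe: label projection is only valid to the extent translation preserves the semantic relation.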

Analysis reveals that 83–85% of semantic relations are preserved in translation (EN–FR check), but challenges exist for low-resource languages—especially for translations that disrupt NLI-relevant relations (e.g., polarity, idiom, or negation). XNLI 2.0 re-translates the entire corpus with modern neural MT to address systematic artifacts in early versions, yielding a 12% reduction in label disagreement and consistent cross-lingual accuracy gains of ≈3 points on average (Upadhyay et al., 2023). Additional variants such as IndicXNLI cover 11 Indian languages via high-quality machine-translation pipelines with rigorous human evaluation and annotation checks (Aggarwal et al., 2022). Recent extensions, such as XNLIeu (Basque) and myXNLI (Myanmar), combine MT-based translation with professional post-editing or multi-stage, community-based verification to maximize label fidelity and to expand XNLI coverage for truly low-resource languages (Heredia et al., 2024, Htet et al., 13 Apr 2025).

A critical finding is that translation artifacts—lexical divergence, loss of overlap, or meaning shift—disproportionately degrade accuracy in the lowest-resource languages. Comparison of zero-shot accuracy between human-translated (HT) test sets and machine-translated (MT) backtranslations reveals gaps (Δℓ) of up to 10.9 points for Swahili and 10.8 points for Urdu, compared with only ≈2–3 points for high-resource languages (Agrawal et al., 2024). Manual re-annotation for Hindi and Urdu shows only 66.5% and 60% agreement with English gold labels, indicating systematic errors in low-resource test splits.

2. Transfer Methodologies: Zero-Shot, Translate-Train, and Translate-Test

XNLI has catalyzed rigorous comparison of three principal cross-lingual transfer paradigms:

  • Translate-Train: English NLI data is machine-translated into each target language, and a language-specific NLI classifier is trained per language. This approach can yield strong in-language accuracy but requires high-quality MT for each language, which is challenging for low-resource targets (Conneau et al., 2018).
  • Translate-Test: A robust English NLI model is trained and then test-time inference is conducted by translating premise–hypothesis pairs from the target language into English, where they are classified. Translate-test is simple and, as confirmed empirically, produces the highest or near-highest accuracy among traditional baselines for nearly all languages, especially when MT is strong (Agić et al., 2017, Conneau et al., 2018).
  • Zero-Shot Transfer (Encoder-based): Multilingual encoders (XLM, mBERT, XLM-R, Unicoder, MuRIL, IndicBERT, etc.) are trained or fine-tuned only on English labeled data. The model is then directly evaluated on NLI in an arbitrary target language without any target-language supervision. This scenario is the primary focus for both contextual embedding methods and meta-learning transfer (Lample et al., 2019, Huang et al., 2019, Wu et al., 2019, Zhou et al., 2022, Nooralahzadeh et al., 2020).
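The control flow of the three paradigms can be made concrete with a small sketch. The `translate` and classifier functions below are stand-ins (a stub MT system and a trivial predictor), not real components; what matters is where translation happens relative to training and inference in each paradigm.

```python
# Sketch of the three transfer paradigms with stand-in components.
# `translate` and the classifiers are placeholders, not real MT/NLI systems.

def translate(text, src, tgt):
    return f"[{src}->{tgt}] {text}"  # stub MT system

def train_nli(examples):
    """Stand-in for fine-tuning; returns a trivial 'model' callable."""
    return lambda premise, hypothesis: "entailment"  # stub prediction

def translate_train(en_train, tgt_lang):
    """Translate English training data into the target language,
    then train one classifier per target language."""
    tgt_train = [(translate(p, "en", tgt_lang),
                  translate(h, "en", tgt_lang), y) for p, h, y in en_train]
    return train_nli(tgt_train)

def translate_test(en_model, premise, hypothesis, src_lang):
    """Translate the target-language test pair into English and
    classify with a single English model."""
    return en_model(translate(premise, src_lang, "en"),
                    translate(hypothesis, src_lang, "en"))

def zero_shot(multilingual_model, premise, hypothesis):
    """Apply an English-fine-tuned multilingual encoder directly,
    with no target-language supervision or test-time MT."""
    return multilingual_model(premise, hypothesis)
```

Note the cost asymmetry the sketch exposes: translate-train needs MT and a training run per language, translate-test needs MT at every inference call, and zero-shot needs neither but leans entirely on the multilingual encoder.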

Aligned sentence-embedding baselines (e.g., X-CBOW, X-BiLSTM) leverage parallel corpora to align target and source encoders via distance-based or contrastive loss functions; these methods close some of the gap with translation-based approaches for resource-rich languages but lag for low-resource languages with little bitext (Conneau et al., 2018, Agić et al., 2017).

For new language additions (e.g. Basque, Myanmar), translate-train with post-edited MT data delivers the strongest results, especially when train/test origin matches. However, on natively constructed test sets, the translate-train versus zero-shot gap shrinks, and potential surface artifacts are reduced (Heredia et al., 2024, Htet et al., 13 Apr 2025).

3. Multilingual Encoder Pretraining for Cross-Lingual NLI

Progress in cross-lingual NLI is driven above all by advances in multilingual pretraining, most notably Masked Language Modeling (MLM) over shared subword vocabularies (BERT, mBERT, XLM, XLM-R, Unicoder, MuRIL). In essence, these models leverage large unlabeled corpora $C$, optimizing masked token recovery. The MLM loss is: $\mathcal{L}_{\mathrm{MLM}}(\theta) = \mathbb{E}_{x\sim C}\left[ -\sum_{i\in \mathcal{M}(x)} \log P_\theta(x_i \mid x_{\setminus \mathcal{M}(x)}) \right]$ where $\mathcal{M}(x)$ denotes the masked token positions (Lample et al., 2019, Huang et al., 2019).

Cross-lingual alignment can be further enhanced by exploiting parallel corpora:

  • Translation Language Modeling (TLM): Both source and target sentences (parallel pairs) are concatenated, and joint MLM is performed over random masked positions, permitting the encoder to attend across languages and directly reinforce cross-lingual token alignment (Lample et al., 2019): $\mathcal{L}_{\mathrm{TLM}}(\theta) = \mathbb{E}_{(x^{(s)},x^{(t)}) \sim \text{parallel}} \left[ -\sum_{i\in \mathcal{M}} \log P_\theta(u_i \mid u_{\setminus \mathcal{M}}) \right]$ with $u=[x^{(s)};x^{(t)}]$.
  • Unicoder introduces additional objectives: cross-lingual word recovery (enforcing token-level alignment), cross-lingual paraphrase classification (sentence-level), and cross-lingual MLM over code-switched contexts, yielding robust gains of +0.7% average accuracy over XLM (Huang et al., 2019).
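The MLM and TLM losses above reduce to cross-entropy summed over masked positions only; TLM differs solely in operating over the concatenation $u=[x^{(s)};x^{(t)}]$. The toy sketch below makes that relationship explicit, with a fixed probability table standing in for a trained encoder.

```python
import numpy as np

# Toy illustration of the MLM / TLM losses: cross-entropy summed over
# masked positions only. `probs` is a fixed table mapping position ->
# predicted distribution, standing in for a trained encoder.

def masked_lm_loss(token_ids, masked_positions, probs):
    """-sum over i in M(x) of log P(x_i | unmasked context)."""
    return -sum(np.log(probs[i][token_ids[i]]) for i in masked_positions)

def tlm_loss(src_ids, tgt_ids, masked_positions, probs):
    """TLM is MLM over the concatenation u = [x_src; x_tgt], so the
    encoder can attend across the language boundary when predicting."""
    u = src_ids + tgt_ids
    return masked_lm_loss(u, masked_positions, probs)
```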

Evaluation protocols standardize on fine-tuning only on English NLI, then predicting directly in all target languages. On XNLI, XLM-R or Unicoder achieve 77.8–78.5% average accuracy (14 non-English languages), outperforming both mBERT and previous baselines (Lample et al., 2019, Huang et al., 2019).

In-depth error analyses reveal that massive pretraining yields robust transfer for morphologically rich and syntactically divergent languages, but persistent gaps remain for typologically distant and low-resource languages (sw, ur) (Huang et al., 2019, Aggarwal et al., 2022).

4. Data Augmentation, Prompt-based, and Meta-Learning Approaches

Multiple recent studies address persistent low-resource accuracy gaps by augmenting the XNLI paradigm:

  • XLDA (Cross-Lingual Data Augmentation): “Mixed” training examples are constructed by substituting either the premise or the hypothesis of a pair with its translation into another language, producing code-mixed input. Shuffling these examples into the original training data yields gains of up to +4.8% accuracy for lower-resource languages (sw, hi, ur), and the method is robust to translation quality (Singh et al., 2019).
  • Prompt Learning and Prompt-based Augmentation: Prompt-based methods (e.g., Universal Prompting, Dual Prompt Augmentation, Soft Prompting with Multilingual Verbalizer, Multilingual Prompt Translator) bypass the need for discrete template translation and instead use language-agnostic or learned soft prompts coupled with multilingual verbalizer mappings. Dual Prompt Augmentation combines answer-side (multilingual label tokens) and hidden-state input augmentations to optimize robustness in low-shot settings, with state-of-the-art few-shot improvements of up to +11 points over vanilla fine-tuning (Zhou et al., 2022, Li et al., 2023, Qiu et al., 2024).
  • Meta-Learning (X-MAML): Cross-lingual model-agnostic meta-learning operates by simulating rapid adaptation within low-resource auxiliary language subsets and seeks to “meta-optimize” the initialization for robust transfer. X-MAML + two auxiliaries raises average zero-shot accuracy by up to 3.7 points versus vanilla multi-BERT, with maximum gains for typologically similar pairs and clear links to language-internal features (e.g., locus of marking) (Nooralahzadeh et al., 2020).
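The XLDA substitution step is simple enough to sketch directly. In this hedged illustration, `translations` is assumed to map `(text, lang)` pairs to translated text (produced offline by an MT system); the function names and interface are chosen for clarity rather than taken from the original work.

```python
import random

# Sketch of XLDA-style code-mixed augmentation: replace either the
# premise or the hypothesis with its translation into another language,
# keeping the NLI label fixed. `translations` maps (text, lang) -> text
# and is assumed to come from an external MT system.

def xlda_augment(examples, translations, langs, seed=0):
    """For each (premise, hypothesis, label) triple, emit a code-mixed
    copy in which one side is swapped into a randomly chosen language."""
    rng = random.Random(seed)
    augmented = []
    for premise, hypothesis, label in examples:
        lang = rng.choice(langs)
        if rng.random() < 0.5:
            premise = translations[(premise, lang)]
        else:
            hypothesis = translations[(hypothesis, lang)]
        augmented.append((premise, hypothesis, label))
    return examples + augmented  # train on original + mixed pairs
```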

Multiple works confirm that joint or sequential fine-tuning on high-resource plus moderate-quality monolingual data in the target language yields further 2–3% gains, especially for low-resource Indic and East Asian languages (Aggarwal et al., 2022, Hu et al., 2021).

5. Translation Artifacts, Evaluation Challenges, and Dataset Quality

XNLI data construction via independent translation of premise and hypothesis introduces systematic translation artifacts:

  • Lexical Overlap Drop: Independent translation reduces word overlap between premise/hypothesis, undermining surface-based inference models heavily reliant on lexical matching. For entailment pairs in particular, overlap is artificially deflated in the test set, causing a drop in model accuracy, and shifting the class-conditional distribution of overlap features (Artetxe et al., 2020).
  • Measuring Artifact Impact: Back-translation of English training sets through pivot languages to create “pseudo-translated” English training pairs, then matching the test set’s overlap shift during training, recovers up to +4.3 points (translate-test) and +2.8 points (zero-shot) in XNLI accuracy (Artetxe et al., 2020).
  • Translation Quality Gaps: Systematic analysis shows that low-resource languages in XNLI (sw, ur, hi) exhibit larger gaps (Δℓ) between human- and machine-translation test accuracy (up to 8–11 points), indicating severe label misalignments and translation-induced bias. Manual reannotation for Hindi/Urdu yields only 60–66% agreement with English labels (Agrawal et al., 2024). Conversely, automatic translation post-editing and manual verification—demonstrated in XNLIeu and myXNLI—produce consistent +3–6% gains and reduce semantic artifacts (Heredia et al., 2024, Htet et al., 13 Apr 2025).
  • Recommendations: New benchmarks should integrate systematic QC (Δℓ measurement), round-trip translation checks, targeted reannotation, and bias-aware annotation protocols to ensure high-quality cross-lingual evaluation—especially for low-resource languages (Agrawal et al., 2024).
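The lexical-overlap diagnostic underlying the first two points can be sketched as a simple Jaccard statistic, computed per label class. This is an illustrative metric choice (token-set Jaccard over whitespace tokens), not the exact measure used in the cited work.

```python
# Sketch of a lexical-overlap diagnostic: Jaccard overlap between the
# premise and hypothesis token sets. Independent translation of premise
# and hypothesis tends to deflate this value, especially for entailment
# pairs, shifting the class-conditional overlap distribution.

def lexical_overlap(premise, hypothesis):
    p = set(premise.lower().split())
    h = set(hypothesis.lower().split())
    return len(p & h) / len(p | h) if p | h else 0.0

def mean_overlap_by_label(examples):
    """Class-conditional mean overlap over (premise, hypothesis, label)
    triples, useful for spotting translation-induced overlap shifts."""
    sums, counts = {}, {}
    for premise, hypothesis, label in examples:
        sums[label] = sums.get(label, 0.0) + lexical_overlap(premise, hypothesis)
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}
```

Comparing these per-label means between an original English test set and its translated counterpart is one concrete way to quantify the overlap-drop artifact before drawing accuracy conclusions.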

6. Low-Resource Languages, Regional Benchmarks, and Best Practices

Efforts to extend XNLI beyond the initial 15 languages have focused on Indic languages (IndicXNLI), Basque (XNLIeu), Myanmar (myXNLI), and others. Best practices emerging from these works include:

  • High-Quality MT and Post-Editing: MT-derived training/dev/test sets combined with professional post-editing or community review achieve the highest possible label fidelity with limited resources (Heredia et al., 2024, Htet et al., 13 Apr 2025).
  • Sequential and Layered Fine-Tuning: For MuRIL and XLM-R, sequential fine-tuning, first on English NLI and then on translated or natively annotated low-resource data, yields the best transfer—boosting average accuracy by 2–3% over English-only fine-tuning (Aggarwal et al., 2022, Hu et al., 2021).
  • Code-Switching and Mixed-Input Augmentation: Cross-lingual augmentation (XLDA, code-switched prompting) is highly effective for low-resource targets, creating mixed language input which compels models to use deeper semantics (Singh et al., 2019, Li et al., 2023).
  • Prompting and Consistency Objectives: Prompt-based alignment, consistency regularization, and code-mixed augmentation enforce cross-lingual semantic invariance and are particularly advantageous in the few-shot and low-resource settings (Li et al., 2023, Qiu et al., 2024).
  • Genre Metadata and Parallelism: For languages like Myanmar, including genre tokens and exploiting the parallel structure of the dataset (en–my, my–en, my–my, en–en examples) further increases accuracy by +5–6 points (Htet et al., 13 Apr 2025).
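The sequential fine-tuning recipe above is essentially an ordering constraint on training stages. The sketch below records that ordering with a stand-in trainer (no real weight updates), purely to make the two-stage structure explicit; the function names are assumptions for illustration.

```python
# Sketch of the sequential fine-tuning recipe: fine-tune first on
# English NLI, then continue on translated or natively annotated
# target-language data. `fine_tune` is a stand-in for a real training
# loop and simply records which datasets the model has seen, in order.

def fine_tune(model_state, dataset, tag):
    """Stand-in trainer: appends (dataset tag, size) to the state."""
    return model_state + [(tag, len(dataset))]

def sequential_fine_tune(en_nli, target_nli, target_tag):
    state = fine_tune([], en_nli, "en")               # stage 1: English NLI
    state = fine_tune(state, target_nli, target_tag)  # stage 2: target data
    return state
```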

7. Open Challenges and Future Directions

Persistent gaps between high-resource and low-resource languages in XNLI highlight fundamental challenges in multilingual representation learning:

  • Despite advances in corpus curation, translation artifacts and label projection errors remain a core limitation for truly robust cross-lingual evaluation (Agrawal et al., 2024).
  • Modern prompt-based and meta-learning methods outperform traditional fine-tuning, but their gains are bounded by the intrinsic quality of the source–target mapping and the representational depth of the underlying encoder (Zhou et al., 2022, Nooralahzadeh et al., 2020).
  • Language-specific neuron manipulation in LLMs does not independently improve XNLI transfer, highlighting the strongly entangled nature of multilingual representations (Mondal et al., 21 Mar 2025).
  • Evaluation on natively constructed test sets—devoid of translation artifacts—reduces the translate-train vs zero-shot gap and exposes models' true cross-lingual generalization (Heredia et al., 2024).
  • Best practices recommend human validation, rigorous error analysis, structured data augmentation, and, where feasible, accumulation of small natively-annotated NLI data for each language, rather than exclusive reliance on machine-translated resources (Hu et al., 2021, Aggarwal et al., 2022).

Ongoing and future research involves expanding XNLI to additional languages, integrating more robust QC into dataset construction and evaluation, leveraging more expressive pretraining objectives and adapters, and developing analytic tools to monitor and better understand cross-lingual generalization and its failure modes at scale.
