Monolingual & Cross-lingual Evaluation
- Monolingual and cross-lingual evaluation conditions are setups in which models are tested on the language they were trained on versus on different languages, in order to assess transfer and representational alignment.
- The topic outlines experimental protocols including zero-shot and few-shot transfer, emphasizing the importance of dataset selection, metric design, and representational alignment.
- Comparative results reveal that models excelling in monolingual tasks may underperform in cross-lingual scenarios, underscoring the need for language-agnostic evaluation strategies.
Monolingual and Cross-lingual Evaluation Conditions are foundational to research on multilingual NLP, representation learning, and model transfer, providing principled regimes for assessing how linguistic models capture, preserve, and generalize knowledge within and across languages. These conditions govern experiment design, metrics selection, and interpretation of model competence, directly impacting both intrinsic and extrinsic performance evaluations. Distinguishing and properly operationalizing monolingual and cross-lingual setups is critical for fair benchmarking, transferability analysis, and progress on language-agnostic technologies.
1. Definitional Distinctions and Core Principles
Monolingual evaluation refers to the scenario in which both training and testing occur in the same language. This condition measures a model's ability to capture intra-lingual phenomena such as semantics, syntax, or pragmatics based solely on source-language data. Examples include fine-tuning BERT or RoBERTa on English datasets and evaluating exclusively on English dev/test splits (Chi et al., 2019), or handling English-only MR, SST, and STS tasks in sentence representation settings (Chidambaram et al., 2018).
Cross-lingual evaluation encompasses any scenario where training and testing span more than one language. This includes two dominant paradigms:
- Zero-shot cross-lingual transfer: Training occurs exclusively on labeled data in a source language (typically a resource-rich language such as English), and the resulting model is evaluated directly on test sets in a different (target) language, with no target-language supervised tuning. This setup tests a model's ability to transfer knowledge and maintain performance in multilingual contexts, as in train-on-English, test-on-foreign-language protocols (Chi et al., 2019, Chidambaram et al., 2018, Repo et al., 2021, Upadhyay et al., 2016); a minimal sketch of this setup appears at the end of this section.
- Cross-lingual alignment/intrinsic evaluation: The primary objective is not downstream task generalization, but rather the measurement of representational or semantic convergence, such as cross-lingual word similarity, bilingual lexicon induction, or shared vector space quality (Vulić et al., 2020, Upadhyay et al., 2016, Doval et al., 2018).
Hybrid regimes exist, such as few-shot cross-lingual transfer—augmenting English training with small quantities of target-language labels for incremental adaptation (Chidambaram et al., 2018, Repo et al., 2021).
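As a concrete illustration of the zero-shot transfer regime, the following is a minimal sketch using the Hugging Face transformers and datasets libraries; the choice of xlm-roberta-base, the XNLI data, the German target language, and all hyperparameters are illustrative assumptions rather than a reproduction of any cited experiment.

```python
# Minimal sketch of zero-shot cross-lingual transfer: fine-tune a multilingual
# encoder on English NLI data only, then evaluate directly on another language.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "xlm-roberta-base"  # assumed multilingual encoder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

def encode(batch):
    return tok(batch["premise"], batch["hypothesis"], truncation=True, max_length=128)

train_en = load_dataset("xnli", "en", split="train[:2000]").map(encode, batched=True)
test_de = load_dataset("xnli", "de", split="test").map(encode, batched=True)

def accuracy(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((preds == eval_pred.label_ids).mean())}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=train_en,   # source-language (English) supervision only
    eval_dataset=test_de,     # target-language test set, never seen during training
    compute_metrics=accuracy,
    tokenizer=tok,
)
trainer.train()
print(trainer.evaluate())     # zero-shot accuracy on the German test split
```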
2. Evaluation Protocols and Task Design
Monolingual evaluation protocols are typically constrained to language-internal data. Representative protocols include:
- Training a model (e.g., RoBERTa, BERT) on in-language data, then evaluating on held-out splits in the same language, without exposure to any other languages or parallel corpora (Chi et al., 2019, Gogoulou et al., 2021).
- Intrinsic similarity evaluation (e.g., SimLex-999, CoSimLex, MR, CR) using human-annotated same-language word pairs or sentences (Upadhyay et al., 2016, Chidambaram et al., 2018, Ulčar et al., 2021, Vulić et al., 2020); a minimal sketch follows this list.
- Extrinsic, language-specific downstream tasks, such as classification (GLUE, SentEval), POS tagging, or dialogue state tracking, using only monolingual supervision (Chidambaram et al., 2018, Mrkšić et al., 2017).
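As referenced above, intrinsic similarity evaluation can be sketched compactly: score word pairs by embedding cosine similarity and correlate the resulting ranking with human judgments using Spearman's ρ. The random vectors and the pair format below are illustrative assumptions; a real run would load trained embeddings and an annotated resource such as SimLex-999.

```python
# Sketch of intrinsic monolingual similarity evaluation (SimLex-999 style):
# rank word pairs by embedding cosine similarity and compare the ranking
# against human similarity judgments with Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(embeddings, pairs):
    """pairs: (word1, word2, human_score) triples; out-of-vocabulary pairs are skipped."""
    model_scores, human_scores = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(gold)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Toy usage with random vectors standing in for trained embeddings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["car", "automobile", "banana"]}
pairs = [("car", "automobile", 9.2), ("car", "banana", 0.5), ("automobile", "banana", 0.4)]
print(evaluate_similarity(emb, pairs))
```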
By contrast, cross-lingual protocols require either cross-lingual task design, representational alignment, or transfer:
- Zero-shot transfer: Train a multilingual or aligned model using source-language labels only, apply it directly to target-language test sets (CLS, XNLI, Amazon Reviews, register classification) (Chi et al., 2019, Chidambaram et al., 2018, Repo et al., 2021).
- Intrinsic cross-lingual semantic evaluation: Measure similarity or retrieval quality between concepts or embeddings located in different language spaces (e.g., cross-lingual Multi-SimLex, bilingual dictionary induction) (Vulić et al., 2020, Upadhyay et al., 2016, Doval et al., 2018).
- Cross-lingual document retrieval: Retrieve items in language ℓ' using a query in language ℓ, enforcing that the query and the retrieved item differ in language so that cross-lingual generalization is genuinely tested (Ramponi et al., 28 May 2025); see the filtering sketch after this list.
- Multilingual setups (contrast): “Multilingual” retrieval, as engineered in fact-check claim retrieval (Ramponi et al., 28 May 2025), does not distinguish input–output language pairs and thus can conflate monolingual and cross-lingual matches.
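The distinction drawn in the last two bullets can be made concrete with a small filtering step over query–document language pairs: a strict cross-lingual condition keeps only pairs whose languages differ, whereas a "multilingual" pool mixes monolingual and cross-lingual matches. The record fields and the S@k metric below are a hedged sketch, not the evaluation code of any cited benchmark.

```python
# Sketch of strict cross-lingual retrieval evaluation: keep only query-document
# pairs whose languages differ, rather than pooling monolingual and cross-lingual
# matches together as in a "multilingual" setup.
from dataclasses import dataclass

@dataclass
class RetrievalExample:
    query_lang: str   # language of the query (e.g., "fi")
    doc_lang: str     # language of the gold document (e.g., "en")
    gold_rank: int    # 1-based rank of the gold document in the system output

def success_at_k(examples, k=10):
    """S@k: fraction of queries whose gold document is ranked within the top k."""
    return sum(ex.gold_rank <= k for ex in examples) / len(examples)

runs = [
    RetrievalExample("fi", "en", 3),   # cross-lingual match
    RetrievalExample("fi", "fi", 1),   # monolingual match
    RetrievalExample("pt", "en", 14),  # cross-lingual match, outside the top 10
]

multilingual = runs                                                  # conflates both conditions
crosslingual = [ex for ex in runs if ex.query_lang != ex.doc_lang]   # strict condition

print("multilingual S@10:", success_at_k(multilingual))    # 2/3
print("cross-lingual S@10:", success_at_k(crosslingual))   # 1/2
```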
A key requirement in cross-lingual protocols is controlling for representational alignment, vocabulary mapping, and, in some scenarios, the "curse of multilinguality"—where model capacity is divided among many languages, degrading per-language task optimality (Chi et al., 2019, Vulić et al., 2020).
3. Datasets, Metrics, and Annotation
Evaluation design hinges on dataset selection and metric choice. Standard dataset construction practices and annotation protocols include:
- Monolingual datasets: Balanced coverage of part-of-speech, frequency, lexical field, and semantic similarity (e.g., Multi-SimLex monolingual sets) (Vulić et al., 2020). Benchmarks such as GLUE (Gogoulou et al., 2021), SentEval (Chidambaram et al., 2018), and SimLex-999 (Upadhyay et al., 2016) underpin monolingual lexical and compositional evaluation.
- Cross-lingual datasets: Aligned from monolingual sources by strict translation procedures preserving semantic equivalence, supplemented by systematic filtering for semantic drift (Δ similarity ≤ 1.2 in Multi-SimLex cross-lingual construction) (Vulić et al., 2020). Zero-shot transfer requires parallel test sets (e.g., XNLI, Amazon Reviews, SNLI-X, CLS) (Chi et al., 2019, Chidambaram et al., 2018).
- Annotation protocols: Strong inter-annotator agreement is enforced by adjudication rounds, pairwise rho calculations, and multi-stage outlier removal (Vulić et al., 2020).
- Metrics: Spearman's ρ for similarity ranking (Vulić et al., 2020, Upadhyay et al., 2016, Doval et al., 2018), macro-F₁ for multi-label classification (Repo et al., 2021), accuracy for classification (Chi et al., 2019, Gogoulou et al., 2021), Pearson's r for regression and sentence similarity (Chidambaram et al., 2018, García-Ferrero et al., 2020), precision@k for retrieval (Upadhyay et al., 2016, Doval et al., 2018), joint-goal accuracy for dialogue state tracking (Mrkšić et al., 2017), and S@10 and MRR@10 for claim retrieval (Ramponi et al., 28 May 2025). MQM frameworks with major/minor error distinctions are adopted for human MT evaluation (Picinini et al., 10 Apr 2025).
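Two of the metrics listed above can be sketched directly under their common definitions (reciprocal rank truncated at k; unweighted mean of per-label F₁); these are generic implementations, not the exact scoring scripts of the cited papers.

```python
# Sketches of MRR@k for retrieval and macro-F1 for classification,
# following the standard definitions of both metrics.
from collections import defaultdict

def mrr_at_k(gold_ranks, k=10):
    """Mean reciprocal rank truncated at k; ranks are 1-based, None = not retrieved."""
    total = 0.0
    for rank in gold_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(gold_ranks)

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over the labels present in the gold standard."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f1s = []
    for label in set(gold):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

print(mrr_at_k([1, 4, None, 12]))                                    # 0.3125
print(macro_f1(["NA", "IN", "NA", "OP"], ["NA", "NA", "NA", "OP"]))  # 0.6
```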
4. Model Formulations and Evaluation Regime Impact
Experimental regimes dictate model setup, learning objectives, and knowledge transfer formulations:
- Monolingual fine-tuning: Cross-entropy or task-specific objectives applied to purely in-language data, with performance reported on in-language test splits (Chi et al., 2019, Gogoulou et al., 2021).
- Cross-lingual fine-tuning: Multilingual models (e.g., mBERT, XLM-R) are fine-tuned on source-language data and evaluated on target-language tasks in zero-shot mode; dual-encoder models extend this logic to multiple language pairs (Chidambaram et al., 2018, Repo et al., 2021, Chi et al., 2019).
- Knowledge transfer strategies: MonoX-Kd knowledge distillation uses monolingual teachers to inject “stronger” decision boundaries into multilingual students using only source-language data; pseudo-labeling (MonoX-Pl) augments training with teacher-generated labels on in-language unlabeled data (Chi et al., 2019). A generic distillation sketch appears at the end of this section.
- Cross-lingual alignment methods: Orthogonal Procrustes, VecMap, MUSE, and adversarial GANs align independently trained monolingual spaces, varying in reliance on seed lexicons and self-learning (Upadhyay et al., 2016, Doval et al., 2019, Doval et al., 2018, Ulčar et al., 2021). Attract-Repel imposes cross-lingual semantic constraints to force joint embedding spaces (Mrkšić et al., 2017).
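Of the listed alignment methods, Orthogonal Procrustes is compact enough to sketch directly: given embeddings for a seed dictionary of translation pairs, the closed-form SVD solution yields the orthogonal map that best rotates the source space onto the target space. The toy dimensions and synthetic "target space" below are illustrative assumptions.

```python
# Sketch of Orthogonal Procrustes alignment between two independently trained
# monolingual embedding spaces, using a seed dictionary of translation pairs.
import numpy as np

def procrustes_align(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    src, tgt: (n_pairs, dim) embedding matrices whose i-th rows correspond to
    the two sides of a seed translation pair (e.g., "dog" / "perro").
    """
    u, _, vt = np.linalg.svd(src.T @ tgt)   # SVD of the cross-covariance matrix
    return u @ vt

# Toy usage: the "target space" is a rotated copy of the source space.
rng = np.random.default_rng(0)
src_vecs = rng.normal(size=(5, 300))               # seed-pair source embeddings
rotation = np.linalg.qr(rng.normal(size=(300, 300)))[0]
tgt_vecs = src_vecs @ rotation                     # seed-pair target embeddings

W = procrustes_align(src_vecs, tgt_vecs)
print(np.abs(src_vecs @ W - tgt_vecs).max())       # ~0: seed pairs are aligned
```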
Choice of evaluation condition fundamentally alters reported performance. For instance, monolingual BERT models consistently outperform massively multilingual BERTs in same-language setups, yet the best cross-lingual results are typically achieved by models with targeted multilingual pretraining or explicit alignment (Ulčar et al., 2021, Chi et al., 2019, Chidambaram et al., 2018).
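The distillation strategy mentioned in the list above can be illustrated with a generic teacher–student objective on source-language data; the loss weighting, temperature, and toy batch below are hedged assumptions, not the MonoX-Kd formulation or hyperparameters.

```python
# Hedged sketch of monolingual-teacher -> multilingual-student distillation on
# source-language data only: mix the hard-label loss with a KL term against
# the teacher's softened output distribution.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Combine hard-label cross-entropy with KL divergence to the teacher's soft targets."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * hard + (1.0 - alpha) * soft

# Toy usage: a batch of 4 source-language examples with 3 classes.
student_logits = torch.randn(4, 3, requires_grad=True)   # multilingual student
teacher_logits = torch.randn(4, 3)                        # frozen monolingual teacher
labels = torch.tensor([0, 2, 1, 0])
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()                                           # gradients reach the student only
print(float(loss))
```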
5. Comparative Results and Analysis
Empirical comparisons of monolingual versus cross-lingual conditions reveal several consistent patterns:
| Setting | Metric | Monolingual Best | Cross-lingual Best | Gap/Observation |
|---|---|---|---|---|
| Sentiment/CLS | Accuracy (%) | RoBERTa 355M: 95.77 | MonoX-Kd: 85.05 | ~10-pt drop in zero-shot |
| XNLI | Accuracy (%) | RoBERTa 355M: 89.24 | MonoX-Kd: 62.4 | ~27-pt drop in zero-shot |
| Register classification (Repo et al., 2021) | Macro-F₁ | XLM-R (Sv–Sv): 83.04 | XLM-R (En→Sv): 69.22 | Zero-shot Fr/Sv shortfall of 7–10 pts |
| Multi-SimLex | Spearman's ρ | Monolingual: 0.65 | Cross-lingual: 0.45–0.70 | Unsupervised alignment fails for distant/low-resource pairs |
| Claim retrieval (Ramponi et al., 28 May 2025) | S@10 | 0.8324 | 0.7283 | LLM-based re-ranking narrows drop to ~10 pp |
| MT human evaluation | Mean rating | 3.7 (monolingual) | 3.8 (bilingual) | ≈98% of errors detected monolingually |
Key findings:
- Monolingual “upper bounds” markedly outperform zero-shot cross-lingual evaluations; however, modern pretraining and knowledge transfer (MonoX-Kd/Pl, few-shot, LLM rerank) substantially narrow these gaps (Chi et al., 2019, Chidambaram et al., 2018, Ramponi et al., 28 May 2025).
- Cross-lingual transfer is most effective for semantically similar languages or high-quality/large parallel resources; performance degrades sharply in low-resource or typologically distant pairings, especially in unsupervised settings (Doval et al., 2019, Upadhyay et al., 2016).
- Explicit cross-lingual constraints and alignment, meta-embedding strategies, and adversarial multi-task learning produce shared spaces adequate for zero-shot generalization, but specialization or hybrid strategies (few-shot or self-training) are often needed to match strong in-language results (Mrkšić et al., 2017, Chidambaram et al., 2018, Ulčar et al., 2021).
6. Human Evaluation: Monolingual vs. Bilingual Conditions
Human evaluation regimes in MT and generative tasks further illustrate the interplay between monolingual and cross-lingual (bilingual) protocols:
- Monolingual human evaluation: Assessors judge only the target-language output (with full document context) for fluency, inferred adequacy, and contextual coherence, using Likert scales and error taxonomies. Approximately 98% of errors, including major ones, are detected without reference to the source, provided sufficient context is available (Picinini et al., 10 Apr 2025).
- Bilingual human evaluation: Access to both source and target enables refined identification of adequacy errors (e.g., dropped modifiers, semantic errors invisible in the monolingual case).
- Monolingual evaluation is faster and more scalable (no source-language expertise required), while bilingual evaluation offers marginally higher error detection at greater annotation cost.
Implication: context-aware monolingual evaluation is viable for most practical settings and may suffice for efficient system assessment across languages, especially as most critical errors are observable in-context (Picinini et al., 10 Apr 2025).
7. Implications, Limitations, and Recommendations
The choice of monolingual versus cross-lingual evaluation conditions is not merely a technicality—it defines the claims that can be made about model generalization, alignment, and transfer strength:
- Rigorous cross-lingual evaluation exposes failures of apparent alignment (semantic drift, typological sensitivity, resource constraints). Strong monolingual results are not reliable predictors of cross-lingual performance (Vulić et al., 2020, Doval et al., 2019).
- Intrinsic and extrinsic task design must leverage datasets, metrics, and evaluation boundaries appropriate to the intended regime, e.g., strict language-pair filtering for cross-lingual retrieval (Ramponi et al., 28 May 2025).
- For fair benchmarking, best practices include multi-factor dataset balancing, careful annotation adjudication, and the use of both monolingual and cross-lingual test sets (Vulić et al., 2020, Repo et al., 2021).
- In dialogue and generative evaluation, adversarial multi-task models outperform monolingual baselines by learning language-invariant features (Tong et al., 2018).
- For LLMs, explicit persona-prompting modulates performance substantially between monolingual and bilingual conditions, affecting both outputs and internal representations. Standardized reporting of linguistic conditioning in evaluation protocols is recommended (Yuan et al., 4 Aug 2025).
- Fully unsupervised cross-lingual alignment methods are brittle; mild bilingual supervision stabilizes results. Post-processing and meta-embedding further boost robustness (Doval et al., 2019, García-Ferrero et al., 2020).
In sum, carefully delineated monolingual and cross-lingual evaluation conditions are necessary for the principled development, analysis, and deployment of multilingual language technologies. The convergence of task design, dataset construction, representational alignment, and evaluation boundary defines both the limits and potential of models to function as universal—rather than just monolingual—reasoners.