
Monolingual & Cross-lingual Evaluation

Updated 15 February 2026
  • Monolingual and cross-lingual evaluation conditions are setups in which models are tested on data in the same language they were trained on, or across languages, to assess transfer and alignment.
  • The topic outlines experimental protocols including zero-shot and few-shot transfer, emphasizing the importance of dataset selection, metric design, and representational alignment.
  • Comparative results reveal that models excelling in monolingual tasks may underperform in cross-lingual scenarios, underscoring the need for language-agnostic evaluation strategies.

Monolingual and Cross-lingual Evaluation Conditions are foundational to research on multilingual NLP, representation learning, and model transfer, providing principled regimes for assessing how linguistic models capture, preserve, and generalize knowledge within and across languages. These conditions govern experiment design, metric selection, and interpretation of model competence, directly impacting both intrinsic and extrinsic performance evaluations. Distinguishing and properly operationalizing monolingual and cross-lingual setups is critical for fair benchmarking, transferability analysis, and progress on language-agnostic technologies.

1. Definitional Distinctions and Core Principles

Monolingual evaluation refers to the scenario in which both training and testing occur in the same language. This condition measures a model's ability to capture intra-lingual phenomena such as semantics, syntax, or pragmatics based solely on source-language data. Examples include fine-tuning BERT or RoBERTa on English datasets and evaluating exclusively on English dev/test splits (Chi et al., 2019), or handling English-only MR, SST, and STS tasks in sentence representation settings (Chidambaram et al., 2018).

Cross-lingual evaluation encompasses any scenario where training and testing span more than one language. This includes two dominant paradigms:

  • Zero-shot cross-lingual transfer: Training occurs exclusively on labeled data in a source language (typically a resource-rich language such as English), and the resulting model is evaluated directly on test sets in a different (target) language, with no target-language supervised tuning. This setup tests a model's ability to transfer knowledge and maintain performance in multilingual contexts, as in train-on-English, test-on-foreign-language protocols (Chi et al., 2019, Chidambaram et al., 2018, Repo et al., 2021, Upadhyay et al., 2016).
  • Cross-lingual alignment/intrinsic evaluation: The primary objective is not downstream task generalization, but rather the measurement of representational or semantic convergence, such as cross-lingual word similarity, bilingual lexicon induction, or shared vector space quality (Vulić et al., 2020, Upadhyay et al., 2016, Doval et al., 2018).

Hybrid regimes exist, such as few-shot cross-lingual transfer—augmenting English training with small quantities of target-language labels for incremental adaptation (Chidambaram et al., 2018, Repo et al., 2021).
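The zero-shot protocol above can be sketched with a toy stand-in for a shared multilingual embedding space. The synthetic data, dimensions, and the linear classifier are illustrative assumptions, not any cited paper's actual setup: a classifier is fit on source-language features only, then applied unchanged to imperfectly aligned target-language features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy "shared multilingual space": both languages embed into the same 8-dim
# space; the target language is a mildly rotated, noisy copy of the source
# (simulating imperfect cross-lingual alignment).
n, d = 200, 8
labels = rng.integers(0, 2, size=n)
src = rng.normal(size=(n, d)) + labels[:, None]      # source-language features
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))         # small random rotation
tgt = src @ (0.9 * np.eye(d) + 0.1 * Q) + 0.1 * rng.normal(size=(n, d))

# Zero-shot protocol: fit on source-language labeled data only...
clf = LogisticRegression(max_iter=1000).fit(src, labels)
# ...then evaluate directly on the target language, with no target supervision.
mono_acc = clf.score(src, labels)
xling_acc = clf.score(tgt, labels)
print(f"monolingual acc: {mono_acc:.2f}, zero-shot cross-lingual acc: {xling_acc:.2f}")
```

The gap between the two scores is the quantity that zero-shot transfer evaluation reports; a few-shot regime would additionally fit on a small labeled subset of `tgt` before scoring.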

2. Evaluation Protocols and Task Design

Monolingual evaluation protocols are typically constrained to language-internal data: training, tuning, and testing all draw on splits from a single language, as in the English-only fine-tuning setups described above.

By contrast, cross-lingual protocols require either cross-lingual task design, representational alignment, or transfer:

  • Zero-shot transfer: Train a multilingual or aligned model using source-language labels only, apply it directly to target-language test sets (CLS, XNLI, Amazon Reviews, register classification) (Chi et al., 2019, Chidambaram et al., 2018, Repo et al., 2021).
  • Intrinsic cross-lingual semantic evaluation: Measure similarity or retrieval quality between concepts or embeddings located in different language spaces (e.g., cross-lingual Multi-SimLex, bilingual dictionary induction) (Vulić et al., 2020, Upadhyay et al., 2016, Doval et al., 2018).
  • Cross-lingual document retrieval: Retrieve items in language ℓ' using a query in language ℓ, enforcing that retrieval and query must differ in language to truly test cross-lingual generalization (Ramponi et al., 28 May 2025).
  • Multilingual setups (contrast): “Multilingual” retrieval, as engineered in fact-check claim retrieval (Ramponi et al., 28 May 2025), does not distinguish input–output language pairs and thus can conflate monolingual and cross-lingual matches.
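The retrieval protocols above reduce to a nearest-neighbor search over a shared vector space, scored with a metric such as success@k. A minimal sketch, assuming cosine-similarity retrieval; the function name `success_at_k` and the toy "translation" data are hypothetical:

```python
import numpy as np

def success_at_k(query_vecs, index_vecs, gold, k=10):
    # Cosine-similarity retrieval: for each query in language l, rank all
    # candidates in language l' and check whether the gold item is in the top k.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    sims = q @ c.T
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = np.any(topk == np.asarray(gold)[:, None], axis=1)
    return float(hits.mean())

# Toy usage: target-language "translations" are slightly perturbed copies of
# the source items, so retrieval should almost always succeed.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16))                        # index in language l'
queries = emb[:20] + 0.05 * rng.normal(size=(20, 16))   # queries in language l
s_at_10 = success_at_k(queries, emb, np.arange(20), k=10)
print(f"S@10 = {s_at_10:.2f}")
```

Enforcing that queries and index items come from different languages, as the strict cross-lingual protocol requires, is a filtering step on the evaluation pairs rather than a change to this scoring function.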

A key requirement in cross-lingual protocols is controlling for representational alignment, vocabulary mapping, and, in some scenarios, the "curse of multilinguality"—where model capacity is divided among many languages, degrading per-language task optimality (Chi et al., 2019, Vulić et al., 2020).

3. Datasets, Metrics, and Annotation

Evaluation design hinges on dataset selection and metric choice: datasets must match the intended regime (in-language splits for monolingual testing; parallel, comparable, or translated data for cross-lingual testing), and metrics (accuracy, macro-F₁, Spearman ρ, success@k) must reflect the phenomenon under test.
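For intrinsic word-similarity benchmarks such as Multi-SimLex, the reported metric is the Spearman rank correlation between human ratings and model similarities. A minimal sketch with made-up ratings and cosine scores (the five data points are illustrative, not from any dataset):

```python
from scipy.stats import spearmanr

# Hypothetical word-similarity entries: mean human rating per word pair,
# and the model's cosine similarity for the same pairs.
human = [6.2, 1.1, 4.8, 0.5, 5.5]       # annotator ratings
model = [0.71, 0.20, 0.55, 0.31, 0.62]  # model cosine similarities

rho, pval = spearmanr(human, model)
print(f"Spearman rho = {rho:.2f}")  # → Spearman rho = 0.90
```

Because Spearman ρ depends only on ranks, it is insensitive to the differing scales of human ratings and cosine similarities, which is why it is the conventional choice for this task.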

4. Model Formulations and Evaluation Regime Impact

Experimental regimes dictate model setup, learning objectives, and knowledge transfer formulations:

  • Monolingual fine-tuning: Cross-entropy or task-specific objectives applied to purely in-language data, with performance reported on in-language test splits (Chi et al., 2019, Gogoulou et al., 2021).
  • Cross-lingual fine-tuning: Multilingual models (e.g., mBERT, XLM-R) are fine-tuned on source-language data and evaluated on target-language tasks in zero-shot mode; dual-encoder models extend this logic to multiple language pairs (Chidambaram et al., 2018, Repo et al., 2021, Chi et al., 2019).
  • Knowledge transfer strategies: MonoX-Kd knowledge distillation uses monolingual teachers to inject “stronger” decision boundaries into multilingual students using only source-language data; pseudo-labeling (MonoX-Pl) augments training with teacher-generated labels on in-language unlabeled data (Chi et al., 2019).
  • Cross-lingual alignment methods: Orthogonal Procrustes, VecMap, MUSE, and adversarial GANs align independently trained monolingual spaces, varying in reliance on seed lexicons and self-learning (Upadhyay et al., 2016, Doval et al., 2019, Doval et al., 2018, Ulčar et al., 2021). Attract-Repel imposes cross-lingual semantic constraints to force joint embedding spaces (Mrkšić et al., 2017).

Choice of evaluation condition fundamentally alters reported performance. For instance, monolingual BERT models consistently outperform massively multilingual BERTs in same-language setups, yet the best cross-lingual results are typically achieved by models with targeted multilingual pretraining or explicit alignment (Ulčar et al., 2021, Chi et al., 2019, Chidambaram et al., 2018).
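The knowledge-distillation strategy above (a monolingual teacher supervising a multilingual student) is typically implemented with a temperature-softened KL objective. A generic Hinton-style sketch, not the exact MonoX-Kd formulation; logits and temperature are illustrative:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # Soften both distributions with temperature T, then take KL(teacher || student);
    # scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)) * T * T)

teacher = np.array([[4.0, 1.0, 0.0]])   # confident monolingual teacher
aligned = np.array([[3.5, 1.2, 0.1]])   # student close to the teacher
off     = np.array([[0.0, 4.0, 1.0]])   # student far from the teacher
loss_aligned = distillation_loss(aligned, teacher)
loss_off = distillation_loss(off, teacher)
print(loss_aligned, loss_off)
```

The loss is small when the student matches the teacher's soft decision boundary and large otherwise; pseudo-labeling (MonoX-Pl) instead hardens the teacher's predictions into training labels on unlabeled in-language data.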

5. Comparative Results and Analysis

Empirical comparisons of monolingual versus cross-lingual conditions reveal several consistent patterns:

| Setting | Metric | Monolingual best | Cross-lingual best | Gap / observation |
|---|---|---|---|---|
| Sentiment/CLS | accuracy (%) | RoBERTa 355M: 95.77 | MonoX-Kd: 85.05 | ~10 pt drop in zero-shot |
| XNLI | accuracy (%) | RoBERTa 355M: 89.24 | MonoX-Kd: 62.4 | ~27 pt drop in zero-shot |
| Register class. [2102] | macro-F₁ | XLM-R (Sv–Sv): 83.04 | XLM-R (En→Sv): 69.22 | zero-shot Fr/Sv shortfall of 7–10 pts |
| Multi-SimLex | Spearman ρ | mono: 0.65 | cross: 0.45–0.70 | unsupervised alignment fails for distant/low-resource pairs |
| Retrieval [2505.22] | S@10 | 0.8324 | 0.7283 | LLM-based re-ranking narrows the drop to ~10 pp |
| MT human eval. | mean rating | 3.7 (mono) | 3.8 (bilingual) | ~98% of errors detected monolingually |

Key finding: across tasks, zero-shot cross-lingual transfer consistently trails monolingual performance, with gaps ranging from roughly 7 to 27 points depending on the task and language pair, while targeted alignment or re-ranking narrows but does not eliminate the gap.

6. Human Evaluation: Monolingual vs. Bilingual Conditions

Human evaluation regimes in MT and generative tasks further illustrate the interplay between monolingual and cross-lingual (bilingual) protocols:

  • Monolingual human evaluation: Assessors judge only the target-language output (with full document context) for fluency, inferred adequacy, and contextual coherence, using Likert scales and error taxonomies. Approximately 98% of errors, including major ones, are detected without access to the source text, provided sufficient context (Picinini et al., 10 Apr 2025).
  • Bilingual human evaluation: Access to both source and target enables refined identification of adequacy errors (e.g., dropped modifiers, semantic errors invisible in the monolingual case).
  • Monolingual evaluation is faster and more scalable (no knowledge of the source language required), while bilingual evaluation offers marginally higher error detection at greater annotation cost.

Implication: context-aware monolingual evaluation is viable for most practical settings and may suffice for efficient system assessment across languages, especially as most critical errors are observable in-context (Picinini et al., 10 Apr 2025).

7. Implications, Limitations, and Recommendations

The choice of monolingual versus cross-lingual evaluation conditions is not merely a technicality—it defines the claims that can be made about model generalization, alignment, and transfer strength:

  • Rigorous cross-lingual evaluation exposes failures of apparent alignment (semantic drift, typological sensitivity, resource constraints). Strong monolingual results are not reliable predictors of cross-lingual performance (Vulić et al., 2020, Doval et al., 2019).
  • Intrinsic and extrinsic task design must leverage datasets, metrics, and evaluation boundaries appropriate to the intended regime, e.g., strict language-pair filtering for cross-lingual retrieval (Ramponi et al., 28 May 2025).
  • For fair benchmarking, best practices include multi-factor dataset balancing, careful annotation adjudication, and the use of both monolingual and cross-lingual test sets (Vulić et al., 2020, Repo et al., 2021).
  • In dialogue and generative evaluation, adversarial multi-task models outperform monolingual baselines by learning language-invariant features (Tong et al., 2018).
  • For LLMs, explicit persona-prompting modulates performance substantially between monolingual and bilingual conditions, affecting both outputs and internal representations. Standardized reporting of linguistic conditioning in evaluation protocols is recommended (Yuan et al., 4 Aug 2025).
  • Fully unsupervised cross-lingual alignment methods are brittle; mild bilingual supervision stabilizes results. Post-processing and meta-embedding further boost robustness (Doval et al., 2019, García-Ferrero et al., 2020).

In sum, carefully delineated monolingual and cross-lingual evaluation conditions are necessary for the principled development, analysis, and deployment of multilingual language technologies. The convergence of task design, dataset construction, representational alignment, and evaluation boundary defines both the limits and potential of models to function as universal—rather than just monolingual—reasoners.
