Multilingual Counterfactuals
- Multilingual counterfactuals are minimally modified examples that flip model predictions to enable precise evaluation of factual knowledge, bias, and fairness.
- Researchers employ both direct LLM-based generation and translation-based methods to construct counterfactual datasets across diverse languages.
- Empirical studies highlight significant model gaps in non-English languages, driving advances in data augmentation, bias mitigation, and cross-lingual evaluation.
A multilingual counterfactual is an input—minimally modified from an original example in a given language—that results in a change to a model’s prediction, supporting precise measurement of factual knowledge, bias, model explanations, or generalization capacity across languages. Research on multilingual counterfactuals spans their generation, detection, and use as evaluation or augmentation tools in a variety of NLP settings, including factual knowledge probing, fairness audits, explanation frameworks, domain transfer, and model debiasing. Recent large-scale datasets, rigorous counterfactual generation protocols, and evaluation methodologies have enabled systematic analysis of multilingual LLMs' behavior, revealing pervasive gaps in their factuality, robustness, and fairness across linguistic and demographic axes.
1. Formal Definitions and Core Concepts
A counterfactual example $x'$ for an original input $x$ (in language $\ell$) must:
- Be minimally edited from $x$
- Cause a model $f$ to flip its prediction: $f(x) = y$ but $f(x') \neq y$
- Maintain grammaticality and semantic coherence in $\ell$
In factual knowledge evaluation, true $(s, r, o)$ triples are complemented by counterfactuals $(s, r, o')$, where $o' \neq o$ and the triple $(s, r, o')$ is known to be false in the real world (2305.13675). For classification and detection problems—such as sentiment, bias, or acronym disambiguation—counterfactuals perturb a demographic attribute, class label, or context, subject to the minimality and validity constraints (Wang et al., 1 Jan 2026, Goldfarb-Tarrant et al., 2023, Weng et al., 2021).
In detection tasks, the label space is typically binary (counterfactual/non-counterfactual), and robust annotation processes are required due to the rarity and language-specificity of natural counterfactuals (O'Neill et al., 2021). Linguistic patterns for counterfactuality are highly variable and morphosyntactically marked—especially in non-English languages.
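These constraints translate directly into a validity check. Below is a minimal sketch, assuming a generic classifier `f` (any callable from text to label) and a character-level edit ratio as an illustrative stand-in for minimality; the function and threshold are ours, not from the cited papers, and grammaticality and coherence in the target language still require a separate human or LM-based check.

```python
from difflib import SequenceMatcher

def is_valid_counterfactual(f, x, x_prime, max_edit_ratio=0.3):
    """Check the core constraints: non-identity, label flip, and minimal edit.

    `f` is any callable mapping a text to a predicted label; the edit-ratio
    threshold is an illustrative proxy for minimality.
    """
    if x_prime == x:              # must actually be edited
        return False
    if f(x_prime) == f(x):        # must flip the model's prediction
        return False
    # 1 - similarity ratio approximates the fraction of edited characters
    edit_ratio = 1.0 - SequenceMatcher(None, x, x_prime).ratio()
    return edit_ratio <= max_edit_ratio

# Toy usage with a trivial keyword classifier
toy_model = lambda s: "positive" if "good" in s else "negative"
print(is_valid_counterfactual(toy_model, "a good film", "a bad film"))  # True
```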
2. Dataset Construction and Multilingual Generation Strategies
Large-scale multilingual counterfactual datasets are built by generating and validating (true, false) fact pairs or by minimally perturbing source examples and translating or adapting them into multiple target languages. Two principal strategies dominate:
- Direct Generation (DG-CF): LLMs are prompted directly in the target language with chain-of-thought or minimal-edit instructions (Wang et al., 1 Jan 2026).
- Translation-Based Generation (TB-CF): Counterfactuals are first generated in English, then translated into target languages, either via machine translation or by prompting LLMs to act as high-quality translators; both strategies are sketched below.
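The two strategies differ mainly in where the LLM operates. The following minimal sketch of both pipelines uses a hypothetical `call_llm` helper and illustrative prompt wording, standing in for whichever LLM API is used; these are not the exact prompts of the cited work.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a chat/completion call to whatever LLM backend is used."""
    raise NotImplementedError

def dg_cf(text: str, label: str, target_label: str, lang: str) -> str:
    """DG-CF: prompt the LLM directly in the target language."""
    prompt = (
        f"Rewrite the following {lang} text with as few edits as possible "
        f"so that its label changes from '{label}' to '{target_label}'. "
        f"Answer in {lang}.\n\nText: {text}"
    )
    return call_llm(prompt)

def tb_cf(text_en: str, label: str, target_label: str, lang: str) -> str:
    """TB-CF: generate the counterfactual in English, then translate it."""
    cf_en = dg_cf(text_en, label, target_label, "English")
    return call_llm(f"Translate the following text into {lang}:\n\n{cf_en}")
```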
For factual recall probing, facts from Wikidata/T-REx are translated using API-based machine translation (Google Translate) and aligned such that the cloze prompt format is preserved for cross-model comparability (2305.13675). Sampling heuristics are used to select plausible but false objects, with automated and manual validation to ensure non-triviality and semantic/grammatical correctness.
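One simple way to realize the plausible-but-false sampling heuristic is to draw replacement objects from those observed with the same relation, which preserves type plausibility while guaranteeing falsity against the known triples. A minimal sketch under that assumption (the in-memory triple store is illustrative, not the actual pipeline of 2305.13675):

```python
import random
from collections import defaultdict

def make_false_object_sampler(triples):
    """triples: iterable of (subject, relation, object) strings known to be true."""
    objects_by_relation = defaultdict(set)
    true_objects = defaultdict(set)
    for s, r, o in triples:
        objects_by_relation[r].add(o)
        true_objects[(s, r)].add(o)

    def sample_false(s, r, rng=random):
        # Candidates share the relation (hence a plausible type) but are false for (s, r).
        candidates = sorted(objects_by_relation[r] - true_objects[(s, r)])
        return rng.choice(candidates) if candidates else None

    return sample_false

triples = [("Paris", "capital_of", "France"),
           ("Rome", "capital_of", "Italy"),
           ("Berlin", "capital_of", "Germany")]
sample_false = make_false_object_sampler(triples)
print(sample_false("Paris", "capital_of"))  # "Germany" or "Italy", never "France"
```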
For English, German, Japanese, and other languages, counterfactual detection datasets rely on professional annotations and morphosyntactic adaptation to capture the full range of conditional, modal, and wish/irrealis constructions (O'Neill et al., 2021). Clue-phrase filtering, semantic similarity sampling, and morphologically informed template extension are employed to ensure both coverage and precision.
The table below summarizes representative multilingual datasets and their focus:
| Dataset / Paper | Languages | Construction Approach |
|---|---|---|
| Polyglot or Not? (2305.13675) | 20 (bg, ca, …) | Human+MT, factual triples |
| Multilingual CFD (O'Neill et al., 2021) | EN, DE, JA | Human annotation |
| Sentiment bias (Goldfarb-Tarrant et al., 2023) | JA, ZH, ES, DE | Template+linguist adaptation |
| Multilingual Counterfactuals (Wang et al., 1 Jan 2026) | EN, AR, DE, ES, HI, SW | LLM-based direct/translated |
| ADBCMM (Weng et al., 2021) | FR, ES, EN | Context mixing (counterfactual-by-foreign-sampling) |
3. Evaluation Metrics and Methodologies
Multilingual counterfactual evaluation employs rigorous, model-agnostic protocols. Core metrics include:
- Contrastive Knowledge Assessment (CKA): For factual fill-in tasks, $\mathrm{CKA}(s, r, o) = P_f(o \mid s, r) \,/\, \mathbb{E}_{o'}\left[P_f(o' \mid s, r)\right]$, comparing the model's probability of the true object $o$ against the expected probability of counterfactual objects $o'$; the model is counted as correct if $\mathrm{CKA} > 1$ (2305.13675).
- Label Flip Rate (LFR): $\mathrm{LFR} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[f(x'_i) \neq f(x_i)\right]$, the fraction of generated counterfactuals that flip the model's prediction; measures effectiveness of counterfactual generation, and is implemented in the sketch after this list (Wang et al., 1 Jan 2026).
- Minimality (Textual Similarity, TS): Cosine similarity of contextual embeddings, assessing degree of edit.
- Fluency: Perplexity under a reference multilingual LM.
- Bias and Fairness: Mean signed difference in outputs across paired sentences differing only in sensitive attribute (Goldfarb-Tarrant et al., 2023).
- Robustness to Clue Masking and Selectional Bias: Models are evaluated under clue-phrase masking or domain shift to assess linguistic reliance.
- Generalization: Cross-lingual transfer and augmentation are tested by training on counterfactuals in one language and evaluating in another (O'Neill et al., 2021, Weng et al., 2021).
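Several of these metrics reduce to a few lines of vectorized arithmetic once predictions and embeddings are in hand. A minimal sketch, assuming predictions are precomputed and embeddings come from any sentence encoder (the function names are ours, not from the cited papers):

```python
import numpy as np

def label_flip_rate(orig_preds, cf_preds):
    """LFR: fraction of counterfactuals whose prediction differs from the original's."""
    return float(np.mean(np.asarray(orig_preds) != np.asarray(cf_preds)))

def textual_similarity(orig_embs, cf_embs):
    """TS: mean cosine similarity between paired embeddings (higher = smaller edit)."""
    a, b = np.asarray(orig_embs, float), np.asarray(cf_embs, float)
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(cos))

def mean_signed_bias(scores_a, scores_b):
    """Mean signed score difference across pairs differing only in a sensitive attribute."""
    return float(np.mean(np.asarray(scores_a) - np.asarray(scores_b)))

# Toy usage
print(label_flip_rate([1, 0, 1, 1], [0, 0, 0, 1]))   # 0.5
print(mean_signed_bias([0.9, 0.7], [0.8, 0.5]))       # ~0.15
```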
4. Empirical Findings and Model Analysis
Empirical results reveal systematic gaps in the multilingual reliability of foundation models, LLMs, and classifiers:
- English-centric models such as LLaMA-65B and LLaMA-33B achieve roughly 89% accuracy on English factual recall under counterfactual (CKA-based) evaluation, but accuracy drops below 80% for most other languages, especially non-Latin-script languages and languages with low training-resource density (2305.13675).
- Multilingual counterfactual detection is strongly influenced by language-specific morphosyntactic cues. Lexical approaches fail to generalize when deprived of surface clues; contextualized transformer embeddings (mBERT, XLM-RoBERTa) demonstrate better robustness but still underperform with machine-translated training data, particularly for Japanese (O'Neill et al., 2021).
- Counterfactual-based bias evaluation uncovers higher average bias in Japanese, German, and Spanish sentiment systems than in English (baseline SVM gender bias of +0.6 to +0.8; BERT-based models reduce mean bias by 60–80%), showing that broad pre-training mitigates but does not eliminate demographic bias (Goldfarb-Tarrant et al., 2023).
- Automatic counterfactual generation degrades in validity, minimality, and fluency for low-resource languages; translation-based counterfactuals are more valid for label flips but require heavier edits and compromise text quality. Four error types—copy-paste, negation misapplication, inconsistencies, and language confusion—limit practical performance in generation (Wang et al., 1 Jan 2026).
- Multilingual counterfactual data augmentation (CDA) generally outperforms cross-lingual augmentation for low-resource languages, suggesting that language-matched perturbations are more effective despite the intrinsic noise and artifact rates in synthetic counterfactuals (Wang et al., 1 Jan 2026, Weng et al., 2021).
5. Counterfactual Probing and Model Internals
Advanced counterfactual probing frameworks such as AlterRep formally disentangle language-level information in multilingual transformer models. Linear classifiers trained to discriminate token language identity define directions in embedding space, and counterfactual interventions (null-space projection plus vector addition) allow targeted modification of the language prior within the model:
- Pushing representations toward a target language systematically increases the (masked LM) probability of both translation-equivalent and random words in that language, relative to a control, but does not amplify semantic equivalence (Srinivasan et al., 2023).
- These findings provide evidence that the embedding space partitions into a low-dimensional language-ID component and a more distributed, language-neutral semantic subspace.
This suggests that such linear probing-and-intervention approaches are instrumental for both causal analysis and controlled interventions on multilingual LMs, e.g., for mechanism studies, controlled augmentation, or cross-lingual disentanglement.
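The core intervention is plain linear algebra: remove the hidden state's component along the language-ID probe direction (a null-space projection), then add back a fixed-magnitude component pointing toward the target language. A minimal numpy sketch with a single probe direction `w` follows; AlterRep iterates over several probes, so this single-direction simplification and the scale `alpha` are ours.

```python
import numpy as np

def alterrep_push(h, w, alpha=4.0):
    """Push hidden state(s) h toward the language whose probe direction is w.

    h: (d,) or (n, d) hidden states; w: (d,) weights of a linear language-ID probe.
    Step 1 projects h onto the null space of w; step 2 adds a component of
    magnitude alpha in the +w direction (toward the target language).
    """
    w_unit = w / np.linalg.norm(w)
    h = np.atleast_2d(np.asarray(h, dtype=float))
    h_null = h - np.outer(h @ w_unit, w_unit)   # null-space projection
    return h_null + alpha * w_unit              # counterfactual push

rng = np.random.default_rng(0)
h, w = rng.normal(size=768), rng.normal(size=768)
pushed = alterrep_push(h, w)
print(pushed @ (w / np.linalg.norm(w)))         # component along w is now ~alpha
```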
6. Applications, Debiasing, and Data Augmentation
Applications of multilingual counterfactuals are broad and growing:
- Factuality and Knowledge Probing: Systematic auditing of encyclopedic recall, subject and demographic biases, and knowledge gaps in large LMs (2305.13675).
- Fairness and Bias Evaluation: Fine-grained analysis and mitigation of demographic, gender, and racial bias in sentiment analysis and other classification settings, with direct operationalization in data augmentation, regularization, and robustness auditing (Goldfarb-Tarrant et al., 2023).
- Detecting Counterfactual Language: Construction of reliable cross-lingual CFD detectors for consumer-facing NLP systems (O'Neill et al., 2021).
- Acronym Disambiguation and Low-Resource Transfer: Counterfactual data mixing, which interleaves in-language and cross-lingual contexts, balances data bias and regularizes overfitting in low-resource settings (Weng et al., 2021).
- Data Augmentation: CDA with in-language counterfactuals yields greater improvements for low-resource targets than cross-lingual transfer, conditional on the quality and minimality of generated perturbations; see the sketch after this list (Wang et al., 1 Jan 2026).
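In-language CDA then amounts to appending each validated counterfactual with its flipped label to the original training set. A minimal sketch, assuming counterfactuals are generated and filtered upstream (the `generate_cf` hook is a hypothetical placeholder):

```python
def augment_with_counterfactuals(dataset, generate_cf, flip_label):
    """dataset: list of (text, label) pairs; returns originals plus counterfactuals.

    generate_cf: callable text -> counterfactual text in the SAME language
    (in-language CDA); flip_label: callable mapping a label to its flipped value.
    Failed generations (None or unchanged text) are skipped, mirroring validation.
    """
    augmented = list(dataset)
    for text, label in dataset:
        cf_text = generate_cf(text)
        if cf_text is not None and cf_text != text:
            augmented.append((cf_text, flip_label(label)))
    return augmented

# Toy usage with a hand-written "generator"
data = [("the movie was good", "positive")]
gen = lambda t: t.replace("good", "bad")
flip = {"positive": "negative", "negative": "positive"}.get
print(augment_with_counterfactuals(data, gen, flip))
```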
A plausible implication is that careful curation and rigorous validation of counterfactuals—especially in low-resource and morphologically complex languages—are essential, and automated pipelines require additional filters (human-in-the-loop or artifact detection).
7. Limitations, Open Problems, and Future Directions
Despite substantial progress, significant limitations and open challenges remain:
- Resource Gaps: The best generative models still underperform in non-English, non-Latin, and low-resource languages, due to both limitations in training data and in cross-lingual alignment of counterfactual edits (2305.13675, Wang et al., 1 Jan 2026).
- Artifact Management: Automatic generation faces persistent issues (copy-paste, negation, incoherence, language confusion) that contaminate both training and evaluation (Wang et al., 1 Jan 2026).
- Annotation Protocol Diversity: Manual annotation remains necessary for high-precision counterfactual detection, and machine translation for CFD is suboptimal due to language-specific expression of irrealis and counterfactuality (O'Neill et al., 2021).
- Intervention Limitations: Probing techniques alter language bias and model priors but do not fully control semantic alignment or translation equivalence (Srinivasan et al., 2023).
Recommendations across studies include continued expansion to more typologically diverse languages, development of cross-lingual alignment techniques for counterfactual perturbation (e.g., MAPO post-training), more precise filtering and validation pipelines, and structured error analysis tied to end-application requirements (Wang et al., 1 Jan 2026, 2305.13675, O'Neill et al., 2021).
In summary, multilingual counterfactuals constitute a foundational methodology for precise, scalable, and interpretable measurement of model behavior across languages—enabling both critical diagnostic analysis and practical augmentation in multilingual and cross-lingual NLP research.