Multi-Language Switching Framework (MLSF)
- MLSF is a modular framework that synthesizes code-switched and multilingual NLI data by decoupling linguistic reasoning from lexical artifacts.
- It integrates synthetic data generation, neural machine translation, and embedding verification to validate semantic fidelity across diverse languages.
- Empirical results demonstrate that code-switching can regularize LLMs, enhancing cross-lingual reasoning and revealing performance gaps.
The Multi-Language Switching Framework (MLSF) is a modular paradigm for generating, evaluating, and leveraging code-switched and multilingual data in controlled experimental settings. It is designed to stress-test the logical and cross-lingual alignment capabilities of LLMs, decoupling linguistic reasoning from lexical artifacts and enabling rigorous comparison of monolingual versus mixed-LLM behavior. MLSF achieves this by synthesizing logic-based natural language inference (NLI) pairs, translating them across a typologically diverse language set, systematically constructing both monolingual and code-switched conditions, and validating semantic consistency through embedding analyses. Empirical findings from the MLSF pipeline demonstrate that code-switching can act as a regularizer, occasionally improving cross-lingual NLI performance and revealing characteristic brittleness in LLM multilingual reasoning (Abdaljalil et al., 20 Aug 2025).
1. Architecture and Workflow
MLSF is structured as a sequence of modular components enabling precise generation, translation, and evaluation of NLI data under diverse linguistic regimes:
- Synthetic NLI Data Generator: Employs abstract logic templates (Entailment, Contradiction, Neutral) instantiated over variable noun phrases A, B, C to generate premise–hypothesis pairs, ensuring controlled and unbiased semantic relations.
- Multilingual Neural Machine Translation (MT): Automatically translates each English pair into Arabic, German, French, Hindi, and Swahili, creating language diversity across scripts and morphologies.
- Code-Switching Constructor: For every language pair (L₁, L₂), creates pairs where premise is in L₁ and hypothesis in L₂, populating a full 6×6 grid (monolingual diagonals and code-switched off-diagonals).
- LLM Evaluation Interface: Prompts LLMs for NLI classification (Entailment, Contradiction, Neutral) using greedy, low-temperature decoding to prioritize consistent decision-making.
- Embedding Verification: Language-agnostic embeddings (LaBSE) and UMAP projection are utilized for semantic fidelity checks and visualization of translation-induced alignment in embedding space.
The pipeline maintains strict separation between logical content and linguistic surface form, mitigating confounds in cross-lingual evaluation.
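The evaluation interface above can be sketched as a thin wrapper around any LLM's generation call. This is a minimal illustration, not the paper's exact prompt or API: the prompt wording, the `generate` callable, and the fallback behavior are assumptions.

```python
# Zero-shot NLI classification via a prompted LLM (illustrative sketch).
# `generate` wraps the model's greedy, low-temperature decoding
# (e.g. temperature=0, short max generation length, per the protocol).
PROMPT = (
    "Premise: {premise}\n"
    "Hypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis, contradict it, or neither? "
    "Answer with exactly one word: Entailment, Contradiction, or Neutral.\n"
    "Answer:"
)

LABELS = ("Entailment", "Contradiction", "Neutral")

def classify(generate, premise, hypothesis):
    """Prompt the model and parse its reply into one of the three labels."""
    raw = generate(PROMPT.format(premise=premise, hypothesis=hypothesis))
    for label in LABELS:
        if label.lower() in raw.lower():
            return label
    return "Neutral"  # assumed fallback when the answer cannot be parsed
```

Because the premise and hypothesis slots are independent strings, the same wrapper serves every cell of the 6×6 grid, monolingual or code-switched.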
2. Formal NLI Representation
The synthetic data generation is governed by explicit logical or set-theoretic schemas:
- Entailment:
- Logical form: ∀x [P(x) ⇒ Q(x)] entails ∃x [P(x) ∧ Q(x)] under existential import (some A exist)
- Example: “All A are B.” ⇒ “Some A are B.”
- Contradiction:
- Logical form: premise ∀x [P(x) ⇒ Q(x)] contradicts hypothesis ¬∃x [P(x) ∧ Q(x)] under existential import
- Example: “All A are B.” vs. “No A are B.”
- Neutral:
- Set-theoretic: B and C disjoint; premise “Some A are B.” neither entails nor contradicts hypothesis “Some A are C.”
- Logical: ∃x [P(x) ∧ Q(x)] and ∃x [P(x) ∧ R(x)], with ¬∃x [Q(x) ∧ R(x)] (i.e., B ∩ C = ∅)
Placeholders A, B, C are mapped to semantically plausible noun phrases to keep the language natural and contextually coherent.
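The template instantiation step can be sketched as follows. The noun-phrase vocabulary and the exact template strings are illustrative assumptions, not the paper's templates:

```python
import random

# Illustrative noun phrases; MLSF maps A, B, C to semantically plausible
# phrases to keep the generated sentences natural.
NOUNS = ["dogs", "mammals", "reptiles", "pets", "songbirds"]

def make_pair(label, a, b, c=None):
    """Instantiate one premise-hypothesis pair from a logic template."""
    if label == "Entailment":      # ∀x[P(x)⇒Q(x)] entails ∃x[P(x)∧Q(x)]
        return (f"All {a} are {b}.", f"Some {a} are {b}.")
    if label == "Contradiction":   # ∀x[P(x)⇒Q(x)] vs. ¬∃x[P(x)∧Q(x)]
        return (f"All {a} are {b}.", f"No {a} are {b}.")
    if label == "Neutral":         # disjoint B, C leave the hypothesis open
        return (f"Some {a} are {b}.", f"Some {a} are {c}.")
    raise ValueError(f"unknown label: {label}")

def generate(n, seed=0):
    """Generate n (premise, hypothesis, label) triples with fresh noun fills."""
    rng = random.Random(seed)
    labels = ("Entailment", "Contradiction", "Neutral")
    data = []
    for _ in range(n):
        a, b, c = rng.sample(NOUNS, 3)
        data.append((*make_pair(rng.choice(labels), a, b, c), rng.choice(labels)[:0] or _))
    return data
```

Because labels come from the schema rather than annotation, every generated pair is gold-labeled by construction.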
3. Translation and Code-Switching Strategy
The translation module leverages state-of-the-art neural MT to maximize cross-script and morphological variance. Translation quality is systematically validated:
- Semantic embedding similarity: Cosine similarity between English source and target translation via LaBSE exceeds 0.81 for all languages, confirming high semantic preservation.
- Cluster consistency: UMAP reductions show language translations of the same English sentence form tight clusters, supporting the assertion that code-switched pairs preserve intended semantics.
For code-switching, the premise and hypothesis are permuted across all language combinations (including monolingual and cross-lingual), generating a balanced and comprehensive NLI evaluation matrix (36 cells × 1,000 examples per cell).
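The grid construction reduces to pairing index-aligned translations across every language combination. A minimal sketch (the data layout is an assumption; MLSF's internal representation may differ):

```python
from itertools import product

LANGS = ["en", "ar", "de", "fr", "hi", "sw"]

def build_grid(translated):
    """Build the 36-cell evaluation matrix.

    `translated[lang]` is a list of (premise, hypothesis, label) tuples,
    index-aligned across languages (item i is the same logical pair in
    every language). Diagonal cells (l1 == l2) are monolingual;
    off-diagonal cells take the premise from l1 and the hypothesis from l2.
    """
    grid = {}
    for l1, l2 in product(LANGS, LANGS):
        grid[(l1, l2)] = [
            (p1, h2, lab)
            for (p1, _, lab), (_, h2, _) in zip(translated[l1], translated[l2])
        ]
    return grid
```

With 1,000 aligned examples per language, this yields the balanced 36 × 1,000 matrix described above.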
4. Evaluation Metrics and Experimental Protocol
MLSF adopts rigorous quantitative and qualitative metrics:
- Classification Accuracy:
  Acc = (1/N) ∑ᵢ 𝟙[ŷᵢ = yᵢ], where yᵢ is the gold label and ŷᵢ the model prediction for example i, over the N examples in a cell.
- Translation Validation:
  cos(e_en, e_tgt) = (e_en · e_tgt) / (‖e_en‖ ‖e_tgt‖) ≥ 0.81, where e_en and e_tgt are LaBSE embeddings of the English source and its translation, ensuring high inter-lingual embedding alignment.
- Statistical Testing: While the original setup reports per-cell and aggregate accuracies without hypothesis tests, the framework is extensible to paired t-tests or McNemar’s test for future monolingual vs. code-switched comparisons.
Experiments are conducted under zero-shot, prompt-based evaluation with GPU inference (A100), low-temperature, greedy decoding, and capped generation length.
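The two quantitative checks above are straightforward to compute. A minimal sketch (the McNemar statistic shown is the standard continuity-corrected form, offered as one way to realize the proposed statistical extension, not the paper's implementation):

```python
def accuracy(gold, pred):
    """Acc = (1/N) * sum of 1[y_i == yhat_i] over the N examples in a cell."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def mcnemar_chi2(gold, pred_a, pred_b):
    """Continuity-corrected McNemar statistic over discordant pairs,
    a candidate test for monolingual vs. code-switched comparisons.
    b = examples system A gets right and B wrong; c = the reverse."""
    b = sum(g == a and g != b_ for g, a, b_ in zip(gold, pred_a, pred_b))
    c = sum(g != a and g == b_ for g, a, b_ in zip(gold, pred_a, pred_b))
    return 0.0 if b + c == 0 else (abs(b - c) - 1) ** 2 / (b + c)
```

Comparing a diagonal (monolingual) cell against its code-switched counterpart on the same 1,000 aligned examples is exactly the paired setting McNemar's test assumes.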
5. Main Empirical Results
Analysis of model performance under MLSF yields several striking observations:
- Monolingual Baselines:
- Fanar-9B achieves top accuracy (65.1% in English, 60% in other languages); Gemma-7B lags at 17% (English).
- Language performance gap: En > Fr/De > Sw/Hi in most LLMs, though some models show more balanced multilingual robustness.
- Code-Switching Effects:
- Code-switched cells often outperform the monolingual diagonal.
- Notable improvement: Gemma-7B (English→Hindi: 17.0% → 32.9%), Mistral-7B (Arabic→English: 28.2% → 36.4%).
- This suggests that translation-induced lexical and syntactic diversity may regularize models to attend to deeper logical cues, not just superficial lexical patterns.
- Semantic Embedding Analysis:
- UMAP visualization shows tight translation clusters, supporting genuine cross-lingual reasoning gaps (rather than translation-induced errors) as the performance bottleneck.
6. Design Implications and Recommendations
MLSF reveals that code-switching acts as a powerful regularization and analysis tool for LLM multilingual robustness:
- Regularization via Code-Switching: Construction of synthetic code-switched NLI pairs disrupts model reliance on language-specific artifacts, compelling models to align on logic rather than lexicon or structure.
- Template + Translation Hybrid: Combining programmatic logic templates with neural MT yields controlled logical coverage and script/morphological diversity, suggesting a scalable recipe for future multilingual LLM construction.
- Semantic Quality Loop: Embedding-based similarity thresholds (cosine similarity ≥ 0.81) serve as an effective translation filter, safeguarding semantic fidelity across translation pipelines.
- Modular Extensibility: MLSF’s component-based design facilitates extension to additional languages (especially low-resource or typologically distanced ones) and supports future incorporation of complex logical relations.
- Statistical Module Integration: Future versions can integrate formal statistical testing for robust comparison across monolingual and code-switched performance, enhancing empirical rigor.
In summary, MLSF constitutes a reproducible, logic-grounded paradigm for evaluating and stress-testing LLMs under high-variance multilingual and code-switched conditions. The discovery that code-switching can enhance rather than confound logical NLI performance informs new approaches to multilingual LLM regularization and cross-lingual generalization (Abdaljalil et al., 20 Aug 2025).