Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

Published 12 Apr 2026 in cs.CL and cs.AI | (2604.10590v1)

Abstract: Multilingual LLMs struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.

Abstract PDF Upgrade to Chat

Authors (8)

Summary

The paper introduces a cross-lingual mapping loss integrated with next-token prediction to enhance multilingual LLM performance and mitigate catastrophic forgetting.
It reports significant improvements with increased BLEU scores, BERTScore-Precision, and accuracy, especially for low-resource languages like Czech and Ukrainian.
A novel Language Alignment Coefficient is proposed to robustly quantify cross-lingual consistency and provide a stable alignment metric.

Cross-Lingual Mapping in Pre-Training for Enhanced Multilingual LLM Performance

Motivation and Problem Formulation

The increasing deployment of multilingual LLMs has exposed persistent inadequacies in cross-lingual generalization, especially affecting low-resource languages. Standard monolingual next-token prediction (NTP) pre-training accentuates monolingual fluency but fails to promote robust cross-lingual alignment, resulting in sub-optimal transfer for key tasks like machine translation (MT), cross-lingual question answering (CLQA), and natural language understanding (CLNLU). Bilingual fine-tuning and contrastive alignment methods encounter instability or rely on expensive parallel corpora and are largely incompatible with decoder-only LLMs. This motivates novel pre-training objectives and datasets capable of fostering stronger, instruction-agnostic cross-lingual correspondences during LLM training.

Cross-Lingual Mapping Objective and Pre-Training Framework

The paper introduces a cross-lingual mapping (CL) loss symmetrically integrated into continued pre-training alongside the NTP objective using a decoder-only architecture. NTP optimizes negative log-likelihood over monolingual sequences, supporting retention of high-resource language proficiency and mitigating catastrophic forgetting by maintaining a balanced batch composition across language strata. The CL task, in contrast, trains the model to autoregressively generate target-language sequences conditioned on full source-language inputs, using parallel sentence pairs. This dual formulation enforces alignment in the shared embedding space and enhances both multilingual comprehension and generative transfer, without explicit translation instructions or architectural modifications.

Figure 1: Schematic of continued pre-training, highlighting the interplay between Next-Token Prediction and Cross-Lingual Mapping; "<s>" denotes the start token and $m_i$ the mask for the $i$ th token.

Empirically, the proposed CL objective is generalizable beyond supervised translation by promoting implicit semantic transfer across languages. This differentiates it sharply from prior approaches focusing on contrastive alignment, bilingual masking, or translation-instruction-based pre-training, and is uniquely suited for scalable extension to low-resource or typologically divergent language pairs.

Language Alignment Coefficient: Quantifying Cross-Lingual Consistency

Addressing weaknesses in established cosine similarity or projection variance-based alignment metrics, the study proposes the Language Alignment Coefficient (LAC), defined as the average embedding cosine similarity across layers normalized by standard deviation (the inverse of Coefficient of Variation). This metric provides robust quantification of alignment magnitude and stability, offering resilience against outliers and variability in shallow or noisy test sets. Higher LAC directly reflects enhanced cross-lingual coherence, especially crucial for evaluating models in low-resource or domain-shifted conditions.

Figure 2: Comparative analysis of language alignment (LAC) across model variants, indicating the dominant contribution of the joint CL objective in boosting cross-lingual consistency.

Data Curation and Setup

Experiments span diverse typological pairs, including high-resource (EN-ZH, ZH-JP), mid-resource (EN-CS), and low-resource (CS-UK) scenarios. The evaluation encompasses MT, CLQA, CLNLU, and summarization, leveraging datasets such as Cleaned Alpaca, CrossSum, OpenHermes-2.5, XQuAD, and Belebele; pre-training incorporates approximately 4.1M monolingual and 3.7M parallel cross-lingual sentence instances. Careful separation of training and evaluation splits ensures uncompromised generalization assessment.

Empirical Results

Language Modeling and Alignment

The joint NTP+CL pre-training schema (Llama_CL) produces substantial cross-lingual improvements over monolingual (Llama_NTP) and bilingual (Llama_Bi_NTP) NTP baselines, with complementary reductions in perplexity across all languages under test, notably for low-resource Czech and Ukrainian. Explicit modeling of bilingual mappings produces consistent LAC elevation and improved latent alignment, outperforming models with code-switched or instruction-tuned translation data. The gains extend to zero-shot language pairs, indicating generalization beyond observed alignments.

Downstream Task Performance

MT demonstrates BLEU increases up to 11.9 points (e.g., EN-CS), BERTScore-Precision increases of over 6.7 in CLQA, and >5% accuracy improvements in CLNLU relative to strong instruction-tuned Llama-3-8B baselines using identical downstream datasets. Comparison with models incorporating explicit translation instructions in pre-training reveals that such instruction signals can induce overfitting and limit general-purpose cross-lingual transfer, a phenomenon not observed in the instruction-agnostic CL approach. Notably, the proposed methodology yields even stronger relative improvements when applied to weaker base architectures (e.g., BLOOM-3B), suggesting its suitability as a universal enhancement.

Open-Ended and Reference-Free QA

LLM-assisted evaluation on open-ended QA using human and model-based judgments verifies consistent generative quality improvements, both in content and stylistic complexity, directly attributable to stronger semantic alignment from CL pre-training.

Ablation and Analysis

Ablations confirm that data volume is not solely responsible for the gains; embedding cross-lingual tasks as CL objectives in pre-training outperforms both pre- and post-hoc instruction-tuning on parallel data. The CL loss provides an alignment inductive bias unavailable from increased parallel data alone, and does not compromise performance on English monolingual understanding, as demonstrated with MMLU, LogiQA, and AlpacaEval benchmarks.

Case studies further reveal reduced source-language copying, improved treatment of rare words, and decreased translation/reasoning errors relative to NTP or instruction-tuned translation counterparts.

Theoretical and Practical Implications

The cross-lingual mapping paradigm clarifies that explicit pre-training objectives, rather than task-specific fine-tuning or reliance on large parallel corpora after pre-training, are necessary for robust and transferable cross-lingual representation learning in decoder-only LLMs. Importantly, the improved flexibility and stability in alignment metrics enable scalable evaluation pipelines for multilingual models, especially in the context of limited or noisy resources.

The approach aligns with trends observed in recent surveys and empirical studies highlighting the need for architecture-agnostic, instruction-agnostic, and data-efficient cross-lingual adaptation [see, e.g., (Singh et al., 2024), 2024.emnlp-main.572].

Conclusion

The paper introduces an instruction-agnostic, cross-lingual mapping objective for decoder-only LLMs, supported by a robust alignment metric (LAC) and extensive empirical evaluation spanning MT, CLQA, CLNLU, and summarization. The results demonstrate clear superiority over conventional fine-tuning and bilingual NTP strategies, yielding pronounced BLEU and BERTScore-Precision gains. This methodology establishes a scalable template for future multilingual LLM research, with immediate practical implications in both high- and low-resource language settings. Future research should address remaining gaps in abstractive summarization and logical reasoning, and extend the framework to broader linguistic coverage and more complex cross-lingual reasoning paradigms.

References

(2604.10590)

Markdown Report Issue