
Cross-Lingual Continuous Pretraining

Updated 15 December 2025
  • Cross-lingual continuous pretraining is a method where a multilingual model is further optimized using targeted objectives and diverse data to improve language alignment.
  • It employs techniques like masked language modeling, translation language modeling, and contrastive losses to enhance zero-shot transfer and downstream task performance.
  • Practical insights include using typologically diverse corpora, interleaved objective scheduling, and strategies to mitigate catastrophic forgetting during pretraining.

Continuous pretraining in cross-lingual settings refers to further optimizing an LLM—already pretrained on monolingual or multilingual corpora—using objectives and data that strengthen its ability to align, represent, and transfer knowledge across languages. Cross-lingual continuous pretraining (CLCP) encompasses numerous strategies, including continued masked language modeling on selected languages, leveraging parallel or comparable corpora, applying contrastive or alignment-based losses, and carefully mixing supervised and unsupervised objectives. This paradigm is foundational for high-quality zero-shot cross-lingual transfer, improved multilingual understanding, and effective adaptation to new languages, directly impacting downstream applications in translation, retrieval, information extraction, and beyond.

1. Core Objectives and Pretraining Strategies

Cross-lingual continuous pretraining generally extends initial (monolingual or multilingual) pretraining with additional objectives and/or targeted data, seeking to optimize representational alignment and task transfer.

Masked Language Modeling (MLM) Adaptation:

Continued MLM on target language(s) boosts zero-shot performance, especially when selecting a diverse, script-varied training set (Fujinuma et al., 2022, Lample et al., 2019). The loss $L_{\mathrm{MLM}} = -\sum_{i \in M} \log P(x_i \mid \tilde{x}_{\setminus i}; \theta)$ remains the standard for representation learning, with uniform sampling across languages to avoid corpus imbalance.
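As a concrete illustration, the MLM loss above reduces to a cross-entropy evaluated only at masked positions. The PyTorch sketch below assumes the common convention of marking unmasked positions with a label of -100; all shapes and values are illustrative.

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Masked-LM loss: cross-entropy over masked positions only.

    logits: (batch, seq_len, vocab_size) model outputs.
    labels: (batch, seq_len) original token ids at masked positions,
            -100 everywhere else (the usual "ignore" convention).
    """
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,  # positions outside the mask set M contribute nothing
    )

# Toy usage: 2 sequences of 8 tokens, vocabulary of 100, three masked positions.
logits = torch.randn(2, 8, 100)
labels = torch.full((2, 8), -100, dtype=torch.long)
labels[0, 2], labels[0, 5], labels[1, 1] = 17, 42, 7
print(mlm_loss(logits, labels))
```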

Translation Language Modeling (TLM):

With access to parallel corpora, TLM concatenates sentence pairs and masks tokens across both, optimizing cross-lingual contextual prediction. This explicit context mixing encourages alignment at the sentence level (Lample et al., 2019): $L_{\mathrm{TLM}}(\theta) = -\sum_{(x,y)}\left[\sum_{t \in M_x} \log p(x_t \mid x_{\setminus M_x}, y; \theta) + \sum_{t \in M_y} \log p(y_t \mid y_{\setminus M_y}, x; \theta)\right]$
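A minimal sketch of how a TLM training example can be assembled from one parallel sentence pair, assuming a generic tokenizer that exposes mask and separator token ids; the 15% masking rate is the conventional default, not a requirement of the objective.

```python
import random

def make_tlm_example(src_ids, tgt_ids, mask_id, mask_prob=0.15, sep_id=None):
    """Build one TLM example: concatenate a parallel pair and mask tokens on
    *both* sides, so masked source tokens can be predicted from target context
    and vice versa. Token-id arguments are placeholders for whatever tokenizer
    is in use.
    """
    ids = list(src_ids) + ([sep_id] if sep_id is not None else []) + list(tgt_ids)
    labels = [-100] * len(ids)          # -100 = position is not predicted
    for i, tok in enumerate(ids):
        if tok != sep_id and random.random() < mask_prob:
            labels[i] = tok             # predict the original token here
            ids[i] = mask_id            # replace the input with [MASK]
    return ids, labels
```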

Sentence-level Pretraining with Cross-lingual Structure:

ParaLaw Nets demonstrate that tasks such as Next Foreign Sentence Prediction (NFSP) and Neighbor Multilingual Sentence Prediction (NMSP) can utilize explicit cross-language neighbor relations as supervision signals, formulated as classification problems with cross-entropy losses (Nguyen et al., 2021).

Contrastive and Alignment-based Objectives:

Contrastive objectives (e.g., InfoNCE) pull representations of translation equivalents closer while pushing apart non-equivalents, at both sentence and word granularity (Chen et al., 2022, Li et al., 2023). Token-level MaxSim, cross-zero NCE, and multi-level contrastive learning are effective choices: $L_{\mathrm{InfoNCE}}(x, y) = -\log \frac{\exp(\cos(x, y)/\tau)}{\sum_{j}\exp(\cos(x, k_j)/\tau)}$
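The in-batch variant of this loss, where each translation pair supplies the positive and the remaining batch entries serve as the negatives $k_j$, can be sketched in PyTorch as follows; the temperature and batch construction are illustrative choices.

```python
import torch
import torch.nn.functional as F

def info_nce(x: torch.Tensor, y: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """Sentence-level InfoNCE over a batch: x_i and y_i are embeddings of a
    translation pair; every other y_j in the batch is a negative.
    x, y: (batch, dim), assumed to come from the same encoder.
    """
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature            # cosine similarities / tau
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, targets)     # -log softmax of the positive pair

# Toy usage with random vectors standing in for encoder outputs.
print(info_nce(torch.randn(16, 768), torch.randn(16, 768)))
```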

Auxiliary Pretext Tasks:

Injecting pretext tasks such as code-switching restore (CSR) reduces domain/task gaps by compelling the model to denoise input sequences containing randomly inserted pseudo-translation spans, thereby shrinking cross-lingual representation distances and speeding up convergence (Zan et al., 2022).
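A toy sketch of the code-switching corruption step, using a simple word-level bilingual lexicon as a stand-in for the pseudo-translation spans used in the cited work; the model would then be trained to restore the clean sequence from the corrupted one.

```python
import random

def code_switch_restore_example(tokens, bilingual_lexicon, switch_prob=0.15):
    """Build one corrupted/clean pair for a code-switching-restore objective.
    `bilingual_lexicon` is a hypothetical {source_word: translated_word} dict;
    real systems typically draw the inserted spans from aligned phrase tables
    or MT output rather than a word lexicon.
    """
    corrupted = [
        bilingual_lexicon.get(tok, tok) if random.random() < switch_prob else tok
        for tok in tokens
    ]
    return corrupted, tokens   # (noised input, denoising target)

# Usage sketch
lex = {"law": "loi", "court": "tribunal"}
print(code_switch_restore_example("the court applied the law".split(), lex, 0.5))
```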

2. Data Selection and Corpus Construction

Monolingual and Multilingual Corpora:

Public resources (Wikipedia, CC100, mC4, web crawls) supply scalable monolingual and multilingual datasets, critical for MLM-based adaptation and analysis of pretraining dynamics (Blevins et al., 2022).

Parallel and Comparable Data:

High-quality parallel corpora (e.g., UN Parallel, Europarl, OPUS, Bactrian-X, Japanese-English Law) underpin TLM and contrastive pretraining. Where parallel data is scarce, comparable corpora (Wikipedia interwiki links, retrieved web passages) can serve for weak contrastive supervision (Yang et al., 2022, Wu et al., 29 Apr 2025). Algorithms for paragraph-level alignment, sliding-window segmentation, and retrieval-augmented pairing maximize sample richness under context length constraints.
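As one example of the segmentation step, the sketch below cuts long documents into overlapping token windows so that aligned paragraph pairs fit within a fixed context budget; the window and stride values are placeholders, not settings from the cited papers.

```python
def sliding_windows(token_ids, window: int, stride: int):
    """Split a long token sequence into overlapping windows so every segment
    fits the model's context length. A minimal sketch of the segmentation
    idea described above, not any specific paper's recipe.
    """
    if len(token_ids) <= window:
        return [token_ids]
    return [token_ids[i:i + window]
            for i in range(0, len(token_ids) - window + stride, stride)]

# e.g. a 1,000-token document, 512-token context, 256-token overlap
segments = sliding_windows(list(range(1000)), window=512, stride=256)
print([len(s) for s in segments])
```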

Semantic Augmentation:

Semantic retrieval extends the availability of in-context pairs beyond strict factual overlap, enabling large-scale augmentation of cross-lingual contexts using embeddings and approximate nearest neighbor search (Wu et al., 29 Apr 2025).
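A minimal retrieval sketch in this spirit, using FAISS as one possible nearest-neighbour backend; the encoder producing the embeddings, the pool construction, and the value of k are assumptions rather than the cited pipeline.

```python
import numpy as np
import faiss  # nearest-neighbour search library; any ANN backend would do

def build_retrieval_index(embeddings: np.ndarray) -> faiss.Index:
    """Index L2-normalised embeddings so that inner product equals cosine."""
    index = faiss.IndexFlatIP(embeddings.shape[1])  # exact search; swap for an HNSW index at scale
    faiss.normalize_L2(embeddings)
    index.add(embeddings)
    return index

def retrieve_cross_lingual_pairs(index: faiss.Index, query_embeddings: np.ndarray, k: int = 4):
    """For each query embedding (e.g. an English passage), return the k most
    similar passages from the other-language pool to use as in-context pairs."""
    faiss.normalize_L2(query_embeddings)
    scores, ids = index.search(query_embeddings, k)
    return scores, ids

# Usage sketch with random vectors standing in for real encoder outputs.
pool = np.random.rand(10000, 768).astype("float32")
queries = np.random.rand(8, 768).astype("float32")
index = build_retrieval_index(pool)
scores, ids = retrieve_cross_lingual_pairs(index, queries, k=4)
```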

| Corpus Type | Example Use | Reference |
|---|---|---|
| Parallel | TLM, contrastive alignment | (Lample et al., 2019, Chen et al., 2022) |
| Comparable | Weakly supervised document-level loss | (Yang et al., 2022) |
| Monolingual | MLM adaptation, dynamic sampling | (Fujinuma et al., 2022, Blevins et al., 2022) |

3. Model Architectures and Adaptation Approaches

Encoder-oriented Models:

Models such as mBERT, XLM-R, and Info-XLM follow a standard 12-layer Transformer encoder, with modifications in embedding distributions and tokenization to accommodate multilingual vocabularies (Blevins et al., 2022, Chen et al., 2022).

Seq2Seq and Decoder-only LLMs:

Encoder-decoder (mBART, mT5, XLM) and decoder-only (LLaMA, BLOOM, XGLM) architectures each support cross-lingual adaptation. For LLMs, continual pretraining (CPT or CLCP) on a new language via the original self-supervised objective is highly compute-efficient and scales favorably in compute–loss tradeoffs (Zheng et al., 2 Jul 2024).

Architecture Modifications:

Most CLCP methods do not require major architectural changes; they instead rely on task-driven or loss-driven input composition, retraining strategies (e.g., LoRA for parameter efficiency), or dynamic embedding re-initialization (active forgetting) to enhance cross-lingual transfer (Aggarwal et al., 21 Oct 2024).
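For the parameter-efficient route, a typical setup wraps the base model with LoRA adapters, for example via the Hugging Face peft library; the checkpoint name, rank, and target modules below are illustrative choices, not a prescription from the cited work.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base decoder-only model to adapt (placeholder checkpoint name).
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

# Low-rank adapters on the attention projections; only these weights are trained
# during continued pretraining, keeping the update cheap and easy to revert.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["query_key_value"],  # module names depend on the architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```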

Active Forgetting Mechanisms:

Aggarwal et al. demonstrate that decoder-only LLMs benefit from periodic re-initialization of token embeddings during pretraining, which prevents embedding overspecialization and yields lower perplexity and stronger cross-lingual transfer (Aggarwal et al., 21 Oct 2024).
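A schematic of the periodic re-initialization idea, assuming a Hugging Face-style model that exposes `get_input_embeddings()`; the reset interval and initialization scale are illustrative values, not those reported by Aggarwal et al.

```python
import torch

def maybe_reset_embeddings(model, step: int, reset_every: int = 10_000, std: float = 0.02) -> None:
    """Active-forgetting sketch: every `reset_every` optimizer steps,
    re-initialize the input token embeddings while leaving the rest of the
    network untouched, so the body must re-learn to read a fresh embedding
    space rather than overspecializing to one vocabulary geometry.
    """
    if step > 0 and step % reset_every == 0:
        emb = model.get_input_embeddings()   # HF-style accessor for the token embedding table
        with torch.no_grad():
            emb.weight.normal_(mean=0.0, std=std)
```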

4. Dynamics, Scaling Laws, and Evaluation

Pretraining Dynamics:

Monolingual in-language performance saturates quickly, while cross-lingual transferability emerges slowly and is language-pair dependent. High-resource and typologically similar pairs (e.g., English-Spanish) align earlier; low-resource or distant pairs (e.g., English-Arabic) require more iterations (Blevins et al., 2022).

Scaling Laws:

Continual pretraining from a multilingual base model to a new language delivers faster convergence and compute savings. Empirically, the loss follows $L(N, D) = E + A N^{-\alpha} + B' D^{-\beta'} N^{-\gamma}$, where the $N^{-\gamma}$ term accounts for the data–parameter transfer effect. Optimal allocation shifts toward larger models and fewer new-language tokens (Zheng et al., 2 Jul 2024).
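Once the constants are fitted to observed (N, D, loss) points, the formula can be used to compare allocation choices. The helper below simply evaluates the expression; every constant in the usage example is made up for illustration and is not a fit from Zheng et al.

```python
def cpt_scaling_loss(N: float, D: float, E: float, A: float, alpha: float,
                     B_prime: float, beta_prime: float, gamma: float) -> float:
    """Evaluate L(N, D) = E + A*N^-alpha + B'*D^-beta' * N^-gamma for a given
    parameter count N and number of new-language tokens D."""
    return E + A * N ** (-alpha) + B_prime * D ** (-beta_prime) * N ** (-gamma)

# Illustrative comparison: doubling model size vs. doubling new-language tokens.
base = dict(E=1.7, A=400.0, alpha=0.34, B_prime=800.0, beta_prime=0.28, gamma=0.08)
print(cpt_scaling_loss(7e9, 5e10, **base))    # baseline allocation
print(cpt_scaling_loss(14e9, 5e10, **base))   # larger model, same data
print(cpt_scaling_loss(7e9, 1e11, **base))    # same model, more new-language data
```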

| Method | Compute Savings | Scaling Law Modification | Reference |
|---|---|---|---|
| CLCP (CPT) | 25–50% | Data–parameter joint term | (Zheng et al., 2 Jul 2024) |
| Script-diverse MLM | N/A | Nearly linear gain up to N = 32 | (Fujinuma et al., 2022) |

Downstream Evaluations:

Zero-shot transfer, cross-lingual retrieval, NLI, NER, sequence labeling, summarization, and QA provide standard benchmarks. Recent works show consistent gains in top-10 retrieval, BLEU, and F1 from CLCP versus raw multilingual pretrained models (Yang et al., 2022, Zan et al., 2022, Chen et al., 2022).

5. Limitations and Objective-Centric Controversies

Supervised MT Objectives: Negative Impact on Cross-lingual Transfer:

Despite sharing an explicit alignment motivation with CLCP, continued pretraining with a pure neural-MT cross-entropy objective can degrade the cross-lingual feature sharing that classification and sequence labeling (unlike translation) depend on. This is attributed to "output separability": representations become more language-specific rather than language-agnostic, harming downstream zero-shot generalization (Ji et al., 25 Mar 2024). Centered kernel alignment (CKA) analysis confirms that MT-CP reduces cross-language similarity, increasing norm and subspace separation.

Shallow Alignment and Conductivity:

Both mixed and continued pretraining boost shallow cross-lingual alignment (performance, simple consistency), but knowledge "conductivity" (the ability to answer synthetic cross-lingual factual queries) remains very low without dedicated alignment objectives (Gao et al., 6 Apr 2024).

Resource and Data Constraints:

Parallel corpora are scarce for many languages; reliance on monolingual or retrieval-augmented comparable data is prevalent. Furthermore, indiscriminate single-language continued pretraining may harm other languages’ performance in mixed LLMs (Gao et al., 6 Apr 2024).

6. Best Practices and Recommendations

  • Prioritize Script and Typological Diversity: When expanding to new languages, select typologically diverse training sets and apply uniform batch sampling (Fujinuma et al., 2022).
  • Interleave or Mix Pretraining Objectives: For large LLMs, mix self-supervised and supervised (e.g., MT) objectives via adaptive scheduling or curriculum learning; bandit algorithms (e.g., FAIR) can dynamically allocate compute for maximal transferability (Schioppa et al., 2023).
  • Leverage Alignment Losses with Modest Data: Lightweight contrastive or alignment-based continual pretraining, even with 0.05 ‰ of total tokens, can substantially boost cross-lingual consistency and representation overlap (Li et al., 2023, Chen et al., 2022).
  • Mitigate Catastrophic Forgetting: To protect preexisting language capabilities when applying target-language CPT, interleave 5%–30% source-language data during updates (Zheng et al., 2 Jul 2024); a minimal sampling sketch follows this list.
  • Align Pretraining and Fine-tuning Objectives: Where possible, pair span extraction or input/output interfaces at pretrain and fine-tune time to accelerate convergence and maximize task transfer (e.g. [QUE] marker for QA) (Chen et al., 2022).
  • Beware Direct MT-only CP for Transfer Tasks: For cross-lingual transfer in NLU, do not rely solely on continued pure machine translation training—combine with regularization, shared embedding losses, or maintain some MLM/denoising steps (Ji et al., 25 Mar 2024).
  • Exploit Continual Learning Techniques: Apply gradient episodic memory or EWC frameworks to preserve pretraining objectives during downstream adaptation, especially in rapid fine-tuning or few-shot regimes (Liu et al., 2020).
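As referenced in the catastrophic-forgetting item above, a simple way to interleave source-language data during target-language CPT is a ratio-based sampler like the one sketched below; the 20% ratio is just one point within the recommended 5%–30% range.

```python
import random

def mixed_language_stream(target_docs, source_docs, source_ratio=0.2):
    """Yield documents for continual pretraining, interleaving roughly
    `source_ratio` of original source-language data with the new
    target-language data to limit catastrophic forgetting.
    A minimal sampling sketch, not a full data-loading pipeline.
    """
    while True:
        if random.random() < source_ratio:
            yield random.choice(source_docs)
        else:
            yield random.choice(target_docs)

# Usage sketch
stream = mixed_language_stream(["tgt doc"] * 100, ["src doc"] * 100, source_ratio=0.2)
sample = [next(stream) for _ in range(10)]
```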

7. Ongoing Research Directions

  • Factually-grounded Multilingual Objectives: Develop objectives that align representations of the same facts across languages, potentially via cross-lingual contrastive sentence-embedding or retrieval-augmented architectures (Gao et al., 6 Apr 2024).
  • Sliding-window and Semantic Retrieval Approaches: Refine in-context data construction for LLMs, including sliding-window policies and dynamic retrieval augmentation, to maximize context utilization and alignment (Wu et al., 29 Apr 2025).
  • Dynamic Pretraining Schedules and Layer Ensembles: Adjust language sampling and checkpoint selection strategies to exploit the differential emergence of in-language and cross-language signal; ensemble middle layers for robust cross-lingual transfer (Blevins et al., 2022).

Research in CLCP continues to refine objectives, corpus design, adaptation protocols, and evaluation to further enhance the depth and universality of multilingual models in the presence of significant resource and typological constraints. This ongoing exploration has established continuous pretraining as a critical tool for effective cross-lingual NLP.
