
XLM-R XL: Scalable Multilingual Model

Updated 20 February 2026
  • XLM-R XL is a large-scale multilingual masked language model with 3.5 billion parameters that scales the XLM-R architecture to enhance representation across high- and low-resource languages.
  • The model employs robust pretraining on the CC100 corpus using temperature-based sampling and a masked language modeling objective, achieving state-of-the-art results on XNLI and GLUE benchmarks.
  • Increased model capacity leads to monotonic performance gains, particularly benefiting low-resource languages and illustrating the potential and limitations of scaling within fixed data regimes.

XLM-R XL is a large-scale multilingual masked language model, introduced as a direct scale-up of the XLM-R architecture, with 3.5 billion parameters. Developed within the framework of cross-lingual language model pretraining, XLM-R XL is designed to improve representation quality across both high- and low-resource languages by increasing model capacity and applying robust pretraining strategies on massive multilingual corpora. The model achieves state-of-the-art results on multiple cross-lingual and monolingual benchmarks and demonstrates monotonic improvements as model scale increases (Goyal et al., 2021).

1. Model Architecture

XLM-R XL extends the XLM-R design, retaining the core Transformer architecture and scaling its constituent dimensions:

Hyperparameter            Value
Transformer Layers (L)    36
Hidden Size (H)           2,560
Attention Heads (A)       32
Head Size (H/A)           80
Feedforward Inner Dim.    10,240
Vocabulary Size           250,000
Parameter Count           ≈ 3.5B

All architectural hyperparameters not explicitly specified adhere to those of XLM-R (Conneau et al., 2019). XLM-R XL introduces pre-layer normalization throughout the network, a modification empirically shown to improve training stability at large scale. Unlike some multilingual models, the embedding size equals the hidden size and no adapter or projection layers are used beyond those in XLM-R.
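The reported parameter count can be sanity-checked from the table above with a back-of-the-envelope calculation. This is a sketch that assumes tied input/output embeddings and ignores biases and layer-norm parameters, which contribute well under 1% of the total:

```python
# Rough parameter-count estimate for XLM-R XL from the hyperparameters above.
L, H, FF, V = 36, 2560, 10240, 250_000

embedding = V * H                   # token embedding matrix
attention_per_layer = 4 * H * H     # Q, K, V and output projections
ffn_per_layer = 2 * H * FF          # two feed-forward projections
total = embedding + L * (attention_per_layer + ffn_per_layer)

print(f"{total / 1e9:.2f}B parameters")  # ≈ 3.47B, consistent with the ~3.5B reported
```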

2. Training Data and Pretraining Objective

Corpus and Sampling

XLM-R XL is pretrained on CC100, a web-crawled corpus spanning 167 billion tokens across 100 languages (Wenzek et al., 2019). The corpus is sampled using the same temperature-based strategy as XLM-R, with an exponent α = 0.3, to mitigate the long-tail imbalance of language resources.
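The temperature-based sampling above can be sketched as follows; sampling probabilities are proportional to each language's corpus share raised to the power α = 0.3, so low-resource languages are upsampled relative to their raw share. The token counts below are made-up illustrative numbers, not CC100 statistics:

```python
# Sketch of temperature-based language sampling with exponent alpha.
def sampling_probs(token_counts, alpha=0.3):
    total = sum(token_counts)
    shares = [n / total for n in token_counts]       # raw corpus shares
    weights = [s ** alpha for s in shares]           # temperature smoothing
    z = sum(weights)
    return [w / z for w in weights]                  # normalized probabilities

# Hypothetical corpus: one high-resource and one low-resource language.
counts = [990, 10]              # raw shares: 0.99 vs 0.01
probs = sampling_probs(counts)
print(probs)                    # low-resource share is boosted well above 0.01
```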

Masked Language Modeling Objective

Training employs the standard multilingual masked language modeling (MLM) objective:

\mathcal{L}_{\mathrm{MLM}} = -\sum_{t \in M} \log P\left(x_t \mid x_{\setminus M}\right)

where M is the set of masked token positions. The masking follows the BERT/XLM-R convention: 15% of tokens are randomly selected for masking; of these, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged.
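The 80/10/10 masking scheme can be sketched as below. The function name, the [MASK] id, and the use of -100 as an ignore label are illustrative conventions, not part of the original training code:

```python
import random

# Sketch of BERT-style 15% / 80-10-10 masking for the MLM objective.
def mask_for_mlm(token_ids, mask_id, vocab_size, rng, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = not predicted
    for i, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                         # 85%: token not selected
        labels[i] = tok                      # loss is computed on the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id              # 80% of selected: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with random token
        # remaining 10%: keep the token unchanged
    return inputs, labels

rng = random.Random(0)
inputs, labels = mask_for_mlm(list(range(10_000)), mask_id=4,
                              vocab_size=250_000, rng=rng)
masked_frac = sum(l != -100 for l in labels) / 10_000
print(masked_frac)   # close to the 15% masking rate
```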

3. Optimization and Hyperparameters

Pretraining Configuration

  • Batch Size: 2,048 sequences of length 512 (≈ 1.05M tokens per update)
  • Updates: 500,000 (≈ 0.5T tokens observed)
  • Optimizer: Adam (β₁ = 0.9, β₂ = 0.98, ε = 1×10⁻⁶)
  • Learning Rate: Linear warmup (10,000 updates) to 1e-4 peak; linear decay thereafter
  • Model Parallelism: Tensor-parallel degree 2 (Megatron-style)
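The learning-rate schedule above can be sketched as a simple function of the update step. The decay endpoint of zero at the final update is an assumption; the source states only "linear decay thereafter":

```python
# Sketch of the schedule: linear warmup for 10,000 updates to a 1e-4 peak,
# then linear decay (assumed here to reach zero at update 500,000).
def lr_at(step, peak=1e-4, warmup=10_000, total=500_000):
    if step < warmup:
        return peak * step / warmup                # linear warmup
    return peak * (total - step) / (total - warmup)  # linear decay

print(lr_at(10_000))   # peak: 1e-4
print(lr_at(255_000))  # halfway through decay: 5e-5
```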

Fine-tuning Strategy

  • Batch Size: 32
  • Epochs: Up to 10, with early stopping on average dev metric

Training was executed on a multi-GPU cluster, requiring several weeks of wall-clock time for XLM-R XL; the even larger XLM-R XXL utilized a tensor-parallel degree of 8.
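The idea behind Megatron-style tensor parallelism can be illustrated with a toy column-parallel linear layer: the weight matrix is split by columns across workers, each worker computes its shard locally, and the outputs are concatenated. Shapes here are toy-sized, not the real model dimensions:

```python
import numpy as np

# Toy sketch of column-parallel tensor parallelism with degree 2.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))        # batch of activations (replicated on both workers)
W = rng.standard_normal((8, 6))        # full weight matrix of one linear layer

shards = np.split(W, 2, axis=1)        # one column shard per worker
partial = [x @ w for w in shards]      # each worker's local matmul
y_parallel = np.concatenate(partial, axis=1)

# The sharded computation matches the unsharded layer exactly.
assert np.allclose(y_parallel, x @ W)
```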

4. Evaluation Results

Cross-Lingual Natural Language Inference (XNLI)

Performance is reported as average accuracy across 15 languages under zero-shot transfer (English-only fine-tuning) and translate-train-all (multi-language via translation):

Model        Params   Zero-Shot Avg   Translate-Train-All Avg
XLM-R Base   270M     76.2            79.1
XLM-R Large  550M     80.9            83.6
XLM-R XL     3.5B     82.3            85.4
XLM-R XXL    10.7B    83.1            86.0
  • XLM-R XL surpasses XLM-R Base by +6.1 (zero-shot) and +6.3 (translate-all) points.
  • XLM-R XL provides a +1.8 point absolute average accuracy improvement over XLM-R Large on “translate-train-all.”
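The deltas quoted above follow directly from the table; a quick check, with the accuracies copied from the table ("zs" = zero-shot, "tt" = translate-train-all):

```python
# XNLI average accuracies from the table above.
xnli = {
    "XLM-R Base":  {"zs": 76.2, "tt": 79.1},
    "XLM-R Large": {"zs": 80.9, "tt": 83.6},
    "XLM-R XL":    {"zs": 82.3, "tt": 85.4},
}
print(round(xnli["XLM-R XL"]["zs"] - xnli["XLM-R Base"]["zs"], 1))   # 6.1
print(round(xnli["XLM-R XL"]["tt"] - xnli["XLM-R Large"]["tt"], 1))  # 1.8
```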

English GLUE Benchmark

Model                  #Langs   Avg Accuracy
RoBERTa-Large (mono)   1        92.9
XLM-R Large            100      91.9
XLM-R XL               100      93.0
XLM-R XXL              100      93.2

XLM-R XL outperforms RoBERTa-Large by +0.1 point on average across English GLUE tasks while covering 99 additional languages; XLM-R XXL widens the margin to +0.3 points.

Resource-Specific Gains

Improvements from Base→XL→XXL are especially pronounced for low-resource languages (e.g., Swahili, Urdu, Hindi) across XNLI, MLQA, and XQuAD, compared to high-resource languages.

5. Scaling Behavior and Observations

  • Monotonic Improvements: Every benchmark shows monotonic performance gains as model size increases from Base to XXL.
  • Low-Resource Enhancement: Increased capacity allows allocation of more parameters for modeling scarce features unique to low-resource languages, enhancing zero-shot transfer.
  • Saturation Consideration: The XL→XXL gain for XLM-R (0.6–0.7 percentage points) is smaller than for mT5, suggesting possible data saturation on CC100; more data or generative pretraining may further improve scaling.

A plausible implication is that scaling model capacity yields diminishing returns when corpus size is fixed, indicating data quantity or task diversity as next scaling frontiers.

6. Practical Aspects and Availability

  • Code & Models: Publicly available in Fairseq’s examples/xlmr repository.
  • Compute: Training leveraged Megatron-style tensor parallelism (degree 2 for XL, 8 for XXL), with wall-clock durations of a few weeks per model.
  • Fine-tuning Protocols: Standard procedures using small batch sizes, 10-epoch limits, and early stopping.

7. Relation to Prior Work

  • XLM-R: Builds directly upon XLM-R (Conneau et al., 2019) for architectural and training protocols.
  • CC100: Follows data creation and quality control from Wenzek et al. (2019).
  • Megatron-LM: Utilizes tensor parallelism as described by Shoeybi et al.
  • mT5, RoBERTa: Benchmarks and scaling observations are contextualized using results from Xue et al. (mT5) and Liu et al. (RoBERTa).

XLM-R XL demonstrates that expanding multilingual model capacity coupled with curated large-scale corpora enables improved cross-lingual and monolingual transfer, with amplified benefits for low-resource languages. The consistent scaling gains motivate exploration of further increases in both parameter count and training data diversity (Goyal et al., 2021).