GemmaX2-28-9B: Multilingual MT LLM
- GemmaX2-28-9B is a multilingual large language model designed for high-performance machine translation across 28 languages using a 9B-parameter decoder-only Transformer.
- It employs a novel Parallel-First Monolingual-Second (PFMS) data mixing strategy that allocates a fixed per-language token budget to maximize translation quality.
- Empirical evaluations reveal that GemmaX2-28-9B achieves competitive spBLEU and COMET scores, rivaling closed systems like Google Translate and GPT-4-turbo.
GemmaX2-28-9B is a multilingual LLM designed for high-performance machine translation (MT) across 28 languages. Developed via continual pretraining and instruction finetuning of the Gemma2-9B backbone, GemmaX2-28-9B establishes a new performance frontier for open-source LLMs under 10 billion parameters, rivaling closed-source systems such as Google Translate and GPT-4-turbo. Its principal innovation is the Parallel-First Monolingual-Second (PFMS) data mixing strategy, which systematically optimizes language-resource utilization—integrating both parallel and monolingual corpora according to a fixed per-language budget—to maximize translation quality across diverse language-resource tiers (Cui et al., 4 Feb 2025).
1. Model Architecture
GemmaX2-28-9B is based on a standard decoder-only Transformer architecture closely aligned with other 9B-parameter LLMs. The key architectural parameters, inherited from the Gemma2-9B backbone, are:
- Number of Transformer layers: 42
- Hidden dimension: 3584
- Attention heads: 16 (per-head dimension 256)
- Feed-forward hidden size: 14336
- Vocabulary size: 256,000 subword tokens
- Total parameter count: ≈9.2B
Each layer incorporates a multi-head self-attention module followed by a two-layer MLP, with layer normalization and pre-activation residual connections. The self-attention for a given head is defined as

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,

with all heads concatenated and projected via an output matrix W^O. This canonical Transformer design supports scalable token-level modeling for multilingual machine translation (Cui et al., 4 Feb 2025).
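The scaled dot-product attention for a single head can be sketched in a few lines of NumPy (a minimal toy illustration, not the actual Gemma2 implementation; the dimensions are made-up values):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

# Toy example: 4 tokens, per-head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention_head(Q, K, V)
print(out.shape)  # (4, 8)
```

In the full model, the outputs of all heads are concatenated along the feature axis and projected by W^O.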
2. Parallel-First Monolingual-Second (PFMS) Data Mixing Strategy
GemmaX2-28-9B introduces the PFMS data mixing protocol for continual pretraining. For each of the 28 supported languages, a per-language token budget is allocated, prioritized as follows:
- Let P_ℓ denote the available (English-/Chinese-centric) parallel tokens for language ℓ, and let B denote the fixed per-language token budget.
- Assign min(P_ℓ, B) parallel tokens.
- Assign max(0, B − P_ℓ) supplementary monolingual tokens.
This ensures maximal exploitation of parallel data, with monolingual data filling any deficit. During pretraining, sentences from both sources are randomly interleaved at the token level. The learning objective remains next-token cross-entropy:

L(θ) = −Σ_t log p_θ(x_t | x_{<t})
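A minimal sketch of this parallel-first allocation rule (the 2B-token budget follows the resource-allocation description in Section 4; function and variable names are illustrative):

```python
def pfms_allocation(parallel_tokens: int, budget: int = 2_000_000_000):
    """Parallel-First Monolingual-Second: spend the per-language budget on
    parallel tokens first, then top up with monolingual tokens."""
    parallel = min(parallel_tokens, budget)
    monolingual = max(0, budget - parallel_tokens)
    return parallel, monolingual

# A high-resource language with 3B parallel tokens uses parallel data only;
# a low-resource language with 0.5B parallel tokens is topped up.
print(pfms_allocation(3_000_000_000))  # (2000000000, 0)
print(pfms_allocation(500_000_000))    # (500000000, 1500000000)
```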
The PFMS strategy is empirically shown to yield superior COMET and BLEU scores by balancing the trade-off between high-resource and low/mid-resource languages (Cui et al., 4 Feb 2025).
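The next-token cross-entropy objective used during pretraining can be illustrated numerically (a toy sketch with a made-up 5-token vocabulary):

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    # logits: (T, V) unnormalized scores; targets: (T,) true next-token ids.
    # Loss = -(1/T) * sum_t log p(x_t | x_<t), via a stable log-softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0, 0.0]])
targets = np.array([0, 1])
loss = next_token_cross_entropy(logits, targets)
print(round(float(loss), 3))  # 0.307
```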
3. Pretraining and Instruction Finetuning
GemmaX2-28-9B training comprises two stages:
Continual Pretraining
- Hardware: 32× NVIDIA H800 GPUs; one epoch over PFMS-mixed data
- Effective batch size: 1.57M tokens (with gradient accumulation)
- Sequence length: 2048 tokens
- Optimizer: AdamW (weight decay 0.01, max gradient norm 1.0)
- Learning-rate schedule: cosine decay with 1% linear warmup
- Precision: bf16; ZeRO 2 optimizer
- Epochs: single pass over aggregated PFMS data
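The cosine schedule with 1% linear warmup can be sketched as a generic schedule function (the paper's base learning rate is not restated here, so `base_lr` is a placeholder):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_frac=0.01):
    """Linear warmup for the first warmup_frac of training,
    then cosine decay from base_lr to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 10_000
print(lr_at(0, total, 1.0))      # 0.0 at the start
print(lr_at(100, total, 1.0))    # 1.0 at the end of warmup
print(lr_at(total, total, 1.0))  # decays to ~0.0 at the end
```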
Instruction Finetuning
- Hardware: 8× NVIDIA H800 GPUs; one epoch over 196K curated translation pairs
- Effective batch: 32 sequences
- Learning-rate schedule: inverse-square-root decay with 1% warmup
- Input prompt format: "Translate this from [source language] to [target language]: [source sentence] [target sentence]"

Both phases use the standard language-modeling loss and optimization protocols, ensuring stability and convergence (Cui et al., 4 Feb 2025).
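The instruction-finetuning prompt can be assembled with a small helper (a sketch that mirrors the template string above; during training the target sentence is appended, while at inference it is generated by the model):

```python
def build_prompt(src_lang: str, tgt_lang: str, src_sentence: str) -> str:
    # Fills the translation instruction template; the target sentence is
    # appended after this prompt during finetuning.
    return f"Translate this from {src_lang} to {tgt_lang}: {src_sentence}"

print(build_prompt("English", "German", "Hello, world!"))
# Translate this from English to German: Hello, world!
```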
4. Training Data Composition and Resource Allocation
The PFMS regime draws from extensive corpora:
- Monolingual: CulturaX (6.3T tokens, 167 languages) and MADLAD-400 (3T tokens, 419 languages)
- Parallel: OPUS collection (as of August 2024), yielding English- and Chinese-centric sentence pairs, rigorously deduplicated, filtered, and LangID-checked, culminating in 3.4B pairs across 28 languages
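The deduplication step for parallel pairs can be sketched with hash-based exact-pair dedup (schematic only; the paper's actual pipeline also applies quality filtering and language identification, which would need an external LangID model and is not shown here):

```python
import hashlib

def dedup_pairs(pairs):
    """Remove exact duplicate sentence pairs after light normalization."""
    seen, unique = set(), []
    for src, tgt in pairs:
        # Hash the normalized pair so the seen-set stays compact.
        key = hashlib.sha1(f"{src.strip()}\t{tgt.strip()}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [("Hello.", "Hallo."), ("Hello.", "Hallo."), ("Good day.", "Guten Tag.")]
print(len(dedup_pairs(pairs)))  # 2
```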
Resource tiers (per Joshi et al.) are:
- High-resource (18 languages)
- Mid-resource (7 languages)
- Low-resource (3 languages)
PFMS ensures that, for each language, parallel data (up to 2B tokens) is prioritized and topped up with monolingual data as needed. This approach enables balanced model specialization and robust many-to-many transfer, particularly in low- and mid-resource settings (Cui et al., 4 Feb 2025).
5. Empirical Evaluation and Comparative Performance
Evaluation is conducted on WMT-24 (XCOMET, COMETKiwi) and FLORES-200 (spBLEU, COMET):
- Gemma2-9B (pre-continuation):
- WMT-24 (en→xx): 72.06 XCOMET / 67.05 COMETKiwi
- FLORES-200 aggregate: trails NLLB-54.5B by ≈4 spBLEU
- GemmaX2-28-9B (final):
- WMT-24 (en→xx): 79.37 XCOMET / 74.41 COMETKiwi
- FLORES-200:
- en→xx: 39.72 spBLEU / 88.35 COMET
- xx→en: 45.07 spBLEU / 88.95 COMET
- zh→xx: 27.48 spBLEU / 85.69 COMET
- xx→zh: 33.74 spBLEU / 87.38 COMET
GemmaX2-28-9B outperforms state-of-the-art open models (e.g., TowerInstruct-7/13B, X-ALMA, Aya-101, LLaMAX3-8B) by 2–5 BLEU on many directions and closely matches Google Translate and GPT-4-turbo (e.g., on WMT-24 en→xx, XCOMET/COMETKiwi of 77.64/73.00 for Google Translate and 79.35/75.40 for GPT-4-turbo) (Cui et al., 4 Feb 2025).
6. Data Mixing Trade-Offs and Scaling
Performance analysis of alternative data mixing regimens reveals:
- Monolingual-only continual pretraining increases low/mid-resource translation but degrades high-resource (xx→en) performance.
- Parallel-only pretraining yields strong high-resource results but hinders low/mid-resource outcomes.
- Intermediate monolingual-to-parallel mixture ratios exhibit language-dependent performance peaks.
- PFMS is found to balance all axes: achieving best overall COMET on low/mid-resource pairs and near-optimal on high-resource.
No explicit scaling law is provided, but model scaling observations show that GemmaX2-28-2B (2B params) already outperforms several baselines, with performance scaling favorably up to 9B parameters (Cui et al., 4 Feb 2025).
7. Insights, Limitations, and Future Directions
Key empirical insights include:
- Careful blending of parallel and monolingual data during continual pretraining enables open LLMs under 10B parameters to close the performance gap with leading closed systems.
- The “Parallel-First” protocol yields consistent improvements across language-resource tiers.
Limitations:
- Experiments are restricted to model sizes below 10B parameters; scaling to larger architectures remains unexplored.
- The study focuses on English-centric and Chinese-centric directions; many-to-many (non-pivot) directions and broader language coverage warrant further investigation.
Prospective research avenues include:
- Applying PFMS to larger-scale models and more languages.
- Incorporating domain- or dialect-specific corpora for specialized translation.
- Developing adaptive, per-language data mixture schedules.
- Extending to document-level, multi-sentence translation tasks (Cui et al., 4 Feb 2025).