GemmaX2-28-9B: Multilingual MT LLM
- GemmaX2-28-9B is a multilingual large language model designed for high-performance machine translation across 28 languages using a 9B-parameter decoder-only Transformer.
- It employs a novel Parallel-First Monolingual-Second (PFMS) data mixing strategy that allocates a fixed per-language token budget to maximize translation quality.
- Empirical evaluations reveal that GemmaX2-28-9B achieves competitive spBLEU and COMET scores, rivaling closed systems like Google Translate and GPT-4-turbo.
GemmaX2-28-9B is a multilingual LLM designed for high-performance machine translation (MT) across 28 languages. Developed via continual pretraining and instruction finetuning of the Gemma2-9B backbone, GemmaX2-28-9B establishes a new performance frontier for open-source LLMs under 10 billion parameters, rivaling closed-source systems such as Google Translate and GPT-4-turbo. Its principal innovation is the Parallel-First Monolingual-Second (PFMS) data mixing strategy, which systematically optimizes language-resource utilization—integrating both parallel and monolingual corpora according to a fixed per-language budget—to maximize translation quality across diverse language-resource tiers (Cui et al., 4 Feb 2025).
1. Model Architecture
GemmaX2-28-9B is based on a standard decoder-only Transformer architecture closely aligned with other 9B-parameter LLMs. The key architectural parameters, inherited from the Gemma2-9B backbone, are:
- Number of Transformer layers: 42
- Hidden dimension: 3584
- Attention heads: 16 (per-head dimension 256)
- Feed-forward hidden size: 14336
- Vocabulary size: 256,000 subword tokens
- Total parameter count: ≈9.2B
Each layer incorporates a multi-head self-attention module followed by a two-layer MLP, with layer normalization and pre-activation residual connections. The self-attention for a given head is defined as

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V,

with all heads concatenated and projected via an output matrix W^O. This canonical Transformer design supports scalable token-level modeling for multilingual machine translation (Cui et al., 4 Feb 2025).
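The scaled dot-product attention for a single head can be sketched in a few lines of NumPy (a minimal toy illustration, not the actual Gemma2 implementation; the dimensions are made-up values):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

# Toy example: 4 tokens, per-head dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention_head(Q, K, V)
print(out.shape)  # (4, 8)
```

In the full model, the outputs of all heads are concatenated along the feature axis and projected by W^O.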
2. Parallel-First Monolingual-Second (PFMS) Data Mixing Strategy
GemmaX2-28-9B introduces the PFMS data mixing protocol for continual pretraining. For each of the 28 supported languages, a per-language token budget is allocated, prioritized as follows:
- Let P_ℓ denote the available (English-/Chinese-centric) parallel tokens for language ℓ, and let B denote the fixed per-language token budget.
- Assign min(P_ℓ, B) parallel tokens.
- Assign max(0, B − P_ℓ) supplementary monolingual tokens.
This ensures maximal exploitation of parallel data, with monolingual data filling any deficit. During pretraining, sentences from both sources are randomly interleaved at the token level. The learning objective remains next-token cross-entropy:

L(θ) = −Σ_t log p_θ(x_t | x_{<t})
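A minimal sketch of this parallel-first allocation rule (the 2B-token budget follows the resource-allocation description in Section 4; function and variable names are illustrative):

```python
def pfms_allocation(parallel_tokens: int, budget: int = 2_000_000_000):
    """Parallel-First Monolingual-Second: spend the per-language budget on
    parallel tokens first, then top up with monolingual tokens."""
    parallel = min(parallel_tokens, budget)
    monolingual = max(0, budget - parallel_tokens)
    return parallel, monolingual

# A high-resource language with 3B parallel tokens uses parallel data only;
# a low-resource language with 0.5B parallel tokens is topped up.
print(pfms_allocation(3_000_000_000))  # (2000000000, 0)
print(pfms_allocation(500_000_000))    # (500000000, 1500000000)
```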
The PFMS strategy is empirically shown to yield superior COMET and BLEU scores by balancing the trade-off between high-resource and low/mid-resource languages (Cui et al., 4 Feb 2025).
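The next-token cross-entropy objective used during pretraining can be illustrated numerically (a toy sketch with a made-up 5-token vocabulary):

```python
import numpy as np

def next_token_cross_entropy(logits, targets):
    # logits: (T, V) unnormalized scores; targets: (T,) true next-token ids.
    # Loss = -(1/T) * sum_t log p(x_t | x_<t), via a stable log-softmax.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 0.0, 0.0, 0.0, 0.0],
                   [0.0, 3.0, 0.0, 0.0, 0.0]])
targets = np.array([0, 1])
loss = next_token_cross_entropy(logits, targets)
print(round(float(loss), 3))  # 0.307
```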
3. Pretraining and Instruction Finetuning
GemmaX2-28-9B training comprises two stages:
Continual Pretraining
- Hardware: 32× NVIDIA H800 GPUs; one epoch over PFMS-mixed data
- Effective batch size: 1.57M tokens (with gradient accumulation)
- Sequence length: 2048 tokens
- Optimizer: AdamW (weight decay 0.01, max gradient norm 1.0)
- Learning-rate schedule: cosine decay with 1% linear warmup
- Precision: bf16; ZeRO 2 optimizer
- Epochs: single pass over aggregated PFMS data
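The cosine schedule with 1% linear warmup can be sketched as a generic schedule function (the paper's base learning rate is not restated here, so `base_lr` is a placeholder):

```python
import math

def lr_at(step, total_steps, base_lr, warmup_frac=0.01):
    """Linear warmup for the first warmup_frac of training,
    then cosine decay from base_lr to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 10_000
print(lr_at(0, total, 1.0))      # 0.0 at the start
print(lr_at(100, total, 1.0))    # 1.0 at the end of warmup
print(lr_at(total, total, 1.0))  # decays to ~0.0 at the end
```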
Instruction Finetuning
- Hardware: 8× NVIDIA H800 GPUs; one epoch over 196K curated translation pairs
- Effective batch: 32 sequences
- Learning-rate schedule: inverse-square-root decay with 1% warmup
- Input prompt format: "Translate this from [source language] to [target language]: [source sentence] [target sentence]"

Both phases use the standard language-modeling loss and optimization protocols, ensuring stability and convergence (Cui et al., 4 Feb 2025).
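The instruction-finetuning prompt can be assembled with a small helper (a sketch that mirrors the template string above; during training the target sentence is appended, while at inference it is generated by the model):

```python
def build_prompt(src_lang: str, tgt_lang: str, src_sentence: str) -> str:
    # Fills the translation instruction template; the target sentence is
    # appended after this prompt during finetuning.
    return f"Translate this from {src_lang} to {tgt_lang}: {src_sentence}"

print(build_prompt("English", "German", "Hello, world!"))
# Translate this from English to German: Hello, world!
```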
4. Training Data Composition and Resource Allocation
The PFMS regime draws from extensive corpora:
- Monolingual: CulturaX (6.3T tokens, 167 languages) and MADLAD-400 (3T tokens, 419 languages)
- Parallel: OPUS collection (as of August 2024), yielding English- and Chinese-centric sentence pairs, rigorously deduplicated, filtered, and LangID-checked, culminating in 3.4B pairs across 28 languages
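The deduplication step for parallel pairs can be sketched with hash-based exact-pair dedup (schematic only; the paper's actual pipeline also applies quality filtering and language identification, which would need an external LangID model and is not shown here):

```python
import hashlib

def dedup_pairs(pairs):
    """Remove exact duplicate sentence pairs after light normalization."""
    seen, unique = set(), []
    for src, tgt in pairs:
        # Hash the normalized pair so the seen-set stays compact.
        key = hashlib.sha1(f"{src.strip()}\t{tgt.strip()}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))
    return unique

pairs = [("Hello.", "Hallo."), ("Hello.", "Hallo."), ("Good day.", "Guten Tag.")]
print(len(dedup_pairs(pairs)))  # 2
```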
Resource tiers (per Joshi et al.) are:
- High-resource (18 languages)
- Mid-resource (7 languages)
- Low-resource (3 languages)
PFMS ensures that, for each language, parallel data (up to 2B tokens) is prioritized and topped up with monolingual data as needed. This approach enables balanced model specialization and robust many-to-many transfer, particularly in low- and mid-resource settings (Cui et al., 4 Feb 2025).
5. Empirical Evaluation and Comparative Performance
Evaluation is conducted on WMT-24 (XCOMET, COMETKiwi) and FLORES-200 (spBLEU, COMET):
- Gemma2-9B (pre-continuation):
- WMT-24 (en→xx): 72.06 XCOMET / 67.05 COMETKiwi
- FLORES-200 aggregate: trails NLLB-54.5B by ≈4 spBLEU
- GemmaX2-28-9B (final):
- WMT-24 (en→xx): 79.37 XCOMET / 74.41 COMETKiwi
- FLORES-200:
- en→xx: 39.72 spBLEU / 88.35 COMET
- xx→en: 45.07 spBLEU / 88.95 COMET
- zh→xx: 27.48 spBLEU / 85.69 COMET
- xx→zh: 33.74 spBLEU / 87.38 COMET
GemmaX2-28-9B outperforms state-of-the-art open models (e.g., TowerInstruct-7/13B, X-ALMA, Aya-101, LLaMAX3-8B) by 2–5 BLEU on many directions and closely matches Google Translate and GPT-4-turbo (e.g., on WMT-24 en→xx, XCOMET/COMETKiwi of 77.64/73.00 for Google Translate and 79.35/75.40 for GPT-4-turbo) (Cui et al., 4 Feb 2025).
6. Data Mixing Trade-Offs and Scaling
Performance analysis of alternative data mixing regimens reveals:
- Monolingual-only continual pretraining increases low/mid-resource translation but degrades high-resource (xx→en) performance.
- Parallel-only pretraining yields strong high-resource results but hinders low/mid-resource outcomes.
- Intermediate monolingual-to-parallel mixture ratios exhibit language-dependent performance peaks.
- PFMS is found to balance all axes: achieving best overall COMET on low/mid-resource pairs and near-optimal on high-resource.
No explicit scaling law is provided, but model scaling observations show that GemmaX2-28-2B (2B params) already outperforms several baselines, with performance scaling favorably up to 9B parameters (Cui et al., 4 Feb 2025).
7. Insights, Limitations, and Future Directions
Key empirical insights include:
- Careful blending of parallel and monolingual data during continual pretraining enables open LLMs under 10B parameters to close the performance gap with leading closed systems.
- The “Parallel-First” protocol yields consistent improvements across language-resource tiers.
Limitations:
- Experiments are restricted to model sizes below 10B parameters; scaling to larger architectures remains unexplored.
- The study focuses on English-centric and Chinese-centric directions; many-to-many (non-pivot) directions and broader language coverage warrant further investigation.
Prospective research avenues include:
- Applying PFMS to larger-scale models and more languages.
- Incorporating domain- or dialect-specific corpora for specialized translation.
- Developing adaptive, per-language data mixture schedules.
- Extending to document-level, multi-sentence translation tasks (Cui et al., 4 Feb 2025).