
IberianLLM-7B-Instruct: Iberian Translation LLM

Updated 18 December 2025
  • IberianLLM-7B-Instruct is a specialized 7B-parameter, instruction-tuned language model designed for high-fidelity translation and reasoning across major Iberian languages.
  • It utilizes a robust Mistral-7B Transformer with 32 layers and a dedicated three-stage Seed-X pre-training process that rigorously filters data for quality and linguistic diversity.
  • Supervised fine-tuning on human-curated instructions and reinforcement learning via PPO further enhance its performance, yielding strong BLEU, COMET, and BLEURT scores on benchmark tests.

IberianLLM-7B-Instruct is a 7-billion-parameter translation-oriented instruction-following LLM, architected for high-fidelity machine translation and reasoning in the principal Iberian languages: Spanish, Portuguese, Catalan, and Galician. Developed following the Seed-X family blueprint, it emphasizes both linguistic specialization and methodological rigor, integrating contemporary LLM engineering with distinct data and tuning strategies tailored to the linguistic and dialectal features of the Iberian domain (Cheng et al., 18 Jul 2025).

1. Model Architecture

IberianLLM-7B-Instruct is built upon the Mistral-7B decoder-only Transformer backbone, inheriting its configuration with 32 layers, a hidden size (d_model) of 4096, a feed-forward network dimension (d_ff) of 14,336, and 32 attention heads. The vocabulary is defined by a byte-pair encoding (BPE) scheme with 65,269 tokens. Rotary positional embeddings (RoPE) are used, and the maximum sequence length is set to 2048.

Component             Value
Transformer layers    32
Hidden size           4,096
FFN size              14,336
Attention heads       32
Vocabulary (BPE)      65,269
Positional encoding   Rotary (RoPE)
Max sequence length   2,048

This configuration yields approximately 7 billion parameters. The selection of these architectural specifications is consistent with recent empirical findings favoring deep, high-capacity models trained with efficient data and hardware utilization (Cheng et al., 18 Jul 2025).
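
For concreteness, the table above can be expressed as a Hugging Face MistralConfig, as in the sketch below. This is illustrative rather than a released configuration; in particular, the number of key-value heads (grouped-query attention) is an assumption not reported here.

```python
# Illustrative sketch: the reported architecture as a transformers MistralConfig.
# num_key_value_heads is an assumption (standard Mistral-7B GQA setting); the
# article reports only the 32 attention heads.
from transformers import MistralConfig

config = MistralConfig(
    vocab_size=65_269,             # BPE vocabulary
    hidden_size=4096,              # d_model
    intermediate_size=14_336,      # FFN size
    num_hidden_layers=32,          # Transformer layers
    num_attention_heads=32,
    num_key_value_heads=8,         # assumption, not stated in the article
    max_position_embeddings=2048,  # max sequence length
)
print(config)  # MistralForCausalLM(config) would instantiate the ~7B backbone
```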

2. Pre-training Data and Schedule

The pre-training protocol adopts the three-stage progression delineated in the Seed-X methodology, systematically optimizing the monolingual and bilingual data mix to prioritize Iberian language coverage and translation quality; the schedule is sketched in code after the list below.

  • Stage 1 (S1: Core Iberian): 70% monolingual (primarily Spanish and Portuguese, with Catalan and Galician secondary), 30% English and other high-resource languages. Monolingual allocation: Spanish (30%), Portuguese (20%), Catalan (15%), Galician (5%), others (30%).
  • Stage 2 (S2: Multilingual Iberian): 50% monolingual (all four Iberian languages), 50% bilingual, with a focus on key language pairs: es↔pt, es↔ca, es↔gl, pt↔ca, pt↔gl.
  • Stage 3 (S3: Parallel-Only Iberian): Exclusively high-quality parallel data for Iberian pairs.
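
The percentages above might be captured as a small set of sampling weights, for example as follows; the grouping and key names are assumptions, while the fractions follow the article.

```python
# Illustrative only: the three-stage data mixture as sampling weights.
STAGE_MIX = {
    "S1_core_iberian":         {"monolingual": 0.70, "en_and_other_high_resource": 0.30, "bilingual": 0.00},
    "S2_multilingual_iberian": {"monolingual": 0.50, "bilingual": 0.50, "en_and_other_high_resource": 0.00},
    "S3_parallel_only":        {"bilingual": 1.00, "monolingual": 0.00, "en_and_other_high_resource": 0.00},
}

# Stage-1 allocation within the monolingual share
S1_MONOLINGUAL = {"es": 0.30, "pt": 0.20, "ca": 0.15, "gl": 0.05, "other": 0.30}

# Stage-2 bilingual directions
S2_PAIRS = ["es-pt", "es-ca", "es-gl", "pt-ca", "pt-gl"]
```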

For both monolingual and parallel corpora, document tiering is enforced using a learned classifier rating each instance as high, medium, or low quality. High-tier data are directly retained, medium-tier data are paraphrased using an LLM, and low-tier data are discarded. Parallel data inclusion is further restricted to those with word-alignment confidence exceeding 0.9 and language identification confidence above 0.99. Iterative re-writing via back-translation and paraphrasing is deployed to maximize data quality and diversity (Cheng et al., 18 Jul 2025).
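
A minimal sketch of this tier-and-filter pipeline is given below; the quality classifier, paraphrasing LLM, and confidence inputs are placeholders standing in for components the article does not specify.

```python
# Hedged sketch of the quality tiering and confidence filtering described above.
# The classifier and paraphrase model are placeholders, not the authors' components.
ALIGN_THRESHOLD = 0.9    # minimum word-alignment confidence for parallel pairs
LANGID_THRESHOLD = 0.99  # minimum language-identification confidence

def filter_monolingual(doc, quality_classifier, paraphrase_llm):
    """Keep high-tier documents, paraphrase medium-tier ones, drop low-tier ones."""
    tier = quality_classifier(doc)   # -> "high" | "medium" | "low"
    if tier == "high":
        return doc
    if tier == "medium":
        return paraphrase_llm(doc)   # LLM rewrite to lift quality
    return None                      # low-tier documents are discarded

def filter_parallel(pair, align_confidence, langid_confidences):
    """Retain a sentence pair only when alignment and language ID are both confident."""
    if align_confidence <= ALIGN_THRESHOLD:
        return None
    if min(langid_confidences) <= LANGID_THRESHOLD:
        return None
    return pair
```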

3. Objectives, Instruction Tuning, and Reinforcement Learning

Pre-training Objective

The model is optimized with next-token cross-entropy across concatenated monolingual and bilingual data:

\mathcal{L}_{\mathrm{ML}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(w_t \mid w_{<t})

Key optimization hyperparameters include a warm-up of 2k steps to a peak learning rate of 3×10⁻⁴, cosine decay to 0.1× the peak, and 2M tokens per batch.
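
The stated warm-up and cosine schedule can be written out as a small helper, as sketched below; the total number of pre-training steps is an assumption, since it is not reported.

```python
import math

# Sketch of the reported schedule: linear warm-up over 2k steps to a 3e-4 peak,
# then cosine decay down to 0.1x the peak learning rate.
PEAK_LR = 3e-4
WARMUP_STEPS = 2_000
MIN_FRACTION = 0.1  # floor as a fraction of the peak

def learning_rate(step: int, total_steps: int) -> float:
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return PEAK_LR * (MIN_FRACTION + (1.0 - MIN_FRACTION) * cosine)

# e.g. learning_rate(2_000, 100_000) == 3e-4; learning_rate(100_000, 100_000) == 3e-5
```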

Supervised Instruction Fine-Tuning (SFT)

Instruction tuning leverages approximately 200k human-curated Iberian translation instructions, integrating FLORES-like development sets and domain-tagged sentence pairs. Templates range from standard prompts (“Translate from <src> to <trg>: …”) to chain-of-thought (CoT) formats (“First translate the text, then explain your choices step by step”), eliciting both direct translations and stepwise linguistic rationales.
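
The two template families might be realized roughly as follows; beyond the quoted fragments, the exact wording is an assumption.

```python
# Illustrative prompt construction for the standard and chain-of-thought templates.
def direct_prompt(src_lang: str, trg_lang: str, text: str) -> str:
    return f"Translate from {src_lang} to {trg_lang}: {text}"

def cot_prompt(src_lang: str, trg_lang: str, text: str) -> str:
    return (
        f"First translate the following {src_lang} text into {trg_lang}, "
        f"then explain your choices step by step.\n\n{text}"
    )

print(direct_prompt("Spanish", "Galician", "El ayuntamiento aprobó el presupuesto."))
```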

Parameter       Value
Batch size      64 sentences
Learning rate   3×10⁻⁶
Warm-up steps   1,000
Total steps     30,000
Optimizer       AdamW (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸)

Reinforcement Learning via PPO

A Proximal Policy Optimization (PPO) stage further enhances translation output quality. Rewards are composite: a human preference model trained on ≈20k annotated Iberian translation comparisons, and a dual-based reward (DuPO) leveraging back-translation and similarity scoring (e.g., COMET). PPO’s objective integrates a KL-regularization penalty as follows:

\mathcal{L}_{\mathrm{PPO}}(\theta) = -\mathbb{E}_{a\sim\pi_\theta}\left[ r(s,a)\,\log \pi_\theta(a \mid s) \right] + \beta\,\mathrm{KL}\left[ \pi_{\mathrm{ref}} \,\|\, \pi_\theta \right]

Key PPO configuration: KL penalty coefficient β = 0.1, learning rate 1×10⁻⁶, batch size 512, 4 epochs per batch, 16 rollouts per query.
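
A hedged sketch of how such a composite reward could be assembled is given below; the preference model, back-translation step, similarity metric, and the equal blend weight are all assumptions standing in for components the article does not detail.

```python
# Sketch of a composite reward: preference-model score blended with a
# duality-style (DuPO-like) reward from back-translation plus similarity scoring.
def composite_reward(src, hyp, preference_model, back_translate, similarity, w_pref=0.5):
    r_pref = preference_model(src, hyp)       # learned human-preference score
    reconstruction = back_translate(hyp)      # translate the hypothesis back to the source language
    r_dual = similarity(src, reconstruction)  # e.g. a COMET-style similarity score
    return w_pref * r_pref + (1.0 - w_pref) * r_dual  # blend weight is an assumption

# PPO settings reported in the article
PPO_CONFIG = {
    "kl_coef": 0.1,
    "learning_rate": 1e-6,
    "batch_size": 512,
    "epochs_per_batch": 4,
    "rollouts_per_query": 16,
}
```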

4. Evaluation Protocols and Results

Evaluation encompasses both automatic and human assessments:

  • Automatic Metrics: BLEU, COMET, BLEURT
  • Benchmarks: FLORES-200 (EN↔ES, ES↔PT, ES↔CA, ES↔GL), WMT 2023 test sets for Spanish, Portuguese, and Catalan.
  • Human Evaluation: 0–4 scale targeting accuracy, fluency, and idiomaticity, with annotation by native speakers in each target language.

Automatic metrics:

Model              BLEU   COMET   BLEURT
SFT only           43.2   89.1    72.3
+ PPO (reward RL)  45.0   90.5    74.0

Human ratings (0–4 scale):

Direction   SFT only   + PPO
EN→ES       3.60       3.78
ES→PT       3.48       3.65
ES→CA       3.20       3.45
ES→GL       3.05       3.30

These results indicate that PPO yields gains across both automated and human metrics, with particular benefit in less-resourced pairs such as ES→GL and ES→CA (Cheng et al., 18 Jul 2025).
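
For reference, the BLEU component of this evaluation can be computed with sacrebleu as in the sketch below; COMET and BLEURT would come from their respective libraries, and the example sentences are placeholders.

```python
# Minimal BLEU computation with sacrebleu; COMET/BLEURT are scored analogously
# with their own toolkits. Sentences here are placeholders, not benchmark data.
import sacrebleu

hypotheses = ["O concello aprobou o orzamento."]    # system outputs
references = [["O concello aprobou o orzamento."]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```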

5. Data Strategy and Specialization for Iberian Languages

Data sourcing employs CommonCrawl, CCNet, and Wikipedia for Spanish/Portuguese, and leverages OPUS corpora (notably OpenSubtitles), Catalan/Galician Wikipedia, and regional news archives for Catalan and Galician. Domain balance is maintained between news, government/legal, medical, tourism, e-commerce, social media, and literature. To mitigate token underrepresentation, Catalan and Galician samples are oversampled by a factor of 2–4, targeting at least 5% token share each.
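
The oversampling target can be made concrete with a short calculation: given raw per-language token counts (the figures in the sketch below are placeholders, not the corpus statistics), the factor needed to lift a language to the 5% token share, capped at the reported 4× maximum, follows directly.

```python
# Illustrative oversampling calculation; raw token counts are placeholders.
RAW_TOKENS = {"es": 60e9, "pt": 45e9, "ca": 3e9, "gl": 1e9, "other": 40e9}
TARGET_SHARE = 0.05  # at least 5% token share per language
MAX_FACTOR = 4.0     # article reports oversampling by a factor of 2-4

def oversample_factor(lang: str) -> float:
    total = sum(RAW_TOKENS.values())
    tokens = RAW_TOKENS[lang]
    if tokens / total >= TARGET_SHARE:
        return 1.0
    # factor f such that f*tokens / (total - tokens + f*tokens) == TARGET_SHARE
    needed = TARGET_SHARE * (total - tokens) / (tokens * (1.0 - TARGET_SHARE))
    return min(needed, MAX_FACTOR)

print({lang: round(oversample_factor(lang), 2) for lang in ("ca", "gl")})
# e.g. {'ca': 2.56, 'gl': 4.0} with the placeholder counts above
```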

Language tags (<ES>, <PT>, <CA>, <GL>) prefix sentences in bilingual concatenations, which may facilitate explicit language conditioning. Potential pitfalls include orthographic variance (e.g., Galician “x” vs. Spanish “j”), code-switching, and dialectal mismatches (e.g., European vs. Latin American Spanish), with protocols for filtering or tagging accordingly. Data drift between large-scale pre-training (web and Wikipedia) and domain-specific fine-tuning is a recognized risk (Cheng et al., 18 Jul 2025).
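
A minimal sketch of the tag-prefixing scheme follows; the concatenation order (tagged source, then tagged target) is an assumption.

```python
# Illustrative language-tag prefixing for bilingual training examples.
TAGS = {"es": "<ES>", "pt": "<PT>", "ca": "<CA>", "gl": "<GL>"}

def tag_bilingual(src_lang: str, src_text: str, trg_lang: str, trg_text: str) -> str:
    return f"{TAGS[src_lang]} {src_text} {TAGS[trg_lang]} {trg_text}"

print(tag_bilingual("es", "Buenos días.", "gl", "Bos días."))
# -> "<ES> Buenos días. <GL> Bos días."
```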

6. Computational Considerations and Optimization Practices

Pre-training requires approximately 1,200 A100-GPU days with 2M-token batches. Instruction tuning proceeds on 8×A100 for two days, while PPO reinforcement learning occupies 16×A100 for three days. Optimization for efficiency includes mixed precision (FP16), gradient checkpointing, sharded optimizer states (ZeRO-1/2), and dynamic batch sizing by sequence length.
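
As an illustration, these efficiency techniques map naturally onto Hugging Face TrainingArguments paired with a DeepSpeed ZeRO configuration; the batch size, output directory, and config path below are placeholders rather than the reported setup.

```python
# Hedged sketch: mixed precision, gradient checkpointing, and ZeRO sharding
# expressed via transformers TrainingArguments. Values are placeholders.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="iberianllm-sft",         # placeholder
    fp16=True,                           # mixed precision
    gradient_checkpointing=True,         # trade compute for activation memory
    deepspeed="ds_zero2_config.json",    # placeholder path to a ZeRO-1/2 config
    group_by_length=True,                # rough analogue of batching by sequence length
    per_device_train_batch_size=8,       # placeholder
)
```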

An iterative protocol governs development: S1 monolingual pre-training and validation; S2 bilingual supplementation with FLORES metric monitoring; S3 parallel-only data until BLEURT stabilizes; SFT with prompt-type ablations; PPO with KL tuning; and stagewise human spot-checks with data rebalancing. Early stopping is recommended to prevent catastrophic forgetting, especially after parallel-only training (Cheng et al., 18 Jul 2025).

7. Implications and Best Practices

This methodological framework enables the reproducible construction of a robust 7B-parameter LLM specialized for Iberian translation, attaining performance on par with larger, generalist models in FLORES/WMT benchmarks and native speaker ratings. Best practices involve early and aggressive filtering, strategic oversampling of lower-resource languages, stagewise ablation, and continuous human quality control.

A plausible implication is that such specialized regional models, leveraging targeted data, structural adaptation, and stagewise reinforcement learning, can rival much larger models in their intended application domains, with greater data efficiency and adaptation potential (Cheng et al., 18 Jul 2025).
