This paper addresses the challenge of building high-quality machine translation (MT) models for language varieties that are underrepresented in general-purpose translation systems, using European Portuguese (EP) as a case study. The authors introduce "Tradutor," the first open-source translation model specifically for English to European Portuguese, and accompany it with a novel, large-scale parallel dataset called PTradutor.
The core problem is that while languages like Portuguese have many speakers, the vast majority of available data is in the dominant variety (Brazilian Portuguese, BP), leading to suboptimal performance for other varieties like European Portuguese in downstream NLP tasks. The proposed solution is to create a variety-specific MT model.
To overcome the scarcity of parallel data for European Portuguese, the authors propose a back-translation methodology to automatically generate a parallel corpus:
- Monolingual Corpus Collection: They gathered a large collection of texts specifically in European Portuguese from existing datasets like DSL-TL (Zampieri et al., 2023) and PtBrVid (Sousa et al., 2025).
- Translation: Using an off-the-shelf translation system (Google Translate), they translated the European Portuguese texts into English. The rationale is that translation into a resource-rich language (English) is generally of higher quality than translation into a low-resource variety. An experiment confirmed that translating EP and BP texts to English with Google Translate produced highly similar English outputs (87.2% identical, 96.8 BLEU), suggesting that the variety difference introduces minimal contamination at this step.
- Filtering: A rigorous filtering pipeline was applied to ensure data quality: removing boilerplate content with jusText, eliminating duplicates, cleaning invalid characters and repetitive patterns, and discarding documents whose combined source and target length exceeded 900 tokens (measured with the LLaMA-3 tokenizer), keeping examples within the standard 1024-token context window and training efficient. The filtering substantially reduced the dataset size, in particular by removing low-quality social media content.
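A minimal sketch of such a filtering pass, assuming (English, EP) document pairs and using the publicly available justext and transformers libraries; the 900-token threshold is the paper's, while the boilerplate and deduplication logic here is illustrative:

```python
import justext
from transformers import AutoTokenizer

# Tokenizer used only to measure combined source/target length (900-token cap).
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def strip_boilerplate(html: str) -> str:
    """Keep only paragraphs that jusText does not flag as boilerplate."""
    paragraphs = justext.justext(html, justext.get_stoplist("Portuguese"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

def filter_pairs(pairs):
    """pairs: iterable of (english, european_portuguese) documents."""
    seen = set()
    for en, pt in pairs:
        en, pt = en.strip(), pt.strip()
        if not en or not pt or (en, pt) in seen:  # drop empties and exact duplicates
            continue
        seen.add((en, pt))
        n_tokens = len(tok(en)["input_ids"]) + len(tok(pt)["input_ids"])
        if n_tokens > 900:  # keep examples inside the 1024-token context window
            continue
        yield en, pt
```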
This process resulted in the PTradutor corpus, a parallel English-European Portuguese dataset comprising over 1.7 million documents and 293 million Portuguese tokens, which is publicly available on Hugging Face.
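Since the corpus is hosted on the Hugging Face Hub, it should be loadable with the datasets library; the repository ID and field layout below are placeholders, not the published ones:

```python
from datasets import load_dataset

# Placeholder repository ID; substitute the actual PTradutor name on the Hub.
ptradutor = load_dataset("ORG/PTradutor", split="train")
print(ptradutor[0])  # one English document paired with its European Portuguese source
```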
For building the translation model, the authors fine-tuned pre-trained instruction-following language models on the PTradutor corpus. They frame translation as a causal language modeling problem, prompting the model to generate the European Portuguese translation given the English text. They experimented with:
- Models: Gemma-2 (2B parameters), Phi-3-mini (3.8B parameters), and LLaMA-3 (8B parameters), all using their instruction-tuned variants.
- Training Approaches: full fine-tuning of all model parameters and parameter-efficient fine-tuning with LoRA adapters.
Implementation details include training on A100 GPUs using the torchtune and transformers libraries, with a learning rate of 2e-5, weight decay of 0.1, and batch sizes between 256 and 512 depending on the model and approach. Early stopping was applied based on validation performance.
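As a rough illustration of the causal-LM framing, the sketch below formats one training pair with an instruction-style prompt and masks the prompt tokens out of the loss; the prompt wording and the loss-masking choice are assumptions, not the paper's exact recipe:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_example(english: str, portuguese: str, max_len: int = 1024):
    """Format one (EN, EP) pair for causal-LM fine-tuning."""
    # Hypothetical prompt template; the paper's exact wording may differ.
    prompt = f"Translate the following text to European Portuguese:\n{english}\n"
    prompt_ids = tok(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tok(portuguese + tok.eos_token, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + target_ids)[:max_len]
    # Supervise only the translation: prompt positions get the ignore index -100.
    labels = ([-100] * len(prompt_ids) + target_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```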
Evaluation was conducted on two European Portuguese benchmarks: FRMT (Riley et al., 2023) and NTREX (Federmann et al., 2022). They used a suite of metrics:
- N-gram based: BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004) to measure lexical overlap.
- Learnable: COMET (Rei et al., 2020) for a more nuanced quality assessment (the reference-based Direct Assessment variant).
- Language Variety: A custom VID score calculated with a Portuguese variety classifier (Sousa et al., 2025). This metric is the ratio of system outputs classified as European Portuguese to reference translations classified as EP, indicating how well the system adheres to the target variety.
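Under that description, VID reduces to a ratio of classifier decisions; a minimal sketch, where classify_variety is a hypothetical stand-in for the Portuguese variety classifier:

```python
def vid_score(hypotheses, references, classify_variety):
    """Ratio of system outputs labeled EP to references labeled EP.

    classify_variety(text) -> "EP" | "BP" stands in for the variety
    classifier of Sousa et al. (2025); its real interface may differ.
    """
    hyp_ep = sum(classify_variety(h) == "EP" for h in hypotheses)
    ref_ep = sum(classify_variety(r) == "EP" for r in references)
    return hyp_ep / ref_ep if ref_ep else 0.0
```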
The results show that fully fine-tuned models significantly outperform zero-shot baselines and existing open-source generic Portuguese translation systems (ArgosTranslate, Opus-MT) on standard translation metrics (BLEU, ROUGE-L, COMET) and, critically, on the VID score, demonstrating better adherence to the European Portuguese variety. The LLaMA-3 (8B) model achieved the best performance among open-source models, with results on par with Google Translate's generic Portuguese model and approaching the performance of variety-specific closed-source systems like Google and DeepL, particularly in terms of VID score.
The LoRA models achieved high VID scores, indicating that they picked up the vocabulary and nuances of EP, but they struggled with overall translation quality and sometimes entered repetitive generation loops. This suggests that, for medium-sized models, increasing adapter capacity may be necessary.
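The repetition loops are an inference-time failure mode; although the paper does not report using them, standard decoding constraints in transformers (a repetition penalty and n-gram blocking) are a common mitigation, sketched here with a placeholder checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute a fine-tuned Tradutor model.
name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate the following text to European Portuguese:\nThe meeting was postponed.\n"
inputs = tok(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=1.2,   # discourage re-emitting recent tokens
    no_repeat_ngram_size=4,   # block verbatim 4-gram repeats
)
print(tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```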
The paper demonstrates that the proposed data generation methodology and fine-tuning approach are effective for adapting smaller LMs to generate text in specific, underrepresented language varieties, offering a computationally efficient alternative to training large models from scratch or relying on generic systems. The release of the PTradutor dataset, models, and code contributes valuable resources to the community for further research in this area. Future work includes exploring different generation strategies (like beam search), optimizing prompts, and conducting human evaluations by linguists.