This paper addresses the challenge of building high-quality machine translation (MT) models for language varieties that are underrepresented in general-purpose translation systems, using European Portuguese (EP) as a case study. The authors introduce "Tradutor," the first open-source translation model specifically for English to European Portuguese, and accompany it with a novel, large-scale parallel dataset called PTradutor.
The core problem is that while languages like Portuguese have many speakers, the vast majority of available data is in the dominant variety (Brazilian Portuguese, BP), leading to suboptimal performance for other varieties like European Portuguese in downstream NLP tasks. The proposed solution is to create a variety-specific MT model.
To overcome the scarcity of parallel data for European Portuguese, the authors propose a back-translation methodology to automatically generate a parallel corpus:
- Monolingual Corpus Collection: They gathered a large collection of texts specifically in European Portuguese from existing datasets like DSL-TL (Zampieri et al., 2023) and PtBrVid (Sousa et al., 2025).
- Translation: Using an off-the-shelf translation system (Google Translate), they translated the European Portuguese texts into English. The rationale is that translation into a resource-rich language (English) is generally of higher quality than translation into a low-resource variety. An experiment confirmed that translating EP and BP texts to English with Google Translate produced highly similar English outputs (87.2% identical, 96.8 BLEU), suggesting that the variety difference introduces minimal contamination at this step.
- Filtering: A rigorous filtering pipeline was applied to ensure data quality: removing boilerplate content with jusText, eliminating duplicates, cleaning invalid characters and repetitive patterns, and discarding documents whose combined source and target length exceeded 900 tokens (measured with the LLaMA-3 tokenizer), keeping examples within the standard 1024-token context window and training efficient. The filtering substantially reduced the dataset size, in particular by removing low-quality social media content.
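A minimal sketch of such a filtering pass, assuming (English, EP) document pairs and using the publicly available justext and transformers libraries; the 900-token threshold is the paper's, while the boilerplate and deduplication logic here is illustrative:

```python
import justext
from transformers import AutoTokenizer

# Tokenizer used only to measure combined source/target length (900-token cap).
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def strip_boilerplate(html: str) -> str:
    """Keep only paragraphs that jusText does not flag as boilerplate."""
    paragraphs = justext.justext(html, justext.get_stoplist("Portuguese"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

def filter_pairs(pairs):
    """pairs: iterable of (english, european_portuguese) documents."""
    seen = set()
    for en, pt in pairs:
        en, pt = en.strip(), pt.strip()
        if not en or not pt or (en, pt) in seen:  # drop empties and exact duplicates
            continue
        seen.add((en, pt))
        n_tokens = len(tok(en)["input_ids"]) + len(tok(pt)["input_ids"])
        if n_tokens > 900:  # keep examples inside the 1024-token context window
            continue
        yield en, pt
```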
This process resulted in the PTradutor corpus, a parallel English-European Portuguese dataset comprising over 1.7 million documents and 293 million Portuguese tokens, which is publicly available on Hugging Face.
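Since the corpus is hosted on the Hugging Face Hub, it should be loadable with the datasets library; the repository ID and field layout below are placeholders, not the published ones:

```python
from datasets import load_dataset

# Placeholder repository ID; substitute the actual PTradutor name on the Hub.
ptradutor = load_dataset("ORG/PTradutor", split="train")
print(ptradutor[0])  # one English document paired with its European Portuguese source
```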
For building the translation model, the authors fine-tuned pre-trained instruction-following language models on the PTradutor corpus. They frame translation as a causal language modeling problem, prompting the model to generate the European Portuguese translation given the English text. They experimented with:
- Models: Gemma-2 (2B parameters), Phi-3-mini (3.8B parameters), and LLaMA-3 (8B parameters), all using their instruction-tuned variants.
- Training Approaches: full fine-tuning of all model parameters and parameter-efficient fine-tuning with LoRA adapters.
Implementation details include training on A100 GPUs using the torchtune and transformers libraries, with a learning rate of 2e-5, weight decay of 0.1, and batch sizes between 256 and 512 depending on the model and approach. Early stopping was applied based on validation performance.
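As a rough illustration of the causal-LM framing, the sketch below formats one training pair with an instruction-style prompt and masks the prompt tokens out of the loss; the prompt wording and the loss-masking choice are assumptions, not the paper's exact recipe:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

def build_example(english: str, portuguese: str, max_len: int = 1024):
    """Format one (EN, EP) pair for causal-LM fine-tuning."""
    # Hypothetical prompt template; the paper's exact wording may differ.
    prompt = f"Translate the following text to European Portuguese:\n{english}\n"
    prompt_ids = tok(prompt, add_special_tokens=False)["input_ids"]
    target_ids = tok(portuguese + tok.eos_token, add_special_tokens=False)["input_ids"]

    input_ids = (prompt_ids + target_ids)[:max_len]
    # Supervise only the translation: prompt positions get the ignore index -100.
    labels = ([-100] * len(prompt_ids) + target_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```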
Evaluation was conducted on two European Portuguese benchmarks: FRMT (Riley et al., 2023) and NTREX (Federmann et al., 2022). They used a suite of metrics:
- N-gram based: BLEU (Papineni et al., 2002) and ROUGE-L (Lin, 2004) to measure lexical overlap.
- Learnable: COMET (Rei et al., 2020) for a more nuanced quality assessment (the reference-based Direct Assessment variant).
- Language Variety: A custom VID score calculated with a Portuguese variety classifier (Sousa et al., 2025). This metric is the ratio of system outputs classified as European Portuguese to reference translations classified as EP, indicating how well the system adheres to the target variety.
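Under that description, VID reduces to a ratio of classifier decisions; a minimal sketch, where classify_variety is a hypothetical stand-in for the Portuguese variety classifier:

```python
def vid_score(hypotheses, references, classify_variety):
    """Ratio of system outputs labeled EP to references labeled EP.

    classify_variety(text) -> "EP" | "BP" stands in for the variety
    classifier of Sousa et al. (2025); its real interface may differ.
    """
    hyp_ep = sum(classify_variety(h) == "EP" for h in hypotheses)
    ref_ep = sum(classify_variety(r) == "EP" for r in references)
    return hyp_ep / ref_ep if ref_ep else 0.0
```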
The results show that fully fine-tuned models significantly outperform zero-shot baselines and existing open-source generic Portuguese translation systems (ArgosTranslate, Opus-MT) on standard translation metrics (BLEU, ROUGE-L, COMET) and, critically, on the VID score, demonstrating better adherence to the European Portuguese variety. The LLaMA-3 (8B) model achieved the best performance among open-source models, with results on par with Google Translate's generic Portuguese model and approaching the performance of variety-specific closed-source systems like Google and DeepL, particularly in terms of VID score.
The LoRA models achieved high VID scores, indicating that they picked up the vocabulary and nuances of EP, but they struggled with overall translation quality and sometimes entered repetitive generation loops. This suggests that, for medium-sized models, increasing adapter capacity may be necessary.
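The repetition loops are an inference-time failure mode; although the paper does not report using them, standard decoding constraints in transformers (a repetition penalty and n-gram blocking) are a common mitigation, sketched here with a placeholder checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute a fine-tuned Tradutor model.
name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate the following text to European Portuguese:\nThe meeting was postponed.\n"
inputs = tok(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=256,
    repetition_penalty=1.2,   # discourage re-emitting recent tokens
    no_repeat_ngram_size=4,   # block verbatim 4-gram repeats
)
print(tok.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```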
The paper demonstrates that the proposed data generation methodology and fine-tuning approach are effective for adapting smaller LMs to generate text in specific, underrepresented language varieties, offering a computationally efficient alternative to training large models from scratch or relying on generic systems. The release of the PTradutor dataset, models, and code contributes valuable resources to the community for further research in this area. Future work includes exploring different generation strategies (like beam search), optimizing prompts, and conducting human evaluations by linguists.