Amadeus-Verbo: LLMs for Brazilian Portuguese
- Amadeus-Verbo is a family of decoder-only Transformer LLMs developed for Brazilian Portuguese using the Qwen2.5 architecture and full-parameter fine-tuning.
- The models are trained on over 600K instruction–output pairs with rigorous preprocessing, using SLERP-based checkpoint merging to adapt to the nuances of the language.
- Benchmark results show that Amadeus-Verbo achieves parity or gains over multilingual baselines, while offering efficient deployment across diverse GPU configurations.
Amadeus-Verbo is a family of LLMs specifically developed for Brazilian Portuguese, extending the Qwen2.5 decoder-only Transformer architecture. The collection encompasses seven parameter scales and multiple variants targeting a spectrum of natural language processing tasks, with checkpoints openly provided for research and deployment. The development process focuses on full-parameter supervised fine-tuning and checkpoint merging, maintaining architectural consistency with Qwen2.5 while adapting model behavior to the linguistic and stylistic properties of Brazilian Portuguese (Cruz-Castañeda et al., 20 May 2025).
1. Model Architecture and Variants
Amadeus-Verbo models are built upon the Qwen2.5 decoder-only Transformer, preserving its architectural primitives:
- Rotary position embeddings (RoPE) for encoding positional relationships.
- Causal self-attention blocks for autoregressive modeling.
- SwiGLU activations in the MLP layers.
- Pre-normalization (RMSNorm applied before both attention and MLP).
- Mixed-precision (bfloat16) weights during fine-tuning.
The suite offers three primary variants (base-instruct, fine-tuned instruct, and merged-instruct) at the following scales:
| Parameter Count | Layers | Hidden Dim. | Attention Heads (Q / KV) |
|---|---|---|---|
| 0.5B | 24 | 896 | 14 / 2 |
| 1.5B | 28 | 1,536 | 12 / 2 |
| 3B | 36 | 2,048 | 16 / 2 |
| 7B | 28 | 3,584 | 28 / 4 |
| 14B | 48 | 5,120 | 40 / 8 |
| 32B | 64 | 5,120 | 40 / 8 |
| 72B | 80 | 8,192 | 64 / 8 |

The per-scale configurations follow the corresponding Qwen2.5 checkpoints, which use grouped-query attention (hence the separate query/key-value head counts).
No modifications are made to architectural blocks for Brazilian Portuguese adaptation; instead, models undergo full-parameter supervised fine-tuning on an extensive corpus of Portuguese instructions. The final checkpoints maintain the expressive power of the original Qwen2.5, with instruction-following and generation style tuned to Brazilian Portuguese (Cruz-Castañeda et al., 20 May 2025).
2. Data, Preprocessing, and Training
The instruction-tuning process employs a corpus of approximately 600,000 Portuguese instruction–output pairs, from which a high-quality subset of 78,840 instances is used for supervised fine-tuning. Each instance comprises an instruction, an optional input, an output, and the full prompt text that embeds these components with the “Resposta:” cue.
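A minimal sketch of assembling a training prompt from these fields. Only the “Resposta:” cue is documented; the exact template and the Alpaca-style field names (instruction/input/output) are assumptions for illustration.

```python
# Sketch of building a full prompt from an instance's fields.
# The precise template is an assumption; only the "Resposta:" cue
# is documented in the paper's description.
def build_prompt(instruction: str, output: str, inp: str = "") -> str:
    parts = [instruction]
    if inp:                      # the input field is optional
        parts.append(inp)
    parts.append("Resposta:")
    prompt = "\n\n".join(parts)
    return prompt + " " + output

example = build_prompt(
    instruction="Traduza para o inglês:",
    inp="O céu está azul.",
    output="The sky is blue.",
)
```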
Preprocessing converts the JSON corpus into the HuggingFace Datasets format, applies the Qwen2.5 tokenizer (truncating/padding to max_length=8192), and filters instances for quality and deduplication.
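The filtering step can be sketched as exact-match deduplication plus length truncation. The real pipeline uses the Qwen2.5 tokenizer via HuggingFace Datasets; a whitespace split stands in for tokenization here to keep the example self-contained, and the record schema is assumed.

```python
# Toy version of the described filtering: drop empty and duplicate
# prompts, then truncate to max_length tokens. A whitespace split
# stands in for the Qwen2.5 tokenizer.
MAX_LENGTH = 8192

def preprocess(records: list[dict]) -> list[dict]:
    seen = set()
    cleaned = []
    for rec in records:
        key = rec["prompt"].strip()
        if not key or key in seen:          # quality filter + dedup
            continue
        seen.add(key)
        tokens = key.split()[:MAX_LENGTH]   # stand-in for tokenizer truncation
        cleaned.append({"prompt": key, "input_ids_len": len(tokens)})
    return cleaned

data = preprocess([
    {"prompt": "Explique fotossíntese. Resposta: ..."},
    {"prompt": "Explique fotossíntese. Resposta: ..."},  # duplicate, dropped
    {"prompt": ""},                                      # empty, dropped
])
```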
Fine-tuning utilizes full-parameter gradient-based optimization (no LoRA or prefix tuning), under Swift (with model-parallelism, DDP, ZeRO) and HuggingFace Transformers/Accelerate/Datasets. Hyperparameters are unified across model sizes: batch_size_per_device=1, 2 epochs, learning_rate=1e-5, AdamW optimizer (β₁=0.9, β₂=0.95, ε=1e-8), cosine lr_scheduler with warmup_ratio=0.05, weight_decay=0.01, max_grad_norm=1.0, and gradient_checkpointing enabled.
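The unified hyperparameters above can be collected into a single configuration. Key names mirror HuggingFace TrainingArguments fields; this is a sketch of the reported settings, not the authors' actual launch script.

```python
# Reported fine-tuning hyperparameters, expressed as a config dict
# whose keys follow HuggingFace TrainingArguments naming (an assumption;
# the authors trained under Swift).
TRAINING_CONFIG = {
    "per_device_train_batch_size": 1,
    "num_train_epochs": 2,
    "learning_rate": 1e-5,
    "optim": "adamw_torch",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "gradient_checkpointing": True,
    "bf16": True,
}
```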
The training objective is the next-token cross-entropy loss

$$\mathcal{L} = -\sum_{t}\sum_{v \in V} y_{t,v}\,\log \hat{y}_{t,v},$$

where $y_{t,v}$ is the one-hot ground truth and $\hat{y}_{t,v}$ the predicted softmax probability for token $v$ at position $t$. Stability is ensured through weight decay, gradient clipping (to max_grad_norm = 1.0), bfloat16 precision, and activation checkpointing (Cruz-Castañeda et al., 20 May 2025).
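The objective reduces to the average negative log-probability assigned to each ground-truth token. A toy illustration (probabilities are made up for the example):

```python
import math

# Next-token cross-entropy: mean negative log-probability of the
# ground-truth token at each position.
def cross_entropy(target_ids: list[int], probs: list[list[float]]) -> float:
    """target_ids[t] is the ground-truth token id at step t;
    probs[t] is the model's softmax distribution over the vocabulary."""
    total = -sum(math.log(probs[t][target_ids[t]]) for t in range(len(target_ids)))
    return total / len(target_ids)

# Two steps over a 3-token vocabulary.
loss = cross_entropy(
    target_ids=[2, 0],
    probs=[[0.1, 0.1, 0.8],   # 0.8 on the correct token
           [0.7, 0.2, 0.1]],  # 0.7 on the correct token
)
# loss = -(ln 0.8 + ln 0.7) / 2 ≈ 0.29
```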
3. Checkpoint Merging and Instruction Tuning via SLERP
Amadeus-Verbo leverages Spherical Linear Interpolation (SLERP) to merge checkpoints, using the mergekit toolkit. The SLERP procedure interpolates layer-wise between the base-instruct and fine-tuned-instruct weights, controlled by a merge factor $t \in [0, 1]$.

For parameter tensors $\theta_0$ and $\theta_1$ separated by angle $\Omega = \arccos\!\left(\dfrac{\theta_0 \cdot \theta_1}{\lVert\theta_0\rVert\,\lVert\theta_1\rVert}\right)$:

$$\mathrm{SLERP}(\theta_0, \theta_1; t) = \frac{\sin\!\big((1-t)\,\Omega\big)}{\sin\Omega}\,\theta_0 + \frac{\sin(t\,\Omega)}{\sin\Omega}\,\theta_1$$

Merge factors are configured per layer type, with separate values for self-attention layers, MLP layers, and all remaining parameters. The result is the “MI” (merged-instruct) checkpoints, e.g., Amadeus-Verbo-MI-Qwen2.5-14B-PT-BR-Instruct (Cruz-Castañeda et al., 20 May 2025).
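The interpolation can be sketched for flat parameter vectors. mergekit applies this per layer to full weight tensors (falling back to linear interpolation when the endpoints are nearly parallel); this toy version does the same for plain Python lists.

```python
import math

# Minimal SLERP between two parameter vectors, mirroring the formula
# above. Falls back to LERP when the vectors are nearly parallel,
# where sin(omega) ~ 0 would be numerically unstable.
def slerp(t: float, v0: list[float], v1: list[float]) -> list[float]:
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    if omega < 1e-6:  # nearly parallel: plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

merged = slerp(0.5, [1.0, 0.0], [0.0, 1.0])  # halfway between orthogonal vectors
```

Unlike linear averaging, SLERP preserves the norm of unit vectors along the interpolation path, which is the usual motivation for using it on weight tensors.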
4. Benchmark Evaluation and Comparative Results
Evaluation employs a natively Portuguese adaptation of the EleutherAI LM Evaluation Harness, spanning nine tasks:
- assin2_rte (F1 Macro)
- assin2_sts (Pearson)
- bluex (reading comprehension, F1 Macro)
- enem (ENEM exam, Accuracy)
- faquad_nli (F1 Macro)
- hatebr (F1 Macro)
- hate_speech (F1 Macro)
- tweetsentbr (F1 Macro)
- oab_exams (Accuracy)
Testing is performed in few-shot regimes (3–25 exemplars per prompt) on 8×A100 or H200 GPUs.
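Most tasks above report F1 Macro, the unweighted mean of per-class F1 scores, so minority classes weigh as much as majority ones. A pure-Python sketch for illustration (the harness uses its own metric implementations):

```python
# F1 Macro: compute F1 per class, then average without class weighting.
def f1_macro(y_true: list[int], y_pred: list[int]) -> float:
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

score = f1_macro([0, 0, 1, 1], [0, 1, 1, 1])
```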
Across all model sizes, Amadeus-Verbo’s base-instruct (BI), fine-tuned instruct (FI), and merged instruct (MI) variants consistently match or surpass the performance of vanilla Qwen2.5-Instruct. For the 14B model, key results include: assin2_rte (0.95 F1 Macro, equal to baseline); faquad_nli (0.83 F1 Macro, +0.03 over baseline); enem (0.81 Accuracy, matched baseline). At the 32B and 72B scales, improvements reach +0.04 in Pearson or F1 on STS and NLI tasks (Cruz-Castañeda et al., 20 May 2025).
5. Deployment, Accessibility, and Inference Efficiency
All Amadeus-Verbo models (BI, FI, MI at seven sizes) are distributed via HuggingFace Hub (https://huggingface.co/collections/amadeusai/amadeus-verbo-qwen25-67cf2e7aae69ce2b3bcdcfda). Model repositories provide standard artifacts: config files, tokenizer assets, PyTorch or Safetensors model weights, and README with usage examples.
Models are loaded through HuggingFace Transformers. Example Python invocation:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "amadeusai/Amadeus-Verbo-MI-Qwen2.5-14B-PT-BR-Instruct"
)
model = AutoModelForCausalLM.from_pretrained(
    "amadeusai/Amadeus-Verbo-MI-Qwen2.5-14B-PT-BR-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
Inference optimizations include device_map="auto" for automatic weight placement and low_cpu_mem_usage=True, which reduces host-RAM overhead during model loading. The 0.5B variant operates within a single 8GB GPU (~4GB peak memory), while the 72B model requires ≥8×80GB GPUs or ZeRO-3/FSDP for distributed inference. Greedy-decoding latencies are approximately 20ms/token (0.5B), 100ms/token (14B), and 400ms/token (72B, multi-GPU) (Cruz-Castañeda et al., 20 May 2025).
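These memory figures follow from bfloat16 storage: 2 bytes per parameter for weights alone, before KV-cache and activation overhead. A back-of-envelope sketch:

```python
# Weight-memory estimate for bfloat16 checkpoints: 2 bytes per
# parameter. Actual peak usage is higher due to KV cache and
# activations, which this sketch deliberately ignores.
BYTES_PER_PARAM_BF16 = 2

def weight_memory_gb(n_params_billions: float) -> float:
    return n_params_billions * 1e9 * BYTES_PER_PARAM_BF16 / 1e9  # GB

small = weight_memory_gb(0.5)   # 1 GB of weights -> fits a single 8GB GPU
large = weight_memory_gb(72)    # 144 GB of weights -> needs multi-GPU sharding
```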
6. Context and Significance
Amadeus-Verbo represents a focused effort to democratize large-scale language modeling for Brazilian Portuguese, demonstrating the feasibility of adapting recent foundation models when appropriate data and compute are available. The approach—retaining the architectural foundation of Qwen2.5 while exclusively leveraging full-parameter supervised fine-tuning and systematic checkpoint merging—underscores the capacity for open-source LLM advancement in languages with limited ready-made resources. Performance parity or gains over multilingual baselines across retrieval, reasoning, classification, and comprehension tasks suggest that language-specialized LLMs can robustly meet local needs when adaptation protocols are rigorously implemented (Cruz-Castañeda et al., 20 May 2025).