Amadeus-Verbo: LLMs for Brazilian Portuguese
- Amadeus-Verbo is a family of decoder-only Transformer LLMs developed for Brazilian Portuguese using the Qwen2.5 architecture and full-parameter fine-tuning.
- The models are trained on over 600K instruction–output pairs with rigorous preprocessing, using SLERP-based checkpoint merging to adapt to the nuances of the language.
- Benchmark results show that Amadeus-Verbo achieves parity or gains over multilingual baselines, while offering efficient deployment across diverse GPU configurations.
Amadeus-Verbo is a family of LLMs specifically developed for Brazilian Portuguese, extending the Qwen2.5 decoder-only Transformer architecture. The collection encompasses seven parameter scales and multiple variants targeting a spectrum of natural language processing tasks, with checkpoints openly provided for research and deployment. The development process focuses on full-parameter supervised fine-tuning and checkpoint merging, maintaining architectural consistency with Qwen2.5 while adapting model behavior to the linguistic and stylistic properties of Brazilian Portuguese (Cruz-Castañeda et al., 20 May 2025).
1. Model Architecture and Variants
Amadeus-Verbo models are built upon the Qwen2.5 decoder-only Transformer, preserving its architectural primitives:
- Rotary position embeddings (RoPE) for encoding positional relationships.
- Causal self-attention blocks for autoregressive modeling.
- SwiGLU activations in the MLP layers.
- Pre-normalization (RMSNorm applied before both attention and MLP).
- Mixed-precision (bfloat16) weights during fine-tuning.
The suite offers three primary variants (base-instruct, fine-tuned instruct, and merged-instruct) at the following scales:
| Parameter Count | Layers | Hidden Dim. | Attention Heads (Q / KV) |
|---|---|---|---|
| 0.5B | 24 | 896 | 14 / 2 |
| 1.5B | 28 | 1,536 | 12 / 2 |
| 3B | 36 | 2,048 | 16 / 2 |
| 7B | 28 | 3,584 | 28 / 4 |
| 14B | 48 | 5,120 | 40 / 8 |
| 32B | 64 | 5,120 | 40 / 8 |
| 72B | 80 | 8,192 | 64 / 8 |

The per-scale configurations follow the corresponding Qwen2.5 checkpoints, which use grouped-query attention (hence the separate query/key-value head counts).
No modifications are made to architectural blocks for Brazilian Portuguese adaptation; instead, models undergo full-parameter supervised fine-tuning on an extensive corpus of Portuguese instructions. The final checkpoints maintain the expressive power of the original Qwen2.5, with instruction-following and generation style tuned to Brazilian Portuguese (Cruz-Castañeda et al., 20 May 2025).
2. Data, Preprocessing, and Training
The instruction-tuning process employs a corpus of approximately 600,000 Portuguese instruction–output pairs, from which a high-quality subset of 78,840 instances is used for supervised fine-tuning. Each instance comprises an instruction, an optional input, an output, and the full prompt text that embeds these components with the “Resposta:” cue.
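A minimal sketch of assembling a training prompt from these fields. Only the “Resposta:” cue is documented; the exact template and the Alpaca-style field names (instruction/input/output) are assumptions for illustration.

```python
# Sketch of building a full prompt from an instance's fields.
# The precise template is an assumption; only the "Resposta:" cue
# is documented in the paper's description.
def build_prompt(instruction: str, output: str, inp: str = "") -> str:
    parts = [instruction]
    if inp:                      # the input field is optional
        parts.append(inp)
    parts.append("Resposta:")
    prompt = "\n\n".join(parts)
    return prompt + " " + output

example = build_prompt(
    instruction="Traduza para o inglês:",
    inp="O céu está azul.",
    output="The sky is blue.",
)
```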
Preprocessing converts the JSON corpus into the HuggingFace Datasets format, applies the Qwen2.5 tokenizer (truncating/padding to max_length=8192), and filters instances for quality and deduplication.
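The filtering step can be sketched as exact-match deduplication plus length truncation. The real pipeline uses the Qwen2.5 tokenizer via HuggingFace Datasets; a whitespace split stands in for tokenization here to keep the example self-contained, and the record schema is assumed.

```python
# Toy version of the described filtering: drop empty and duplicate
# prompts, then truncate to max_length tokens. A whitespace split
# stands in for the Qwen2.5 tokenizer.
MAX_LENGTH = 8192

def preprocess(records: list[dict]) -> list[dict]:
    seen = set()
    cleaned = []
    for rec in records:
        key = rec["prompt"].strip()
        if not key or key in seen:          # quality filter + dedup
            continue
        seen.add(key)
        tokens = key.split()[:MAX_LENGTH]   # stand-in for tokenizer truncation
        cleaned.append({"prompt": key, "input_ids_len": len(tokens)})
    return cleaned

data = preprocess([
    {"prompt": "Explique fotossíntese. Resposta: ..."},
    {"prompt": "Explique fotossíntese. Resposta: ..."},  # duplicate, dropped
    {"prompt": ""},                                      # empty, dropped
])
```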
Fine-tuning utilizes full-parameter gradient-based optimization (no LoRA or prefix tuning), under Swift (with model-parallelism, DDP, ZeRO) and HuggingFace Transformers/Accelerate/Datasets. Hyperparameters are unified across model sizes: batch_size_per_device=1, 2 epochs, learning_rate=1e-5, AdamW optimizer (β₁=0.9, β₂=0.95, ε=1e-8), cosine lr_scheduler with warmup_ratio=0.05, weight_decay=0.01, max_grad_norm=1.0, and gradient_checkpointing enabled.
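The unified hyperparameters above can be collected into a single configuration. Key names mirror HuggingFace TrainingArguments fields; this is a sketch of the reported settings, not the authors' actual launch script.

```python
# Reported fine-tuning hyperparameters, expressed as a config dict
# whose keys follow HuggingFace TrainingArguments naming (an assumption;
# the authors trained under Swift).
TRAINING_CONFIG = {
    "per_device_train_batch_size": 1,
    "num_train_epochs": 2,
    "learning_rate": 1e-5,
    "optim": "adamw_torch",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "gradient_checkpointing": True,
    "bf16": True,
}
```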
The training objective is the next-token cross-entropy loss

$$\mathcal{L} = -\sum_{t}\sum_{v \in V} y_{t,v}\,\log \hat{y}_{t,v},$$

where $y_{t,v}$ is the one-hot ground truth and $\hat{y}_{t,v}$ the predicted softmax probability for token $v$ at position $t$. Stability is ensured through weight decay, gradient clipping (to max_grad_norm = 1.0), bfloat16 precision, and activation checkpointing (Cruz-Castañeda et al., 20 May 2025).
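The objective reduces to the average negative log-probability assigned to each ground-truth token. A toy illustration (probabilities are made up for the example):

```python
import math

# Next-token cross-entropy: mean negative log-probability of the
# ground-truth token at each position.
def cross_entropy(target_ids: list[int], probs: list[list[float]]) -> float:
    """target_ids[t] is the ground-truth token id at step t;
    probs[t] is the model's softmax distribution over the vocabulary."""
    total = -sum(math.log(probs[t][target_ids[t]]) for t in range(len(target_ids)))
    return total / len(target_ids)

# Two steps over a 3-token vocabulary.
loss = cross_entropy(
    target_ids=[2, 0],
    probs=[[0.1, 0.1, 0.8],   # 0.8 on the correct token
           [0.7, 0.2, 0.1]],  # 0.7 on the correct token
)
# loss = -(ln 0.8 + ln 0.7) / 2 ≈ 0.29
```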
3. Checkpoint Merging and Instruction Tuning via SLERP
Amadeus-Verbo leverages Spherical Linear Interpolation (SLERP) to merge checkpoints, using the mergekit toolkit. The SLERP procedure interpolates layer-wise between the base-instruct and fine-tuned-instruct weights, controlled by a merge factor $t \in [0, 1]$.

For parameter tensors $\theta_0$ and $\theta_1$ separated by angle $\Omega = \arccos\!\left(\dfrac{\theta_0 \cdot \theta_1}{\lVert\theta_0\rVert\,\lVert\theta_1\rVert}\right)$:

$$\mathrm{SLERP}(\theta_0, \theta_1; t) = \frac{\sin\!\big((1-t)\,\Omega\big)}{\sin\Omega}\,\theta_0 + \frac{\sin(t\,\Omega)}{\sin\Omega}\,\theta_1$$

Merge factors are configured per layer type, with separate values for self-attention layers, MLP layers, and all remaining parameters. The result is the “MI” (merged-instruct) checkpoints, e.g., Amadeus-Verbo-MI-Qwen2.5-14B-PT-BR-Instruct (Cruz-Castañeda et al., 20 May 2025).
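The interpolation can be sketched for flat parameter vectors. mergekit applies this per layer to full weight tensors (falling back to linear interpolation when the endpoints are nearly parallel); this toy version does the same for plain Python lists.

```python
import math

# Minimal SLERP between two parameter vectors, mirroring the formula
# above. Falls back to LERP when the vectors are nearly parallel,
# where sin(omega) ~ 0 would be numerically unstable.
def slerp(t: float, v0: list[float], v1: list[float]) -> list[float]:
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    if omega < 1e-6:  # nearly parallel: plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

merged = slerp(0.5, [1.0, 0.0], [0.0, 1.0])  # halfway between orthogonal vectors
```

Unlike linear averaging, SLERP preserves the norm of unit vectors along the interpolation path, which is the usual motivation for using it on weight tensors.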
4. Benchmark Evaluation and Comparative Results
Evaluation employs a natively Portuguese adaptation of the EleutherAI LM Evaluation Harness, spanning nine tasks:
- assin2_rte (F1 Macro)
- assin2_sts (Pearson)
- bluex (reading comprehension, F1 Macro)
- enem (ENEM exam, Accuracy)
- faquad_nli (F1 Macro)
- hatebr (F1 Macro)
- hate_speech (F1 Macro)
- tweetsentbr (F1 Macro)
- oab_exams (Accuracy)
Testing is performed in few-shot regimes (3–25 exemplars per prompt) on 8×A100 or H200 GPUs.
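Most tasks above report F1 Macro, the unweighted mean of per-class F1 scores, so minority classes weigh as much as majority ones. A pure-Python sketch for illustration (the harness uses its own metric implementations):

```python
# F1 Macro: compute F1 per class, then average without class weighting.
def f1_macro(y_true: list[int], y_pred: list[int]) -> float:
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * precision * recall / (precision + recall)
                      if precision + recall else 0.0)
    return sum(scores) / len(scores)

score = f1_macro([0, 0, 1, 1], [0, 1, 1, 1])
```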
Across all model sizes, Amadeus-Verbo’s base-instruct (BI), fine-tuned instruct (FI), and merged instruct (MI) variants consistently match or surpass the performance of vanilla Qwen2.5-Instruct. For the 14B model, key results include: assin2_rte (0.95 F1 Macro, equal to baseline); faquad_nli (0.83 F1 Macro, +0.03 over baseline); enem (0.81 Accuracy, matched baseline). At the 32B and 72B scales, improvements reach +0.04 in Pearson or F1 on STS and NLI tasks (Cruz-Castañeda et al., 20 May 2025).
5. Deployment, Accessibility, and Inference Efficiency
All Amadeus-Verbo models (BI, FI, MI at seven sizes) are distributed via HuggingFace Hub (https://huggingface.co/collections/amadeusai/amadeus-verbo-qwen25-67cf2e7aae69ce2b3bcdcfda). Model repositories provide standard artifacts: config files, tokenizer assets, PyTorch or Safetensors model weights, and README with usage examples.
Models are loaded through HuggingFace Transformers. Example Python invocation:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "amadeusai/Amadeus-Verbo-MI-Qwen2.5-14B-PT-BR-Instruct"
)
model = AutoModelForCausalLM.from_pretrained(
    "amadeusai/Amadeus-Verbo-MI-Qwen2.5-14B-PT-BR-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```
Inference optimizations include device_map="auto" for automatic weight placement and low_cpu_mem_usage=True, which reduces host-RAM overhead during model loading. The 0.5B variant operates within a single 8GB GPU (~4GB peak memory), while the 72B model requires ≥8×80GB GPUs or ZeRO-3/FSDP for distributed inference. Greedy-decoding latencies are approximately 20ms/token (0.5B), 100ms/token (14B), and 400ms/token (72B, multi-GPU) (Cruz-Castañeda et al., 20 May 2025).
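These memory figures follow from bfloat16 storage: 2 bytes per parameter for weights alone, before KV-cache and activation overhead. A back-of-envelope sketch:

```python
# Weight-memory estimate for bfloat16 checkpoints: 2 bytes per
# parameter. Actual peak usage is higher due to KV cache and
# activations, which this sketch deliberately ignores.
BYTES_PER_PARAM_BF16 = 2

def weight_memory_gb(n_params_billions: float) -> float:
    return n_params_billions * 1e9 * BYTES_PER_PARAM_BF16 / 1e9  # GB

small = weight_memory_gb(0.5)   # 1 GB of weights -> fits a single 8GB GPU
large = weight_memory_gb(72)    # 144 GB of weights -> needs multi-GPU sharding
```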
6. Context and Significance
Amadeus-Verbo represents a focused effort to democratize large-scale language modeling for Brazilian Portuguese, demonstrating the feasibility of adapting recent foundation models when appropriate data and compute are available. The approach—retaining the architectural foundation of Qwen2.5 while exclusively leveraging full-parameter supervised fine-tuning and systematic checkpoint merging—underscores the capacity for open-source LLM advancement in languages with limited ready-made resources. Performance parity or gains over multilingual baselines across retrieval, reasoning, classification, and comprehension tasks suggest that language-specialized LLMs can robustly meet local needs when adaptation protocols are rigorously implemented (Cruz-Castañeda et al., 20 May 2025).