Qwen2.5-3B: Scalable 3B Transformer Model

Updated 14 November 2025
  • Qwen2.5-3B is a scalable Transformer language model featuring efficient decoding, advanced attention mechanisms, and support for long context lengths.
  • Its architecture employs Grouped Query Attention, rotary positional embeddings with ABF, and optimized pre-training on an 18T-token corpus for robust performance.
  • Specialized fine-tuning variants, including instruction-tuned and code-specialized models, deliver state-of-the-art results in dialogue, code generation, and multilingual benchmarks.

Qwen2.5-3B is a 3-billion-parameter, open-weight, decoder-only Transformer LLM developed as part of the Qwen2.5 series led by Alibaba Group. It occupies a critical niche as a scalable, resource-efficient LLM whose footprint and performance make it suitable for research, production, and fine-tuning in a wide spectrum of NLP, code, and multilingual tasks. The Qwen2.5-3B family includes standard base models, instruction-tuned and RLHF-aligned checkpoints, code-specialized variants, distilled ("DistilQwen2.5-3B") students, and language-adapted derivatives such as Amadeus-Verbo for Brazilian Portuguese.

1. Model Architecture

Qwen2.5-3B adopts and refines the core design principles that distinguish the Qwen2.5 family, with additional adaptations for efficiency and extensibility.

| Model Variant | Layers | Hidden Dim | Attn. Heads | Parameters |
|---|---|---|---|---|
| Qwen2.5-3B (core/Amadeus) | 24–36 | 2,048–2,560 | 16–32 | 2.8–3.1B |
| Qwen2.5-Coder-3B | 36 | 2,048 | 16 Q / 2 KV | ~3.1B |
| DistilQwen2.5-3B | 24 | 4,096 | 32 | ~3.1B |

Context window sizes vary by variant, with support up to 32K tokens (core model), 8K (Amadeus-Verbo/Portuguese), and 131K via YARN for code-specialized models.
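The concrete hyperparameters of any released checkpoint can be read directly from its configuration. A minimal sketch using the Hugging Face `transformers` library (the printed values are determined by the checkpoint itself and may differ from the ranges summarized in the table):

```python
# Minimal sketch: read the published Qwen2.5-3B configuration from the Hugging Face Hub
# to confirm layer count, width, GQA head counts, context length, and RoPE base.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B")

print("layers:          ", cfg.num_hidden_layers)        # Transformer blocks
print("hidden size:     ", cfg.hidden_size)              # model width
print("query heads:     ", cfg.num_attention_heads)      # Q heads
print("key/value heads: ", cfg.num_key_value_heads)      # shared KV heads (GQA)
print("max positions:   ", cfg.max_position_embeddings)  # native context window
print("rope theta:      ", cfg.rope_theta)               # RoPE base frequency
```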

2. Pre-training and Post-training Workflow

Pre-training

Qwen2.5-3B is pre-trained on an 18T-token corpus – a substantial increase from earlier Qwen iterations – using a mixture that balances:

  • High-quality web content, filtered for redundancy
  • Synthetic expert-labeled mathematics and code (generated and reward-filtered by 72B-scale Qwen2-Math instructor models)
  • Domain-specific (coder, math, multilingual) corpora
  • Dozens of languages for robust multilinguality

Optimization uses AdamW with decoupled weight decay; batch size and peak learning rate are determined by scaling laws as a function of model and data size. Training begins with a 4,096-token context and is extended to 32,768 tokens by raising the RoPE base frequency (ABF) (Qwen et al., 19 Dec 2024), after which post-training follows.
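A minimal sketch of these two training ingredients: AdamW with decoupled weight decay, and an ABF-style increase of the RoPE base frequency when extending the context window. The learning rate, weight decay, and RoPE base shown here are illustrative assumptions, not the values used to train Qwen2.5-3B:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# ABF-style context extension: raise the RoPE base frequency so positional phases
# stay distinguishable when the training context grows from 4,096 to 32,768 tokens.
cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-3B")
cfg.rope_theta = 1_000_000           # assumed raised base frequency (illustrative)
cfg.max_position_embeddings = 32_768

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B", config=cfg, torch_dtype=torch.bfloat16
)

# AdamW with decoupled weight decay; in the paper, batch size and peak learning rate
# are chosen from scaling laws as a function of model and data size.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95)
)
```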

Post-training

The canonical three-stage alignment framework includes:

  1. Supervised Fine-Tuning (SFT) over >1M diverse instruction–response pairs (including code, math, structured data, and long-sequence tasks).
  2. Direct Preference Optimization (DPO) on ~150K preference pairs from code/math/instruction tasks, validated with execution feedback or human reward (a loss-level sketch follows this list).
  3. Group Relative Policy Optimization (GRPO), an online RL stage leveraging a 72B-parameter reward model optimized against rubrics such as truthfulness, helpfulness, and harmlessness.

Distilled variants follow a black-box→white-box knowledge distillation pipeline (Wang et al., 21 Apr 2025), using data augmentation agents and logit-level token matching from 14B+/32B+/72B teachers.
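The stage-2 objective can be written compactly. A minimal, loss-level sketch of DPO, assuming summed per-response token log-probabilities from the trained policy and a frozen reference model (function and argument names are illustrative, not taken from the Qwen2.5 training code):

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss over a batch of preference pairs (summed token log-probs per response)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)        # implicit reward, preferred answer
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)  # implicit reward, dispreferred answer
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

Preference-optimization libraries such as TRL implement this same sigmoid-form objective (among other variants) in their DPO trainers.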

3. Specialized Fine-Tuning and Derivatives

Qwen2.5-3B serves as a backbone for several specialized variants and downstream fine-tuning projects:

  • DistilQwen2.5-3B:

Distilled from teacher ensembles (Qwen-max, GPT-4/o) via multi-agent data augmentation and top-K logit fusion against larger teachers. Achieves stronger instruction following and better efficiency (e.g., in an SQL-completion use case: a 2.6× speed-up with only a ~1% drop in Pass@1 vs. the 7B model) (Wang et al., 21 Apr 2025).
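A minimal sketch of the white-box portion of such a pipeline, i.e. matching the student to the teacher's top-K logits; the function name, K, and temperature are illustrative assumptions rather than the DistilQwen2.5 recipe:

```python
import torch
import torch.nn.functional as F

def topk_logit_distillation_loss(student_logits: torch.Tensor,
                                 teacher_logits: torch.Tensor,
                                 k: int = 20, temperature: float = 2.0):
    """Match the student to the teacher distribution restricted to the teacher's top-K tokens."""
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)   # teacher's K best tokens per position
    student_topk = student_logits.gather(-1, topk_idx)     # student scores at those tokens

    teacher_probs = F.softmax(topk_vals / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_topk / temperature, dim=-1)

    # KL(teacher || student) on the truncated support, scaled as in standard distillation.
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * temperature ** 2
```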

  • Qwen2.5-Coder-3B:

Specialized on code via ≈5.5T tokens of file- and repository-level filtered code, math, and grounded text, with Fill-In-the-Middle (FIM) training and very long context (YaRN + RoPE up to 131K tokens). Sets a new state of the art (SOTA) among 3B open models for code generation, completion, and reasoning: HumanEval pass@1 of 52.4% vs. 31.7% for StarCoder2-3B (Hui et al., 18 Sep 2024).
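A minimal FIM usage sketch, assuming the <|fim_prefix|> / <|fim_suffix|> / <|fim_middle|> control tokens published with the Qwen2.5-Coder series and the Hugging Face `transformers` API; the snippet to complete is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-3B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Ask the model to fill in the body between a known prefix and suffix.
prefix = "def mean(xs):\n    total = 0\n"
suffix = "\n    return total / len(xs)\n"
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```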

  • Amadeus-Verbo Qwen2.5-3B:

Full-parameter fine-tuning and merging for Brazilian Portuguese. Maintains or improves performance over the original Qwen2.5-3B-Instruct on multiple Portuguese downstream tasks, such as HATEBR (F1-macro: 0.70) and assin2_sts (Pearson: 0.80) (Cruz-Castañeda et al., 20 May 2025).
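A minimal sketch of the kind of checkpoint merging referred to above: a uniform linear interpolation of two same-architecture fine-tunes. The second checkpoint path and the 0.5 mixing weight are hypothetical placeholders, not the Amadeus-Verbo recipe:

```python
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained("path/to/portuguese-full-finetune",  # hypothetical checkpoint
                                             torch_dtype=torch.bfloat16)

merged = base.state_dict()
for name, tuned_param in tuned.state_dict().items():
    if merged[name].is_floating_point():                          # skip any non-float buffers
        merged[name] = 0.5 * merged[name] + 0.5 * tuned_param     # uniform weight average

base.load_state_dict(merged)
base.save_pretrained("qwen2.5-3b-merged")                         # merged checkpoint, ready for evaluation
```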

  • Dialogue Fine-Tuning (e.g., Movie Dialogues):

Qwen2.5-3B is shown to deliver near-SOTA dialogue quality in low-VRAM (8GB) settings, using 4-bit quantization + QLoRA and efficiency optimizations (FlashAttention, NEFTune, dense packing). G-Eval scores (0–1 scale) reach up to 0.69 for fluency and 0.68 for coherence after DPO tuning. Human evaluators prefer the DPO-tuned outputs 52% of the time, vs. 37% for the base fine-tuned model and 11% for the original checkpoint (Gupta, 22 Feb 2025).
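A minimal sketch of this low-VRAM recipe (4-bit NF4 quantization, QLoRA adapters, NEFTune noise, dense packing) using the `transformers`, `peft`, and `trl` libraries; the dataset name and all hyperparameters are illustrative assumptions:

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "Qwen/Qwen2.5-3B"
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb, device_map="auto")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])

args = SFTConfig(
    output_dir="qwen2.5-3b-dialogue-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,     # simulate a large batch on an 8GB GPU
    neftune_noise_alpha=5,              # NEFTune embedding noise
    packing=True,                       # dense packing of short dialogue samples
    bf16=True,
    num_train_epochs=1,
)

# Assumes a dataset exposing a "text" column of formatted dialogue turns (hypothetical id).
train_ds = load_dataset("your-org/movie-dialogues", split="train")

trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds, peft_config=lora)
trainer.train()
```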

4. Quantization, Parallelism, and Efficient Deployment

The Qwen2.5-3B series is engineered for efficient deployment:

  • Quantization: 4-bit (GPTQ) and 8-bit (block-wise, scale–zero point) quantization reduces VRAM below 8GB, enabling training and inference on consumer-grade GPUs, while int8 (LLM.int8()) targets maximum compression on CPUs (Qwen et al., 19 Dec 2024); see the sketch after this list.
  • Inference throughput: >200 tokens/s on a single A100; >20 tokens/s on CPUs with 4-bit models; latency per token as low as 15 ms (Portuguese variant) (Cruz-Castañeda et al., 20 May 2025).
  • Gradient Accumulation/Memory Optimizations: High batch size simulated via gradient accumulation, NEFTune, and FlashAttention enable full fine-tuning under tight resource ceilings (e.g., RTX 3060 Ti, 8GB VRAM) (Gupta, 22 Feb 2025).
  • Distillation/Model Fusion: DistilQwen2.5-3B achieves ≈2.6× speed-up over 7B with near-equal accuracy in SQL completion (Wang et al., 21 Apr 2025).
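As a worked illustration of the block-wise scale/zero-point scheme in the quantization bullet above, the following sketch quantizes a single weight block to 8 bits and measures the round-trip error (block size and tensors are illustrative):

```python
import torch

def quantize_block(w: torch.Tensor):
    """Asymmetric 8-bit quantization of one weight block (per-block scale and zero point)."""
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / 255.0
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale + zero_point), 0, 255).to(torch.uint8)
    return q, scale, zero_point

def dequantize_block(q, scale, zero_point):
    return (q.float() - zero_point) * scale

w = torch.randn(4096)                       # one flattened block of weights
q, scale, zp = quantize_block(w)
w_hat = dequantize_block(q, scale, zp)
print("max abs error:", (w - w_hat).abs().max().item())   # bounded by roughly scale / 2
```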

5. Benchmarks and Quantitative Results

Qwen2.5-3B performs robustly across general-language, code, mathematics, and instruction-following benchmarks:

| Task / Benchmark | Qwen2.5-3B Base | Qwen2.5-3B-Instruct | DistilQwen2.5-3B | Qwen2.5-Coder-3B |
|---|---|---|---|---|
| MMLU (5-shot) | 65.6% | 64.4% (redux) | - | - |
| GSM8K (CoT / few-shot) | 79.1% | 86.7% | - | 75.7% |
| HumanEval (0-shot, pass@1) | 42.1% | 74.4% | - | 52.4% |
| MBPP (0-shot) | 57.1% | 72.7% | - | 72.2% |
| IFEval (instruction following) | - | 58.2% | 67.03%* | - |
| SQL Pass@1 (production use case) | - | - | 17.9% | - |
| MultiPL-E / LiveCodeBench | - | 60.2% / 19.9% | - | - |
| Portuguese OAB_Exams | - | 0.47 | - | - |

*DistilQwen2.5-3B "full pipeline"; highest performance from all distillation stages.

6. Applications, Limitations, and Future Directions

Applications

Typical applications include instruction-following dialogue systems, code generation and completion, multilingual NLP (including Brazilian Portuguese), structured-data tasks such as SQL completion, and resource-efficient fine-tuning research on consumer-grade hardware.

Limitations

  • Performance on the most challenging OOD and closed-domain tasks remains below best proprietary LLMs (GPT-4o, Claude-3.5) (Hui et al., 18 Sep 2024).
  • Context window and inference speed, though large/fast relative to model size, still pose scaling limitations for extreme long-form applications.
  • Data transparency: Some pre-training/instruction data details (domain mixture ratios, schedule hyperparameters) are undisclosed for select variants.
  • Societal and prompt sensitivity risks (Portuguese and otherwise) reflect inheritance from general-corpus pre-training.

Future Directions

  • Expanded, contemporary dialogue and multilingual corpora to reduce bias and enhance context adaptation.
  • Human-in-the-loop preference data to complement LLM-generated DPO targets.
  • Further size scaling (Qwen2.5-14B/32B/72B) and structured weight merging to maximize quality within fixed resource constraints.
  • Pruning and model distillation for ultra-low-latency, edge deployment.

7. Model Availability and Ecosystem

All major 3B-weight variants, including Qwen2.5-3B, Qwen2.5-Coder-3B, Amadeus-Verbo Qwen2.5-3B, and DistilQwen2.5-3B, are available as open weights on platforms such as HuggingFace. The series is foundational for ongoing language, code, and task adaptation research and provides direct support for both engineering production needs and experimental research in scalable LLMs. Leading benchmark results, strong multilingual and code abilities, and comprehensive quantized releases enable broad adoption in both academic and industrial contexts (Qwen et al., 19 Dec 2024, Hui et al., 18 Sep 2024, Gupta, 22 Feb 2025, Wang et al., 21 Apr 2025, Cruz-Castañeda et al., 20 May 2025).
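A minimal sketch of pulling the open weights from the Hugging Face Hub and running a chat-formatted generation with the instruction-tuned checkpoint; the sampling settings are illustrative defaults rather than model-card recommendations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-3B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Summarize what grouped query attention does."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

out = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```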
