Qwen2.5-7B: 7B Transformer LLM
- Qwen2.5-7B is a 7-billion parameter, dense, decoder-only Transformer model designed for diverse tasks like math, code generation, and multilingual understanding.
- It leverages advanced techniques such as Dual Chunk Attention and rotary positional encoding to support up to 128K tokens (or 1M in the Instruct-1M variant) with efficient inference.
- Extensive pre-training on 18 trillion tokens and fine-tuning via supervised and reinforcement learning drive state-of-the-art performance across academic benchmarks in reasoning, math, and coding.
Qwen2.5-7B is an open-weight, 7-billion parameter LLM in the Qwen2.5 series, engineered as a general-purpose, dense Transformer decoder. Building on foundational advances in pre-training scale, architecture refinement, and multi-stage post-training, Qwen2.5-7B is deployed extensively as both a base and instruction-tuned model. It constitutes the backbone for multiple specialized derivatives—including state-of-the-art math (Qwen2.5-Math-7B), code (Qwen2.5-Coder-7B), long-context (Qwen2.5-7B-Instruct-1M), and multilingual/Portuguese (Amadeus-Verbo Qwen2.5-7B) adaptations. Qwen2.5-7B consistently establishes or matches best-in-class results for its scale on diverse academic benchmarks in language understanding, reasoning, mathematics, coding, instruction following, and multilingual transfer (Qwen et al., 19 Dec 2024).
1. Architectural Specification
Qwen2.5-7B employs a dense, decoder-only Transformer with 28 layers, a model hidden dimension of 3584, and grouped query attention (GQA) featuring 28 query heads and 4 key/value heads per layer, each with head dimension 128. Its feed-forward layers use a SwiGLU gating mechanism with an intermediate size of 14,336 (base Qwen2.5-7B) or 18,944 (code specialization). Rotary positional encoding (RoPE) is combined with ABF scaling and QKV bias for robust long-context extrapolation. Normalization is RMSNorm applied in a pre-norm configuration. The model uses a vocabulary of approximately 151,646 tokens; embedding weights are not tied to the output head at this scale (Yang et al., 15 Jul 2024, Qwen et al., 19 Dec 2024, Hui et al., 18 Sep 2024).
All Qwen2.5-7B variants use the same core architectural block design with no mixture-of-experts (MoE) layers, ensuring a predictable inference footprint and full compatibility across base, instruction-tuned, long-context, code, and math-specialized versions. Dual Chunk Attention (DCA) with YARN enables efficient 128K-token context support for the base model and up to 1M tokens for the Qwen2.5-7B-Instruct-1M variant (Qwen et al., 19 Dec 2024, Yang et al., 26 Jan 2025).
| Component | Qwen2.5-7B Base Value | Code-Specialized/Misc. |
|---|---|---|
| Layers | 28 | 28 |
| Hidden dimension | 3584 | 3584 |
| Attention heads | 28Q, 4KV | 28Q, 4KV |
| Head dimension | 128 | 128 |
| Feed-forward inner size | 14,336 | 18,944 (Coder/Math: varies) |
| Positional encoding | RoPE + ABF | RoPE (base frequency 1e6 for long context) |
| Norm/Activation | RMSNorm + SwiGLU | RMSNorm + SwiGLU |
| Context length (pre-train / inference) | 32,768 / 128,000 | up to 1,000,000 (1M variant) |
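To make the attention layout concrete, the following minimal sketch (plain Python, no external dependencies) computes the per-token KV-cache footprint implied by the GQA configuration in the table above. The layer counts and head sizes come from the table; FP16 storage (2 bytes per value) is an illustrative assumption.

```python
# Minimal sketch: per-token KV-cache size implied by Qwen2.5-7B's GQA layout.
# All figures come from the architecture table above; FP16 (2 bytes/value) is assumed.

layers = 28
num_q_heads = 28
num_kv_heads = 4
head_dim = 128
bytes_per_value = 2  # FP16 assumption

# Each layer caches keys and values for the 4 KV heads only (not all 28 query heads).
kv_per_token = layers * 2 * num_kv_heads * head_dim * bytes_per_value
mha_per_token = layers * 2 * num_q_heads * head_dim * bytes_per_value  # hypothetical full MHA

print(f"GQA KV cache:   {kv_per_token / 1024:.1f} KiB per token")
print(f"Full-MHA cache: {mha_per_token / 1024:.1f} KiB per token "
      f"({num_q_heads // num_kv_heads}x larger)")
print(f"128K-token context: {kv_per_token * 128_000 / 2**30:.1f} GiB of KV cache")
```

Under these assumptions, GQA reduces the KV cache by a factor of 7 relative to full multi-head attention, which is a major reason 128K-token inference remains tractable at 7B scale.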
2. Pre-Training Regimen and Data
The model is trained on 18 trillion tokens from a domain-balanced corpus (up from 7T in Qwen2), balancing down-sampled social media and e-commerce data with increased representation of technology, science, academic, code, and mathematical texts. High-quality data for math and code originates from Qwen2.5-Math and Qwen2.5-Coder corpora, combined with synthetic data generated by larger Qwen2-72B-Instruct and Qwen2-Math-72B peers. Quality is filtered by auxiliary reward models across multiple dimensions (Qwen et al., 19 Dec 2024).
The training objective is maximum-likelihood sequence modeling, i.e., minimizing the next-token negative log-likelihood $\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})$. Context length is first capped at 4,096 tokens, then curriculum-extended to 32,768. Context-specific long-sequence and position-retrieval samples are included for long-context variants. Pre-training follows Chinchilla/Kaplan scaling-adjusted learning rates and batch sizing, with AdamW as the optimizer (Qwen et al., 19 Dec 2024, Yang et al., 26 Jan 2025).
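As a concrete illustration of this objective, the snippet below computes the standard shifted next-token cross-entropy with PyTorch. It is a generic sketch of maximum-likelihood sequence modeling, not the project's training code, and the tensor shapes are illustrative assumptions (only the vocabulary size is taken from the section above).

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Standard next-token maximum-likelihood loss.

    logits:    [batch, seq_len, vocab_size] model outputs
    input_ids: [batch, seq_len] token ids of the same sequence
    """
    # Predict token t+1 from positions <= t: shift logits left, labels right.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )

# Illustrative shapes only (vocabulary size from the architecture section above).
logits = torch.randn(2, 16, 151_646)
input_ids = torch.randint(0, 151_646, (2, 16))
print(causal_lm_loss(logits, input_ids))
```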
Specialized models (e.g., Qwen2.5-Math-7B, Qwen2.5-Coder-7B) reuse the backbone, but draw domain-optimized pre-training mixtures (70% code, 20% text, 10% math for Coder; >1T math tokens for Math; code-mixed and exam questions for Math) and integrate synthetic CoT/TIR or code data from strong teacher models (Yang et al., 18 Sep 2024, Hui et al., 18 Sep 2024).
3. Post-Training: Supervised and Reinforcement Methods
Following pre-training, Qwen2.5-7B undergoes:
- Supervised Fine-Tuning (SFT): up to one million instruction-style examples drawn from eight categories, including long-form response generation, chain-of-thought (CoT) math, multilingual code, logical reasoning, structured-data understanding, instruction following (with execution-based rejection sampling), and prompt robustness. Training uses two epochs with context length up to 32K tokens, an annealed learning-rate schedule, weight decay of 0.1, and gradient clipping at norm 1.0 (Qwen et al., 19 Dec 2024).
- Offline RL (DPO): Direct Preference Optimization on 150,000 preference pairs (positive/negative responses) with the standard DPO loss (see the sketch after this list). An “Online Merging Optimizer” reduces the alignment tax.
- Online RL (GRPO): Group Relative Policy Optimization for dialogue and preference alignment, using a PPO-style surrogate objective together with reward-model scoring and sample prioritization (Qwen et al., 19 Dec 2024).
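The DPO objective referenced above can be written compactly. The sketch below is a generic implementation of the standard DPO loss over summed per-sequence log-probabilities; it assumes those log-probabilities have already been computed for the policy and a frozen reference model, and it is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is a [batch] tensor of summed per-sequence log-probabilities.
    beta controls how strongly the policy is pushed away from the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Illustrative values only.
b = torch.tensor
print(dpo_loss(b([-12.0, -9.5]), b([-14.0, -13.0]),
               b([-12.5, -10.0]), b([-13.5, -12.0])))
```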
Specialized variants extend these steps. Qwen2.5-Math-7B uses an explicit self-improvement pipeline—iterative SFT/reward model selection, RM-guided rejection sampling, and group policy optimization—while Qwen2.5-Coder-7B integrates FIM (Fill-in-the-Middle) as an auxiliary objective and synthetic execution-filtered code samples (Yang et al., 18 Sep 2024, Hui et al., 18 Sep 2024).
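The FIM objective rearranges a document so the model learns to infill a missing span given its surrounding context. The sketch below builds a PSM-style (prefix-suffix-middle) training string; the special-token strings follow the Qwen2.5-Coder documentation but should be treated as assumptions to verify against the released tokenizer.

```python
# Minimal sketch of a PSM-style Fill-in-the-Middle (FIM) training example.
# The special-token strings below follow the Qwen2.5-Coder documentation;
# verify them against the released tokenizer before use (assumptions here).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"

def make_fim_example(code: str, span_start: int, span_end: int) -> str:
    """Cut out code[span_start:span_end] and train the model to regenerate it."""
    prefix, middle, suffix = code[:span_start], code[span_start:span_end], code[span_end:]
    # PSM layout: the model sees prefix and suffix, then predicts the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"

snippet = "def add(a, b):\n    return a + b\n"
print(make_fim_example(snippet, snippet.index("return"), snippet.index("a + b")))
```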
4. Empirical Benchmarking
Qwen2.5-7B leads its class across standard academic benchmarks for LLMs. Representative scores (few/zero-shot):
| Task | Mistral-7B | Llama-3-8B | Gemma2-9B | Qwen2-7B | Qwen2.5-7B (base) | Qwen2.5-7B-Instruct |
|---|---|---|---|---|---|---|
| MMLU | 64.2 | 66.6 | 71.3 | 70.3 | 74.2 | – |
| BBH | 56.1 | 57.7 | 68.2 | 62.3 | 70.4 | – |
| GSM8K | 36.2 | 55.3 | 70.7 | 80.2 | 85.4 | 91.6 |
| MATH | 10.2 | 20.5 | 37.7 | 43.5 | 49.8 | 75.5 |
| HumanEval | 29.3 | 33.5 | 37.8 | 51.2 | 57.9 | 84.8 |
Qwen2.5-7B matches or outperforms all published open-weight LLMs at this scale on general, reasoning, math, and code benchmarks. Instruction tuning (SFT + RL) yields large gains, especially on math (MATH: 49.8→75.5%), program synthesis (HumanEval: 57.9→84.8%), and instruction-following tasks (Qwen et al., 19 Dec 2024).
Specialized Variant Performance
- Qwen2.5-Math-7B-Instruct: GSM8K (CoT) 95.2%, MATH 83.6%; supports CoT and Tool-Integrated Reasoning (TIR). Score-guided sampling (RM@N) further closes gap to SOTA (Yang et al., 18 Sep 2024).
- Qwen2.5-Coder-7B: HumanEval pass@1: 61.6% (base), MBPP: 76.9%, MultiPL-E (8 languages) 57.5%, strong FIM and long-context code retrieval (Hui et al., 18 Sep 2024).
- Qwen2.5-7B-Instruct-1M: >80% passkey retrieval at 1M-token context, matches 128K-ctx performance on short-context (Yang et al., 26 Jan 2025).
- Amadeus-Verbo Qwen2.5-7B: achieves the best STS result (Pearson 0.81) and Macro-F1 up to 0.74 on Portuguese tasks after full-parameter base and instruction tuning (Cruz-Castañeda et al., 20 May 2025).
5. Long-Context Scaling and Inference Efficiency
The Qwen2.5-7B architecture supports inference with up to 128,000-token contexts by using Dual Chunk Attention (DCA) with YARN temperature scaling, maintaining perplexity and retrieval accuracy across context sizes (a configuration sketch follows the list below). The 1M-token extension (Qwen2.5-7B-Instruct-1M) integrates additional long-range pre-training, RoPE base-frequency scaling, progressive length curricula, and multiple memory and kernel optimizations:
- Sparse Attention (MInference): Slash-pattern head-wise sparsity, reducing runtime by up to 10x for 1M context (Yang et al., 26 Jan 2025).
- BladeLLM kernels and DCPP: Pipeline parallel chunking, kernel fusion, and memory optimization deliver >25× acceleration over dense attention.
- Throughput: Community results indicate ~25 tokens/s (FP16) and ~60 tokens/s (4-bit quantized) on a single A100-40GB for the base 7B model (Qwen et al., 19 Dec 2024).
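As a deployment-level illustration of the YARN-style context extension mentioned above, the following sketch enables RoPE scaling for the instruct checkpoint via Hugging Face transformers. The rope_scaling fields mirror the pattern documented on the Qwen2.5 model cards and are reproduced here as assumptions to verify there; they are not taken from the papers cited in this section.

```python
# Hedged sketch: YARN-style context extension for Qwen2.5-7B-Instruct with
# Hugging Face transformers. Verify the rope_scaling fields against the
# official Qwen2.5 model card; they are assumptions in this sketch.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
config.rope_scaling = {
    "type": "yarn",
    "factor": 4.0,                                # 32K x 4 = 128K effective context
    "original_max_position_embeddings": 32768,
}
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```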
Quantized weights (4-bit, 8-bit) are supported via GPTQ, AWQ, and QLoRA; memory footprint for 4-bit 7B is ~4GB.
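For the quantized deployments mentioned above, a minimal sketch of loading the instruct checkpoint in 4-bit with transformers and bitsandbytes is shown below. The NF4 settings are illustrative defaults chosen for this sketch, not values taken from the Qwen papers.

```python
# Hedged sketch: loading Qwen2.5-7B-Instruct in 4-bit (bitsandbytes NF4).
# Quantization settings are illustrative defaults, not from the Qwen papers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
print(f"Loaded on: {model.device}")
```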
6. Multilinguality, Fine-Tuning, and Open-Source Ecosystem
Qwen2.5-7B natively supports roughly 30 languages, with model artifacts and deployment resources openly available from HuggingFace, ModelScope, and GitHub. All variants retain efficient inference on commodity GPUs (the 7B model fits in FP16 on 16 GB of VRAM, quantized deployment runs on 8 GB, and on-device Portuguese usage is feasible). Fine-tuned and merged derivatives (Amadeus-Verbo for Portuguese, Qwen2.5-Math for advanced math, Qwen2.5-Coder for code generation) reuse the full Transformer without adapters, maintaining architectural integrity (Qwen et al., 19 Dec 2024, Cruz-Castañeda et al., 20 May 2025).
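A minimal usage sketch for the openly released instruct checkpoint via Hugging Face transformers is shown below; the multilingual (Portuguese) prompt is an illustrative example chosen for this sketch, not taken from the cited papers.

```python
# Hedged sketch: chat generation with Qwen2.5-7B-Instruct via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explique em uma frase o que é atenção em Transformers."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```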
Instruction tuning is performed as full-parameter SFT, yielding improved accuracy on language-specific classification and similarity benchmarks and moderate gains on hard open-domain tasks. Spherical linear interpolation (SLERP) is used to merge base and instruct variants for greater versatility (Cruz-Castañeda et al., 20 May 2025).
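SLERP merging interpolates along the great circle between two flattened weight tensors rather than linearly averaging them. The sketch below is a generic per-tensor implementation under the assumption that the base and instruct checkpoints share identical parameter shapes; it is not the Amadeus-Verbo merging code.

```python
import torch

def slerp(w_base: torch.Tensor, w_inst: torch.Tensor, t: float = 0.5,
          eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two same-shaped weight tensors."""
    a, b = w_base.flatten().float(), w_inst.flatten().float()
    a_n, b_n = a / (a.norm() + eps), b / (b.norm() + eps)
    omega = torch.arccos(torch.clamp(torch.dot(a_n, b_n), -1.0, 1.0))
    if omega.abs() < 1e-6:  # nearly parallel: fall back to linear interpolation
        merged = (1 - t) * a + t * b
    else:
        merged = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / torch.sin(omega)
    return merged.view_as(w_base).to(w_base.dtype)

def merge_state_dicts(sd_base: dict, sd_inst: dict, t: float = 0.5) -> dict:
    """Merge two checkpoints tensor-by-tensor (shapes assumed identical)."""
    return {k: slerp(v, sd_inst[k], t) for k, v in sd_base.items()}
```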
7. Variant-Specific Enhancement and Limitations
Qwen2.5-7B contains no MoE layers at 7B scale, reflecting a trend toward maximal inference predictability and efficient quantization. Advanced post-training techniques (multi-stage RL, control-token expansion, structured verification) further enhance robustness in downstream tasks. The math and coding variants (Qwen2.5-Math-7B, Qwen2.5-Coder-7B) demonstrate that domain-adaptive pipelines can induce strong task-specific capabilities with no architectural modification.
Limitations remain in further context scaling—beyond 1M tokens—where intricate kernel/sparse scheduling and staged pre-training are required. For the Portuguese Amadeus-Verbo model, full-parameter tuning is computationally intensive (219 GPU-hours + model merging), and some domain gaps remain due to dataset limitations. All Qwen2.5-7B derivatives lack retrieval augmentation or external memory natively.
Qwen2.5-7B provides a high-performance, resource-efficient foundation for instruction-following, multilingual, coding, mathematical-reasoning, and long-context NLP, with extensive support for quantization, fine-tuning, and cross-lingual adaptation. Its empirical results surpass all previous Qwen1.5 and Qwen2 7B models and match or exceed Mistral-7B, Llama-3-8B, and Gemma2-9B across standard academic benchmarks (Qwen et al., 19 Dec 2024, Yang et al., 15 Jul 2024, Yang et al., 18 Sep 2024, Hui et al., 18 Sep 2024, Yang et al., 26 Jan 2025, Cruz-Castañeda et al., 20 May 2025).