Qwen2.5-72B Base Model
- Qwen2.5-72B is a 72-billion-parameter, decoder-only Transformer employing innovations like Grouped Query Attention, SwiGLU, and RoPE to enhance long-context and multilingual performance.
- Its architecture combines memory-efficient attention with training-stability techniques such as pre-LayerNorm with RMSNorm, and supports context lengths of up to 32,768 tokens.
- Benchmark evaluations show that Qwen2.5-72B achieves competitive zero- and few-shot performance in reasoning, mathematics, coding, and multilingual tasks, often surpassing larger models.
Qwen2.5-72B Base Model is a 72-billion-parameter, decoder-only Transformer LLM introduced in the Qwen2.5 series. Developed as an open-weight, research-accessible system, it incorporates advanced architectural and efficiency innovations and is pre-trained on a highly curated, large-scale, multilingual, multidisciplinary corpus. Qwen2.5-72B demonstrates top-tier zero- and few-shot capabilities across general, reasoning, coding, mathematical, and multilingual benchmarks, often matching or surpassing models with significantly higher parameter counts (Qwen et al., 19 Dec 2024).
1. Model Architecture
Qwen2.5-72B implements a decoder-only Transformer following the GPT-family paradigm. Its configuration is summarized by the following specifications:
- Transformer layers: 80
- Hidden (model) dimension (H): 12,288
- Attention heads: 64 query heads / 8 key-value heads (Grouped Query Attention, GQA)
- Per-head dimension: 192
- MLP inner dimension: 49,152
- Pre-LayerNorm with RMSNorm for training stability
- Nonlinearity: SwiGLU
- Rotary Positional Embeddings (RoPE) with tunable base frequency
- QKV bias terms for improved length extrapolation
- Total parameters: ~72B
Grouped Query Attention (GQA), a key architectural feature, decouples the number of query heads from the number of key-value heads to reduce KV-cache memory requirements, enabling faster inference and longer-context handling. SwiGLU activation increases the expressivity of the feed-forward network. RoPE with a larger, adjustable frequency base, together with QKV bias terms, improves generalization to context windows well beyond the training length. Pre-LayerNorm (normalization applied before each attention and feed-forward sublayer, rather than after the residual addition) combined with RMSNorm keeps training numerically stable.
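The grouping can be made concrete with a short sketch. Below is a minimal PyTorch illustration of the same idea at toy scale (8 query heads sharing 2 key-value heads rather than the model's 64/8; projection matrices are random, and QKV bias and RoPE are omitted). It is not the released implementation:

```python
# Minimal sketch of Grouped Query Attention (GQA) at toy scale.
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2):
    """x: (batch, seq, hidden); wq: (hidden, n_q_heads*d); wk, wv: (hidden, n_kv_heads*d)."""
    B, T, _ = x.shape
    d_head = wq.shape[1] // n_q_heads

    q = (x @ wq).view(B, T, n_q_heads, d_head).transpose(1, 2)   # (B, Hq, T, d)
    k = (x @ wk).view(B, T, n_kv_heads, d_head).transpose(1, 2)  # (B, Hkv, T, d)
    v = (x @ wv).view(B, T, n_kv_heads, d_head).transpose(1, 2)

    # Each group of n_q_heads // n_kv_heads query heads shares one K/V head,
    # so only the n_kv_heads key-value projections ever need to be cached.
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # expand to (B, Hq, T, d) for the matmul
    v = v.repeat_interleave(group, dim=1)

    scores = (q @ k.transpose(-2, -1)) / d_head ** 0.5
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))            # causal masking
    out = F.softmax(scores, dim=-1) @ v                           # (B, Hq, T, d)
    return out.transpose(1, 2).reshape(B, T, n_q_heads * d_head)

# Toy usage: hidden size 64, 8 query heads sharing 2 key-value heads (d_head = 8).
x = torch.randn(1, 16, 64)
wq = torch.randn(64, 8 * 8)
wk, wv = torch.randn(64, 2 * 8), torch.randn(64, 2 * 8)
out = grouped_query_attention(x, wq, wk, wv)   # -> shape (1, 16, 64)
```

Because only the key-value heads are projected and cached during generation, the KV cache shrinks by the query-to-KV head ratio, the property quantified in Section 5.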
| Model | Layers | Heads (Q/KV) | Hidden (H) | Params |
|---|---|---|---|---|
| 72B | 80 | 64 / 8 | 12,288 | 72 B |
| 32B | 64 | 40 / 8 | 8,192 | 32 B |
| 14B | 48 | 40 / 8 | 6,656 | 14 B |
2. Pre-training Data
Qwen2.5-72B is pre-trained on 18 trillion tokens—a substantial increase from previous iterations—across diverse domains:
- Sources: predominantly high-quality text from web crawls, books, and scientific literature; code (including CodeQwen1.5 data and The Stack, totaling 3 TB); mathematics datasets (GSM8K, MATH, GPQA, theorem proofs); and multilingual corpora.
- Synthetic data: Generated using Qwen2.5-72B-Instruct and Qwen2-Math-72B models, further filtered by reward models.
- Curation: Qwen2-Instruct models applied high-quality filtering and multi-dimensional scoring to maximize data value. Underrepresented high-value domains (academic, scientific, technical) were up-sampled, while template-heavy domains (e-commerce, social media) were down-sampled.
- Mixture and weighting: Batches dynamically mix 40% long-context and 60% shorter sequences, promoting both context window extrapolation (up to 32,768 tokens) and coverage diversity.
This extensive pre-training foundation underpins the model’s generalization in reasoning, mathematics, coding, and multilingual understanding.
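As a rough illustration of the 40/60 long/short batch mixing noted in the list above, the sketch below samples from two hypothetical length buckets; the bucket names, weights, and sampler structure are assumptions for illustration, not the actual Qwen data pipeline:

```python
# Hypothetical sketch of 40% long-context / 60% short-context batch mixing;
# bucket names and data are placeholders, not the actual pipeline.
import random

MIX = {"long_context": 0.4, "short_context": 0.6}   # stated mixture weights

def sample_batch(buckets, batch_size, weights=MIX, seed=0):
    """buckets: dict mapping bucket name -> list of pre-tokenized sequences."""
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return [rng.choice(buckets[rng.choices(names, weights=probs)[0]])
            for _ in range(batch_size)]

# Placeholder buckets: one 32,768-token and one 4,096-token sequence.
demo = {"long_context": [["tok"] * 32_768], "short_context": [["tok"] * 4_096]}
batch = sample_batch(demo, batch_size=4)   # mixes the buckets ~40/60 in expectation
```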
3. Pre-training Objective and Loss
Qwen2.5-72B utilizes the autoregressive next-token prediction objective, with cross-entropy loss over the model's vocabulary $\mathcal{V}$. For a token sequence $x = (x_1, \ldots, x_T)$, the training loss is:

$$\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right),$$

where $p_\theta(x_t \mid x_{<t})$ denotes the softmax-normalized likelihood computed from the decoder logits. The loss can also be expressed in expectation over the pre-training corpus $\mathcal{D}$:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)\right].$$
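For concreteness, a minimal PyTorch rendering of this objective (toy shapes; the logits could come from any decoder-only LM) is:

```python
# Minimal sketch of the next-token cross-entropy objective; toy shapes only.
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """logits: (batch, T, vocab_size); tokens: (batch, T) integer ids."""
    # Position t is predicted from positions < t: shift logits left, targets right.
    shifted_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    targets = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, targets)   # mean of -log p(x_t | x_<t)

# Toy usage: batch of 2 sequences, length 8, vocabulary of 100 tokens.
loss = next_token_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)))
```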
Inference-time decoding employs greedy or top-p (nucleus) sampling strategies.
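As an illustration of nucleus sampling at a single decoding step (a sketch, not Qwen's inference stack; greedy decoding would simply take the argmax of the logits):

```python
# Sketch of top-p (nucleus) sampling over one step of decoder logits.
import torch

def sample_top_p(logits: torch.Tensor, p: float = 0.9, temperature: float = 1.0) -> int:
    """logits: (vocab_size,) for the next position; returns a sampled token id."""
    probs = torch.softmax(logits / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep the smallest prefix of tokens whose probability mass reaches p.
    keep = cumulative - sorted_probs < p    # mass *before* each token is below p
    keep[0] = True                          # always keep at least the top token
    filtered = sorted_probs * keep
    filtered = filtered / filtered.sum()
    choice = torch.multinomial(filtered, num_samples=1)
    return sorted_ids[choice].item()

# Toy usage over a 10-token vocabulary.
token_id = sample_top_p(torch.randn(10), p=0.9)
```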
4. Training Hyperparameters and Compute
Training of Qwen2.5-72B adheres to scaling law principles and leverages large-scale distributed infrastructure:
- Optimizer: AdamW, weight decay 0.1
- Learning rate: linear warmup to the peak rate over the first 5% of steps, then cosine decay to zero (see the schedule sketch after this list)
- Batch size: ≈1 million tokens per batch (roughly 30 sequences of 32,768 tokens each)
- Sequence length: phased curriculum, initially 4,096 tokens (first 80% of steps), then 32,768 tokens (final 20% of steps, with the RoPE base raised from 10,000 to 1,000,000 via ABF, Adjusted Base Frequency)
- Total training steps: ~350,000
- Total compute: ~7.8×10²⁴ FLOPs
- Precision: bfloat16
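The warmup-plus-cosine schedule from the list above can be sketched as follows; the peak learning rate used here is an assumed placeholder rather than the value actually used:

```python
# Sketch of the stated schedule: linear warmup for the first 5% of steps, then
# cosine decay to zero. peak_lr is a placeholder; the actual peak is not given here.
import math

def lr_at_step(step, total_steps=350_000, peak_lr=3e-4, warmup_frac=0.05):
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps                   # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))      # cosine to ~0

# Peak at the end of warmup, roughly zero at the final step.
print(lr_at_step(17_500), lr_at_step(349_999))
```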
The context length curriculum supports next-token prediction at large window sizes without catastrophic loss degradation, an effect enabled by tuned RoPE and attention mechanisms.
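To make the RoPE base change concrete, the sketch below (with an assumed toy head dimension of 128) shows how raising the base from 10,000 to 1,000,000 stretches the longest rotary wavelength, which is what keeps relative positions distinguishable at 32k-token distances:

```python
# Illustrative only: effect of the RoPE base frequency on rotary wavelengths.
import math

def rope_wavelengths(head_dim: int, base: float) -> list:
    """Wavelength (in token positions) of each rotary frequency pair."""
    # Standard RoPE inverse frequencies: base^(-2i/d); wavelength = 2*pi / inv_freq.
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Toy head dimension of 128 (an assumption for illustration).
before = rope_wavelengths(head_dim=128, base=10_000.0)      # original base
after = rope_wavelengths(head_dim=128, base=1_000_000.0)    # after ABF to 1M

# The slowest-rotating pair now spans far more positions before repeating.
print(f"max wavelength @ base 10k: {max(before):,.0f} tokens")
print(f"max wavelength @ base 1M:  {max(after):,.0f} tokens")
```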
5. Quantization and Efficiency Techniques
- Quantized variants: Instruction-tuned models (Qwen2.5-*-Instruct) are released in 4-bit and 8-bit representations using GPTQ and AWQ quantization. The base model is published at bfloat16 precision for maximal research flexibility.
- Memory reduction: GQA shrinks the key-value cache in proportion to the ratio of query to key-value heads (8× for the 64/8 configuration here) relative to conventional multi-head attention; see the estimate after this list.
- Computation acceleration: FlashAttention and Triton-based kernels are deployed for efficient self-attention.
- Context extension: combinations of YaRN, Dual Chunk Attention (DCA), and sparse attention enable context lengths of 128K–1M tokens via post-hoc (training-free) modifications.
These design factors collectively allow for lower resource requirements at inference and extended long-context applications.
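As a back-of-the-envelope check on the GQA memory saving noted above, the following sketch estimates bfloat16 KV-cache size at a 32,768-token context using the layer and head figures listed in Section 1; the arithmetic is purely illustrative:

```python
# Purely illustrative KV-cache arithmetic (bfloat16 = 2 bytes per value).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    # K and V tensors per layer, each of shape (batch, n_kv_heads, seq_len, head_dim).
    return 2 * n_layers * batch * n_kv_heads * seq_len * head_dim * dtype_bytes

cfg = dict(n_layers=80, head_dim=192, seq_len=32_768)       # figures from Section 1
mha = kv_cache_bytes(n_kv_heads=64, **cfg)                  # all 64 heads cached (MHA)
gqa = kv_cache_bytes(n_kv_heads=8, **cfg)                   # 8 shared KV heads (GQA)
print(f"MHA: {mha / 2**30:.0f} GiB, GQA: {gqa / 2**30:.0f} GiB, ratio: {mha / gqa:.0f}x")
```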
6. Downstream Performance
Qwen2.5-72B has been benchmarked in zero- and few-shot regimes across general knowledge (MMLU, BBH), mathematics (GSM8K, MATH), code synthesis (HumanEval, MBPP), and multilingual (Multi-Exam) tasks. Key comparative metrics include:
| Category | Benchmark | Llama-3-70B | Llama-3-405B | Qwen2-72B | Qwen2.5-72B |
|---|---|---|---|---|---|
| General | MMLU | 79.5 | 85.2 | 84.2 | 86.1 |
| General | BBH | 81.0 | 85.9 | 82.4 | 86.3 |
| Mathematics | GSM8K | 77.6 | 89.0 | 89.0 | 91.5 |
| Mathematics | MATH | 42.5 | 53.8 | 50.9 | 62.1 |
| Coding | HumanEval | 48.2 | 61.0 | 64.6 | 59.1† |
| Coding | MBPP | 70.4 | 73.0 | 76.9 | 84.7 |
| Multilingual | Multi-Exam | 70.0 | — | 76.6 | 78.7 |
† HumanEval score decrease attributed to different sampling protocols.
Qwen2.5-72B matches or surpasses Llama-3-405B on most benchmarks at approximately one-fifth the parameter count. Significant gains are observed against Qwen2-72B, specifically +1.9 points on MMLU and +11.2 points on MATH. Performance on multilingual and code benchmarks highlights the efficacy of targeted pre-training curation and scaling.
7. Research Context and Significance
The Qwen2.5-72B base model exemplifies a prevailing trend of public release of performant, extensible LLMs with open weights. Its design is anchored in contemporary advances including GQA, SwiGLU, and large-scale data curation, incorporating lessons from prior open and proprietary models (notably scaling law adherence and length generalization). The model’s broad performance profile—achieved with 72B parameters—serves as a proof point for the benefits of high-volume, high-quality data scaling and targeted optimization in LLM pre-training (Qwen et al., 19 Dec 2024).
Qwen2.5-72B provides a foundation for subsequent instruction-tuned, specialist, and multimodal architectures within the Qwen2.5 ecosystem, including coding (Qwen2.5-Coder), mathematics (Qwen2.5-Math), and vision-language models, thereby impacting downstream research in aligned, domain-specialized, and long-context language modeling.