Qwen2.5-72B Base Model

Updated 22 December 2025
  • Qwen2.5-72B is a 72-billion-parameter, decoder-only Transformer employing innovations like Grouped Query Attention, SwiGLU, and RoPE to enhance long-context and multilingual performance.
  • Its architecture integrates efficient memory handling and stability techniques, such as pre-LayerNorm with RMSNorm, enabling context extensions up to 32,768 tokens.
  • Benchmark evaluations reveal that Qwen2.5-72B achieves competitive zero- and few-shot capabilities in reasoning, mathematics, coding, and multilingual tasks, often matching or surpassing larger models.

Qwen2.5-72B Base Model is a 72-billion-parameter, decoder-only Transformer LLM introduced in the Qwen2.5 series. Developed as an open-weight, research-accessible system, it incorporates advanced architectural and efficiency innovations and is pre-trained on a highly curated, large-scale, multilingual, multidisciplinary corpus. Qwen2.5-72B demonstrates top-tier zero- and few-shot capabilities across general, reasoning, coding, mathematical, and multilingual benchmarks, often matching or surpassing models with significantly higher parameter counts (Qwen et al., 19 Dec 2024).

1. Model Architecture

Qwen2.5-72B implements a decoder-only Transformer following the GPT-family paradigm. Its configuration is summarized by the following specifications:

  • Transformer layers: 80
  • Hidden (model) dimension (H): 12,288
  • Attention heads: 64 query heads / 8 key-value heads (Grouped Query Attention, GQA)
  • Per-head dimension: 192
  • MLP inner dimension: 49,152
  • Pre-LayerNorm with RMSNorm for training stability
  • Nonlinearity: SwiGLU
  • Rotary Positional Embeddings (RoPE) with tunable base frequency
  • QKV bias terms for improved length extrapolation
  • Total parameters: ~72B

Grouped Query Attention (GQA), a key architectural feature, decouples the number of query heads from the number of key-value heads to reduce KV-cache memory requirements, facilitating faster inference and extended context handling. SwiGLU activation enhances expressivity in the feed-forward network. RoPE, with a larger and adjustable frequency base, together with QKV bias terms, extends context-window generalization, making the model suitable for very long-sequence extrapolation. Pre-LayerNorm (normalization applied before each attention and feed-forward sub-layer, inside the residual branch) combined with RMSNorm addresses numerical stability.

Model   Layers   Heads (Q / KV)   Hidden (H)   Params
72B     80       64 / 8           12,288       72 B
32B     64       40 / 8           8,192        32 B
14B     48       40 / 8           6,656        14 B
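
The memory saving from GQA comes from sharing each key-value head across a group of query heads, so the KV cache scales with the (much smaller) number of KV heads. The following minimal PyTorch sketch illustrates the mechanism with toy dimensions; it is not the model's actual implementation, and all sizes are illustrative.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Minimal GQA: n_q_heads query heads share n_kv_heads key/value heads."""
    B, T, _ = x.shape
    head_dim = wq.shape[1] // n_q_heads
    group = n_q_heads // n_kv_heads                      # query heads per KV head

    q = (x @ wq).view(B, T, n_q_heads, head_dim).transpose(1, 2)   # (B, Hq,  T, d)
    k = (x @ wk).view(B, T, n_kv_heads, head_dim).transpose(1, 2)  # (B, Hkv, T, d)
    v = (x @ wv).view(B, T, n_kv_heads, head_dim).transpose(1, 2)

    # Broadcast each KV head to its group of query heads.
    k = k.repeat_interleave(group, dim=1)                # (B, Hq, T, d)
    v = v.repeat_interleave(group, dim=1)

    attn = (q @ k.transpose(-2, -1)) / head_dim ** 0.5
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    attn = attn.masked_fill(causal, float("-inf"))
    out = F.softmax(attn, dim=-1) @ v                    # (B, Hq, T, d)
    return out.transpose(1, 2).reshape(B, T, n_q_heads * head_dim)

# Toy usage with scaled-down dimensions (the real model uses far larger ones,
# e.g. the 64/8 head split listed above).
B, T, H, hd, nq, nkv = 2, 16, 512, 64, 8, 2
x = torch.randn(B, T, H)
wq = torch.randn(H, nq * hd)
wk = torch.randn(H, nkv * hd)
wv = torch.randn(H, nkv * hd)
y = grouped_query_attention(x, wq, wk, wv, n_q_heads=nq, n_kv_heads=nkv)
print(y.shape)  # torch.Size([2, 16, 512])
```

Because only the 8 KV heads are cached at inference time, the per-token cache footprint depends on n_kv_heads rather than on the full query head count.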

2. Pre-training Data

Qwen2.5-72B is pre-trained on 18 trillion tokens—a substantial increase from previous iterations—across diverse domains:

  • Sources: Predominantly high-quality text from web crawls, books, scientific literature, code (including CodeQwen1.5 and The Stack, totaling 3 TB), mathematics datasets (GSM8K, MATH, GPQA, theorem proofs), and multilingual corpora.
  • Synthetic data: Generated using Qwen2-72B-Instruct and Qwen2-Math-72B models, then further filtered by reward models.
  • Curation: Qwen2-Instruct models were used to perform quality filtering and multi-dimensional scoring of candidate data. Underrepresented high-value domains (academic, scientific, technical) were up-sampled, while template-heavy domains (e-commerce, social media) were down-sampled.
  • Mixture and weighting: Batches dynamically mix 40% long-context and 60% shorter sequences, promoting both context window extrapolation (up to 32,768 tokens) and coverage diversity.

This extensive pre-training foundation underpins the model’s generalization in reasoning, mathematics, coding, and multilingual understanding.
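
One way to picture the up-/down-sampling and the long/short sequence mixing described above is as weighted sampling over domain buckets. The sketch below is purely illustrative: the domain names, weights, and the 40/60 long-vs-short split are hypothetical stand-ins, not the actual mixture used for Qwen2.5.

```python
import random

# Hypothetical domain weights: high-value domains up-sampled,
# template-heavy ones down-sampled. Not the real Qwen2.5 mixture.
DOMAIN_WEIGHTS = {
    "web": 0.45,
    "academic": 0.15,          # up-sampled
    "code": 0.15,
    "math": 0.10,
    "multilingual": 0.10,
    "ecommerce_social": 0.05,  # down-sampled
}

LONG_CONTEXT_FRACTION = 0.40   # 40% long-context samples, 60% shorter ones

def sample_batch_spec(batch_size, rng=random):
    """Draw (domain, target_length) pairs according to the mixture weights."""
    domains = list(DOMAIN_WEIGHTS)
    weights = list(DOMAIN_WEIGHTS.values())
    batch = []
    for _ in range(batch_size):
        domain = rng.choices(domains, weights=weights, k=1)[0]
        length = 32_768 if rng.random() < LONG_CONTEXT_FRACTION else 4_096
        batch.append((domain, length))
    return batch

if __name__ == "__main__":
    for domain, length in sample_batch_spec(8):
        print(f"{domain:>18}  {length:>6} tokens")
```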

3. Pre-training Objective and Loss

Qwen2.5-72B utilizes the autoregressive next-token prediction objective, with cross-entropy loss over the model's vocabulary $V$. For a token sequence $s = (s_1, \ldots, s_T)$, the training loss is:

$$L(\theta) = -\sum_{t=1}^{T} \log p_\theta(s_t \mid s_{<t})$$

where $p_\theta$ denotes the softmax-normalized likelihood computed from decoder logits. The loss can also be expressed in expectation:

$$L(\theta) = \mathbb{E}_{s}\left[-\log p_\theta(s)\right], \qquad p_\theta(s) = \prod_{t=1}^{T} p_\theta(s_t \mid s_{<t})$$

Inference-time decoding employs greedy or top-$k$ sampling strategies.
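
In code, this objective is the standard shifted next-token cross-entropy. The sketch below shows the computation on a toy logits tensor; it is a generic illustration under the definitions above, not the actual training code.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits, tokens):
    """Cross-entropy between position-t logits and the token at position t+1.

    logits: (batch, seq_len, vocab_size) decoder outputs
    tokens: (batch, seq_len) input token ids
    """
    # Predict s_{t+1} from the prefix s_{<=t}: drop the last logit, drop the first token.
    shifted_logits = logits[:, :-1, :]
    targets = tokens[:, 1:]
    return F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
    )

# Toy example with a tiny vocabulary.
B, T, V = 2, 8, 32
logits = torch.randn(B, T, V)
tokens = torch.randint(0, V, (B, T))
print(next_token_loss(logits, tokens))  # scalar: mean negative log-likelihood per position
```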

4. Training Hyperparameters and Compute

Training of Qwen2.5-72B adheres to scaling law principles and leverages large-scale distributed infrastructure:

  • Optimizer: AdamW, $\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0.1
  • Peak learning rate: $1.2 \times 10^{-4}$
  • Warmup: 5% of total steps, linear
  • Learning rate decay: Cosine, decaying to zero
  • Batch size: ~1 million tokens per batch (≈ 32 sequences of 32,768 tokens each)
  • Sequence length: Phased; initially 4,096 tokens (first 80% of steps), then 32,768 tokens (final 20% of steps, with the RoPE base frequency raised from 10k to 1M via ABF)
  • Total training steps: ~350,000
  • Total compute: ~7.8×10²⁴ FLOPs (≈ 6·N·D for N ≈ 72B parameters and D ≈ 18T tokens)
  • Precision: bfloat16
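
The warmup-then-cosine schedule listed above can be expressed in a few lines. The sketch below is a generic implementation of that schedule under the stated hyperparameters (5% linear warmup, cosine decay to zero), not the actual training code.

```python
import math

def lr_at_step(step, total_steps, peak_lr=1.2e-4, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to min_lr."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 350_000
for s in (0, 17_500, 100_000, total - 1):
    print(s, f"{lr_at_step(s, total):.2e}")  # warmup peak reached at step 17,500
```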

The context length curriculum supports next-token prediction at large window sizes without catastrophic loss degradation, an effect enabled by tuned RoPE and attention mechanisms.
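
The RoPE base change described above (10k to 1M) only alters how quickly the rotation angles grow with position: a larger base yields lower-frequency rotations, which is what lets attention remain well-behaved over much longer windows. The sketch below shows RoPE with an adjustable base on toy dimensions; the channel-pairing convention is an assumption and may differ from the model's implementation.

```python
import torch

def rope_angles(seq_len, head_dim, base=10_000.0):
    """Rotation angles theta_{t,i} = t * base^(-2i/head_dim)."""
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)  # (head_dim/2,)
    positions = torch.arange(seq_len).float()                              # (seq_len,)
    return torch.outer(positions, inv_freq)                                # (seq_len, head_dim/2)

def apply_rope(x, base=10_000.0):
    """Apply rotary embedding to x of shape (..., seq_len, head_dim)."""
    seq_len, head_dim = x.shape[-2], x.shape[-1]
    ang = rope_angles(seq_len, head_dim, base)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]          # pair up even/odd channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 4, 1024, 64)            # (batch, heads, seq, head_dim), toy sizes
q_short = apply_rope(q, base=10_000.0)     # original base frequency
q_long  = apply_rope(q, base=1_000_000.0)  # enlarged base used for the 32k phase
print(q_short.shape, q_long.shape)
```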

5. Quantization and Efficiency Techniques

  • Quantized variants: Instruction-tuned models (Qwen2.5-*-Instruct) are released in 4-bit and 8-bit representations using GPTQ and AWQ quantization. The base model is published at bfloat16 precision for maximal research flexibility.
  • Memory reduction: GQA decreases key-value cache consumption by approximately 1.5× compared to conventional transformer attention in large models.
  • Computation acceleration: FlashAttention and Triton-based kernels are deployed for efficient self-attention.
  • Context extension: Combinations of YaRN, Dual Chunk Attention (DCA), and sparse attention enable context lengths of 128k–1M tokens via post-hoc (training-free) modifications.

These design factors collectively allow for lower resource requirements at inference and extended long-context applications.
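
For deployment, quantized instruct variants can typically be loaded through Hugging Face transformers. The snippet below is an illustrative assumption: the repository id, generation settings, and the presence of GPTQ support libraries (e.g. optimum/auto-gptq) are assumed rather than taken from the Qwen2.5 release documentation.

```python
# Illustrative only: the repository id and options below are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-72B-Instruct-GPTQ-Int4"  # assumed 4-bit GPTQ variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",     # shard across available GPUs
    torch_dtype="auto",    # keep the checkpoint's dtype for non-quantized tensors
)

prompt = "Explain grouped query attention in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```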

6. Downstream Performance

Qwen2.5-72B has been benchmarked in zero- and few-shot regimes across general knowledge (MMLU, BBH), mathematics (GSM8K, MATH), code synthesis (HumanEval, MBPP), and multilingual (Multi-Exam) tasks. Key comparative metrics include:

Category       Benchmark    Llama-3-70B   Llama-3-405B   Qwen2-72B   Qwen2.5-72B
General        MMLU         79.5          85.2           84.2        86.1
               BBH          81.0          85.9           82.4        86.3
Mathematics    GSM8K        77.6          89.0           89.0        91.5
               MATH         42.5          53.8           50.9        62.1
Coding         HumanEval    48.2          61.0           64.6        59.1†
               MBPP         70.4          73.0           76.9        84.7
Multilingual   Multi-Exam   70.0          –              76.6        78.7

† HumanEval score decrease (relative to Qwen2-72B) attributed to different sampling protocols.

Qwen2.5-72B matches or surpasses Llama-3-405B on most benchmarks at approximately one-fifth the parameter count. Significant gains are observed against Qwen2-72B, specifically +1.9 points on MMLU and +11.2 points on MATH. Performance on multilingual and code benchmarks highlights the efficacy of targeted pre-training curation and scaling.

7. Research Context and Significance

The Qwen2.5-72B base model exemplifies a prevailing trend of public release of performant, extensible LLMs with open weights. Its design is anchored in contemporary advances including GQA, SwiGLU, and large-scale data curation, incorporating lessons from prior open and proprietary models (notably scaling law adherence and length generalization). The model’s broad performance profile—achieved with 72B parameters—serves as a proof point for the benefits of high-volume, high-quality data scaling and targeted optimization in LLM pre-training (Qwen et al., 19 Dec 2024).

Qwen2.5-72B provides a foundation for subsequent instruction-tuned, specialist, and multimodal architectures within the Qwen2.5 ecosystem, including coding (Qwen2.5-Coder), mathematics (Qwen2.5-Math), and vision-language models, thereby influencing downstream research in aligned, domain-specialized, and long-context language modeling.

References

Qwen et al. (19 Dec 2024). Qwen2.5 Technical Report.
