Qwen2.5-72B Base Model Overview

Updated 16 April 2026

Qwen2.5-72B Base Model is a dense, open-weight LLM with 72 billion parameters optimized for efficiency and long-context processing.
It features an 80-layer decoder-only Transformer with Grouped Query Attention, SwiGLU activations, and pre-training on 18 trillion tokens.
Its architectural design and advanced training regimen enable performance competitive with larger models across diverse tasks and domains.

Qwen2.5-72B Base Model is a dense, open-weight, 72-billion-parameter LLM developed as part of the Qwen2.5 series. By scaling up the pre-training data and refining methods across the training pipeline, it achieves performance competitive with much larger models while remaining efficient and broadly usable for research (Qwen et al., 2024).

1. Architectural Design and Parameterization

Qwen2.5-72B is a decoder-only Transformer with 80 layers. Attention uses Grouped Query Attention (GQA) with 64 query heads and 8 key/value heads per layer, so each key/value head is shared by 8 query heads; this reduces the key/value cache that must be held for large context windows. Each feed-forward block uses the SwiGLU activation function. Positions are encoded with rotary positional embeddings (RoPE), and the attention query, key, and value projections carry bias terms (QKV bias); both are intended to improve extrapolation to longer sequence lengths. The model uses pre-layer normalization with RMSNorm and a byte-level BPE vocabulary of 151,643 tokens plus 22 control tokens.
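The effect of grouping query heads can be sketched with a little arithmetic. The layer and head counts below come from the paragraph above; the hidden size of 8192 (and hence a head dimension of 128) and the 2-byte bfloat16 cache entries are assumptions not stated in this overview, so the byte figures are illustrative only.

```python
# Head and layer counts are taken from the text; HIDDEN_SIZE is an assumption.
NUM_LAYERS = 80
NUM_QUERY_HEADS = 64
NUM_KV_HEADS = 8
HIDDEN_SIZE = 8192                            # assumed, not stated in the text
HEAD_DIM = HIDDEN_SIZE // NUM_QUERY_HEADS     # 128 under that assumption
BYTES_PER_VALUE = 2                           # assumed bfloat16 cache entries


def kv_head_for_query_head(q_head: int) -> int:
    """Map a query head index to the key/value head its group shares."""
    group_size = NUM_QUERY_HEADS // NUM_KV_HEADS
    return q_head // group_size


def kv_cache_bytes_per_token(num_kv_heads: int) -> int:
    """Bytes of keys plus values cached per token, summed over all layers."""
    return 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE


gqa_bytes = kv_cache_bytes_per_token(NUM_KV_HEADS)
mha_bytes = kv_cache_bytes_per_token(NUM_QUERY_HEADS)

print(f"query heads per KV head: {NUM_QUERY_HEADS // NUM_KV_HEADS}")
print(f"KV cache per token with GQA: {gqa_bytes / 1024:.0f} KiB")
print(f"KV cache per token with one KV head per query head: {mha_bytes / 1024:.0f} KiB")
```

Under these assumptions the cache shrinks by the same factor as the head ratio (8x), which is the main practical benefit of GQA at long context lengths.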

The model supports a context length of 128K tokens, with generation capped at 8K tokens. Tied embeddings are not used in the 72B configuration. With this configuration, the model matches or exceeds models such as Llama-3-405B (405B parameters) on most benchmarks despite having approximately one-fifth the parameter count.

Table: Core Architectural Specifications (Qwen2.5 Series Extract)

Model Size	Layers	Heads (Q / KV)	Context / Gen Length
72B	80	64 / 8	128K / 8K

2. Pre-training Data Regimen and Objective

The model is pre-trained on 18 trillion tokens, over 2.5 times the volume used for its predecessor (Qwen2). The dataset is a curated mixture of web data, books, code (with contributions from Qwen2.5-Coder), mathematical reasoning data (from Qwen2.5-Math), scientific text, and high-quality synthetic data filtered using Qwen2-72B-Instruct and the Qwen2-Math-RM-72B reward model. Domain balancing down-samples template-like social and e-commerce content and emphasizes high-value content from technical, scientific, and academic sources.
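Domain balancing of this kind can be pictured as per-domain resampling. The sketch below is only an illustration: the domain labels, the sampling rates, and the `rebalance` helper are hypothetical and are not taken from the technical report.

```python
import random

# Hypothetical per-domain sampling rates (not from the report): values below 1
# down-sample a domain, values above 1 repeat its documents.
SAMPLING_RATE = {"e-commerce": 0.3, "social": 0.5, "science": 2.0}


def rebalance(docs, rates, seed=0):
    """Return a resampled corpus; each doc is a (domain, text) pair."""
    rng = random.Random(seed)
    out = []
    for domain, text in docs:
        rate = rates.get(domain, 1.0)      # unlisted domains are kept as-is
        copies = int(rate)                 # whole repeats
        if rng.random() < rate - copies:   # fractional part, kept at random
            copies += 1
        out.extend([(domain, text)] * copies)
    return out


# Toy corpus: many template-like listings, few scientific documents.
corpus = [("e-commerce", f"listing {i}") for i in range(1000)]
corpus += [("science", f"paper {i}") for i in range(100)]

balanced = rebalance(corpus, SAMPLING_RATE)
counts = {}
for domain, _ in balanced:
    counts[domain] = counts.get(domain, 0) + 1
print(counts)
```

In a real pipeline the rates would come from quality and domain classifiers rather than a hand-written table.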

The context curriculum consists of two phases: an initial phase at 4,096 tokens, followed by expansion to 32,768 tokens, with the RoPE base frequency raised to 1,000,000 using the ABF technique. The training objective is the conventional next-token cross-entropy: $\mathcal{L}_{CE}(\theta) = -\sum_{t=1}^T \log p_\theta\bigl(x_t \mid x_{<t}\bigr)$, where $x_{<t}$ denotes the prefix tokens preceding position $t$. Hyperparameter schedules, including batch size and learning rate, follow scaling-law prescriptions in the style of Chinchilla and Kaplan.
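Both ideas in this paragraph can be made concrete with a minimal sketch. The first function shows why raising the RoPE base helps at 32,768 tokens: the slowest-rotating dimension pair has a much longer wavelength. The second computes the summed next-token cross-entropy exactly as in the formula above. The head dimension of 128 and the starting base of 10,000 are assumptions not given in this overview.

```python
import math


def rope_max_wavelength(base: float, head_dim: int) -> float:
    """Longest RoPE wavelength in tokens.

    RoPE rotates dimension pair i with inverse frequency base**(-2*i/head_dim);
    the slowest pair is i = head_dim/2 - 1.
    """
    slowest_inv_freq = base ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_inv_freq


def next_token_cross_entropy(logits, targets):
    """Sum over positions t of -log softmax(logits[t])[targets[t]]."""
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_norm - row[target]
    return total


HEAD_DIM = 128  # assumed head dimension
for base in (10_000, 1_000_000):  # 10,000 is an assumed starting base
    print(f"base={base}: longest wavelength ~ {rope_max_wavelength(base, HEAD_DIM):,.0f} tokens")

# Uniform logits over a 4-token vocabulary at 3 positions give 3 * log(4).
loss = next_token_cross_entropy([[0.0] * 4] * 3, [0, 1, 2])
print(f"loss = {loss:.4f}")
```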

3. Post-training: Supervised and Reinforcement Learning Alignment

Following pre-training, the base model is taken through several post-training stages to produce the instruction-tuned variant:

Supervised Fine-Tuning (SFT): Conducted over two epochs on more than one million samples, this phase covers long-sequence generation (up to 8,192 tokens), chain-of-thought mathematical reasoning (including GPQA and GSM8K), code in approximately 40 languages (with static checks and unit tests), multi-turn instruction following, structured data comprehension, and logical plus cross-lingual reasoning. Training uses a sequence length of 32,768, linear learning rate decay from $7 \times 10^{-6}$ to $7 \times 10^{-7}$, weight decay 0.1, and gradient clipping at 1.0.
Offline Reinforcement Learning: Direct Preference Optimization (DPO) operates on approximately 150,000 preference pairs derived from SFT outputs and filtered via human and automated review; training is conducted for one epoch at a learning rate of $7 \times 10^{-7}$.
Online Reinforcement Learning: Group Relative Policy Optimization (GRPO) uses a proprietary reward model trained on preference data labeled for truthfulness, helpfulness, conciseness, relevance, harmlessness, and debiasing. The RL objective is approximated by a policy-gradient loss of the form $\mathcal{L}_{PPO} = -\,\mathbb{E}_{\tau\sim\pi_\theta}[\,r(\tau)\,\log \pi_\theta(\tau)\,]$, with additional entropy and clipping terms.
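The three stages above can be illustrated with small, self-contained functions. This is a sketch of the standard textbook forms, not the authors' training code: the learning-rate endpoints come from the text, while the DPO temperature `beta`, the use of the population standard deviation, and all function names are assumptions made for illustration.

```python
import math
import statistics


def sft_learning_rate(step, total_steps, start=7e-6, end=7e-7):
    """Linear decay from `start` to `end` over `total_steps` (values from the text)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities.

    beta is an assumed value; the overview does not state it.
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one prompt's sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


print(sft_learning_rate(0, 1000), sft_learning_rate(1000, 1000))
print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

The group-relative advantage is what distinguishes GRPO from PPO with a learned value function: each sampled response is scored against the other responses to the same prompt.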

4. Quantization and Inference Optimization

Qwen2.5-72B is released in bfloat16 and in quantized 8-bit and 4-bit variants. Quantization employs post-training procedures and QLoRA-style adapters to minimize performance degradation. Inference-acceleration techniques include FlashAttention and Dual Chunk Attention (DCA), complemented by YaRN for sequence lengths exceeding 32K tokens. This enables efficient deployment at context lengths up to 1M tokens for relevant variants (e.g., Qwen2.5-Turbo).
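A rough sense of what each precision means for deployment comes from the weight footprint alone. The sketch below assumes a nominal 72 billion parameters (the exact count differs slightly) and ignores quantization metadata such as scales and zero-points, as well as activations and the key/value cache.

```python
PARAMS = 72e9  # nominal count from the model name; an approximation

BITS_PER_WEIGHT = {"bfloat16": 16, "8-bit": 8, "4-bit": 4}


def weight_gigabytes(num_params: float, bits: int) -> float:
    """Storage for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>8}: {weight_gigabytes(PARAMS, bits):.0f} GB")
```

Under these assumptions the weights take roughly 144 GB, 72 GB, and 36 GB respectively, which is why the quantized variants fit on far fewer accelerators.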

5. Empirical Performance and Comparative Evaluation

Benchmarking shows that Qwen2.5-72B matches or surpasses peer 70B+ models, and often larger proprietary and open LLMs, on standard evaluation suites in zero-shot and few-shot regimes. Representative results:

Dataset	Llama-3-70B	Llama-3-405B	Qwen2.5-72B
MMLU (General)	79.5	85.2	86.1
BBH	81.0	85.9	86.3
TruthfulQA	45.6	—	60.4
GPQA (Math/Sci)	36.3	—	45.9
MATH (Math/Sci)	42.5	53.8	62.1
GSM8K (Math/Sci)	77.6	89.0	91.5
HumanEval (Coding)	48.2	61.0	59.1
MBPP (Coding)	70.4	73.0	84.7
Multilingual	79.9	—	89.6

On nearly all tasks, Qwen2.5-72B outperforms its predecessor (Qwen2-72B) and is competitive with much larger models, notably Llama-3-405B, despite the latter being approximately five times larger in parameter count.
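The comparison with Llama-3-405B can be checked directly from the table. The snippet below copies the rows for which both models have a score and computes the per-benchmark margin along with the parameter ratio; the variable names are illustrative only.

```python
# Scores copied from the table above as (Llama-3-405B, Qwen2.5-72B).
# Rows without a Llama-3-405B entry are omitted.
SCORES = {
    "MMLU": (85.2, 86.1),
    "BBH": (85.9, 86.3),
    "MATH": (53.8, 62.1),
    "GSM8K": (89.0, 91.5),
    "HumanEval": (61.0, 59.1),
    "MBPP": (73.0, 84.7),
}

# Positive delta means Qwen2.5-72B scores higher.
deltas = {name: round(qwen - llama, 1) for name, (llama, qwen) in SCORES.items()}
behind = [name for name, d in deltas.items() if d < 0]

for name, d in deltas.items():
    print(f"{name:<10} {d:+.1f}")
print("Qwen2.5-72B trails on:", behind)
print(f"parameter ratio 72B / 405B: {72 / 405:.2f}")
```

Of the six benchmarks with both scores, HumanEval is the only one where Llama-3-405B leads, which is consistent with the "nearly all tasks" wording.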

6. Distinction from Qwen2.5-72B-Instruct and Specializations

The base Qwen2.5-72B model is trained exclusively via the cross-entropy objective. Its instruction-tuned sibling, Qwen2.5-72B-Instruct, additionally undergoes SFT, DPO, and GRPO. This extended post-training yields substantial performance improvements: MMLU-Pro (+13 points), MATH (+21 points), GSM8K (+4 points), and HumanEval (+28 points). Instruct variants can also be evaluated on instruction-following and human-preference benchmarks (e.g., IFEval, Arena-Hard), which are not meaningful for the pure base model.

Furthermore, Qwen2.5-72B serves as a foundation for specialized models, including Qwen2.5-Math, Qwen2.5-Coder, QwQ, and various multimodal instantiations, indicating broad extensibility.

7. Deployment, Accessibility, and Broader Impact

All open-weight Qwen2.5-72B variants are made available for research and practical deployment in multiple precisions. Quantized forms enable efficient inference on diverse hardware while maintaining competitive accuracy. The model forms a foundation for the MoE-based hosted solutions Qwen2.5-Turbo and Qwen2.5-Plus, which are accessible via Alibaba Cloud Model Studio and claim superior cost-effectiveness when benchmarked against GPT-4o-mini and GPT-4o.

Qwen2.5-72B's combination of efficient parameterization, long-context pre-training, and robust cross-domain generalization makes it a strong reference point among open-weight models of its size and supports the rapid development of domain-specialized and multimodal models (Qwen et al., 2024).


References

Qwen et al. (2024). Qwen2.5 Technical Report.
