
Qwen2.5 LLM Series by Alibaba

Updated 1 April 2026
  • The Qwen2.5 language model series is a comprehensive suite of advanced LLMs that excel in multilingual understanding, reasoning, and multimodal tasks.
  • It leverages an 18-trillion token corpus, multi-stage supervised fine-tuning, and RLHF to deliver superior performance in language, mathematics, and code generation.
  • The series integrates scalable architectures with innovations like rotary embeddings, grouped-query attention, and efficient long-context processing for industrial applications.

The Qwen2.5 LLM series, spearheaded by Alibaba Group, represents a comprehensive family of open-weight, instruction-tuned, and multimodal LLMs distinguished by their rigorous scaling, multilingual and domain-intensive pre-training, and state-of-the-art downstream performance. Spanning dense and Mixture-of-Experts variants from sub-billion to 72B parameter regimes, Qwen2.5 models are engineered to excel in language understanding, reasoning, mathematics, code generation, tool use, document comprehension, and real-time multimodal tasks. The series builds on foundational architecture advances, a massive 18 trillion token corpus, multi-stage supervised fine-tuning, reinforcement learning with human feedback, and a suite of practical deployment optimizations (Qwen et al., 2024).

1. Model Architectures, Scaling & Variants

Qwen2.5 encompasses an extensive array of dense Transformer-decoder LLMs, API-exposed MoE variants, distilled students, and specialized derivatives. All Qwen2.5 models are based on a core autoregressive Transformer decoder backbone with grouped-query attention, SwiGLU activations, rotary positional embeddings (RoPE), Q/K/V bias, and RMS normalization with pre-LayerNorm (Qwen et al., 2024, Yang et al., 26 Jan 2025).

| Model | Params | Layers | Heads (Q/KV) | Context / Notes |
|---|---|---|---|---|
| 0.5B, 1.5B | — | up to 28 | 14/12 | 32K |
| 3B, 7B | — | up to 48 | 16–32 | 128K |
| 14B, 32B, 72B | — | up to 80 | 40–64 | 128K |
| Qwen2.5-Turbo | MoE | — | — | up to 1M (Mixture-of-Experts) |
| Qwen2.5-1M | 7B/14B | 28/48 | — | 1M (8K generation) |
| DistilQwen2.5 | 0.5B–72B | inherits teacher | — | — |
| Qwen2.5-VL | 3B–72B | ViT + LLM | — | dynamic resolution; long video |
| Qwen2.5-Omni | 7B | multimodal | — | 32K; text, speech, image, video |

All dense models are released in both bfloat16 and various quantized (int8, int4, GPTQ) formats. The Mixture-of-Experts line (Turbo, Plus) employs sparse conditional routing and fine-grained expert partitioning for cost-performance-optimized inference, supporting context windows up to 1 million tokens on select API endpoints (Qwen et al., 2024, Yang et al., 26 Jan 2025, Xu et al., 26 Mar 2025).
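As a concrete sketch of the grouped-query attention shared by the dense models, here is a minimal NumPy implementation; the head counts and dimensions are illustrative, not an actual Qwen2.5 configuration:

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
    """Causal grouped-query attention: each group of query heads shares
    one K/V head, shrinking the KV cache by n_q_heads / n_kv_heads."""
    seq, d = x.shape
    hd = d // n_q_heads                                   # per-head dim
    q = (x @ wq).reshape(seq, n_q_heads, hd)
    k = (x @ wk).reshape(seq, n_kv_heads, hd)
    v = (x @ wv).reshape(seq, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)                       # share K within a group
    v = np.repeat(v, group, axis=1)                       # share V within a group
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(hd)
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)  # causal mask
    scores[:, mask] = -1e9
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return np.einsum('hqk,khd->qhd', w, v).reshape(seq, d)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64))
wq = rng.standard_normal((64, 64)) * 0.1
wk = rng.standard_normal((64, 16)) * 0.1   # 2 KV heads x head dim 8
wv = rng.standard_normal((64, 16)) * 0.1
out = grouped_query_attention(x, wq, wk, wv, n_q_heads=8, n_kv_heads=2)
print(out.shape)  # (8, 64)
```

With 8 query heads sharing 2 K/V heads, the K/V projections (and the KV cache at inference) are 4× smaller than full multi-head attention while the output shape is unchanged.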

2. Pre-Training Corpus, Objectives, and Scaling

Qwen2.5 pre-training scales to 18T tokens, combining high-quality multilingual web text, books, code, mathematics, and academic data. The corpus construction is filtered using Qwen2-Instruct models for fluency, factuality, and diversity. Domain balancing upscales STEM, academic, and code while down-weighting e-commerce and social media (Qwen et al., 2024). Notably:

  • Integration of domain-specialized data, including Qwen2.5-Math and Qwen2.5-Coder seed corpora.
  • Static code validation/unit-testing and verified math reasoning chains.
  • Vocabulary expanded to 151,643 tokens, supporting a suite of 22 control tokens for tool calls, JSON/tabular I/O, and cross-modality tags.

Mathematical underpinning of training follows scaling laws:

$$\mu_\text{opt} \propto N^\alpha D^\beta, \qquad B_\text{opt} \propto N^\gamma D^\delta, \qquad L(N,D) = A N^{-\alpha} + B D^{-\beta} + C$$

Guided by this, large batch sizes (up to 1M tokens), learning rates decayed from 10⁻⁴ to 10⁻⁶, and staged context windows (4K → 32K for dense models, up to 1M for the long-context line) are adopted (Qwen et al., 2024, Yang et al., 26 Jan 2025).
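The loss law above is easy to evaluate numerically. The coefficients below are the published Chinchilla fit, used purely as a stand-in; Qwen2.5's own fitted values are not public:

```python
def loss(N, D, A=406.4, B=410.7, C=1.69, alpha=0.34, beta=0.28):
    """Parametric loss law L(N, D) = A*N^-alpha + B*D^-beta + C.
    N = model parameters, D = training tokens. Constants are the
    published Chinchilla fit, shown for illustration only."""
    return A * N ** (-alpha) + B * D ** (-beta) + C

# Scaling the corpus from 2T tokens to the 18T used by Qwen2.5
# lowers the data-limited term of the predicted loss:
print(loss(7e9, 18e12) < loss(7e9, 2e12))  # True
```

The same functional form motivates jointly scheduling batch size and learning rate with model and data scale rather than tuning them independently.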

3. Post-Training: Supervised, RLHF, and Specialized Knowledge Distillation

Supervised Fine-Tuning (SFT)

Over 1M instruction–response pairs are used, including extended sequences (up to 8K tokens), long-context I/O, and chain-of-thought (CoT) data. Data curation leverages back-translation, execution feedback, rejection sampling, and code-level validation. All SFT runs over multi-epoch schedules with cross-entropy loss and gradient-norm clipping (Qwen et al., 2024).

Reinforcement Learning with Human Feedback (RLHF)

A dual-stage pipeline:

  1. Offline: Direct Preference Optimization (DPO) uses ≈150K post-SFT response pairs, favoring reasoning, factuality, and complex instruction-following.
  2. Online: Group Relative Policy Optimization (GRPO) with batch size 2048, optimizing reward functions that cover truthfulness, helpfulness, and debiasing; output-variance scheduling targets ambiguous queries.
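The offline DPO stage reduces to a simple per-pair loss, sketched below; β = 0.1 and the log-probabilities are illustrative values, not Qwen's settings:

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair:
    -log(sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])),
    where logp_* are summed token log-probs of the chosen (w) and
    rejected (l) responses under the policy and the frozen reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy favors the chosen answer more than the reference does,
# the loss falls below log(2), its value at zero margin:
print(dpo_loss(-10.0, -12.0, -11.0, -11.0) < math.log(2))  # True
```

Minimizing this pushes the policy's log-probability margin for the chosen response above the reference model's margin, with β controlling how far the policy may drift from the reference.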

Distilled and Lightweight Models

DistilQwen2.5 implements a hybrid distillation framework:

  • Black-box multi-agent instruction-rewriting (paraphrasing, chain-of-thought enhancement) under powerful (proprietary and public) LLM teachers.
  • White-box top-K logit-matching KL minimization for output distributions, and a model-fusion layer that injects selected teacher hidden states. This yields student models with roughly 2–5× lower inference cost and stronger instruction-following than their non-distilled counterparts (Wang et al., 21 Apr 2025).
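The white-box top-K logit-matching term can be sketched as below; the value of k and the renormalize-over-top-k scheme are illustrative assumptions, not the exact DistilQwen2.5 recipe:

```python
import numpy as np

def topk_kl(teacher_logits, student_logits, k=5):
    """Distillation term (sketch): KL divergence between teacher and
    student distributions restricted to the teacher's top-k vocabulary
    slots, renormalized over those slots."""
    idx = np.argsort(teacher_logits)[-k:]   # teacher's k most likely tokens

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    p = softmax(teacher_logits[idx])        # teacher dist over top-k
    q = softmax(student_logits[idx])        # student dist on the same slots
    return float(np.sum(p * (np.log(p) - np.log(q))))

t = np.array([4.0, 1.0, 0.5, 3.0, -2.0, 0.0])
print(topk_kl(t, t, k=3))  # 0.0 when the student matches the teacher
```

Restricting the KL to the teacher's top-K slots concentrates the training signal on the tokens the teacher actually considers plausible, instead of spending capacity matching the long tail of the vocabulary.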

4. Long-Context and Multimodal Extensions

Long Context (Qwen2.5-1M)

Qwen2.5-1M pushes dense context length to 1M tokens (train and inference), achieved via:

  • Synthetic and natural long-range tasks (Fill-in-the-Middle, keyword-based retrieval, paragraph reordering).
  • Progressive context-length curriculum (4K → 32K → 65K → 131K → 262K → 1M token stages), RoPE base-frequency adaptation, and dual-chunk attention (DCA) / YaRN scaling for inference-time length extrapolation (Yang et al., 26 Jan 2025).
  • Sparse attention schemes (MInference "Vertical-Slash"), chunked prefill, and hardware-adaptive kernel optimizations (BladeLLM), yielding multi-fold acceleration for million-token inputs without short-context regression.
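RoPE base-frequency adaptation rests on a simple mechanism, sketched below with illustrative base values (not Qwen2.5-1M's exact configuration): raising the base slows the per-dimension rotation, stretching the positional wavelengths to cover longer sequences.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotary position embedding for one head vector x at position pos,
    using theta_i = base**(-2i/d). A larger `base` lengthens the
    rotation wavelengths, the core of base-frequency adaptation."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

rng = np.random.default_rng(1)
q, k = rng.standard_normal(64), rng.standard_normal(64)
# Attention scores depend only on the relative offset (here 2),
# a property preserved under any choice of base:
a = rope(q, 100) @ rope(k, 98)
b = rope(q, 5000) @ rope(k, 4998)
print(np.allclose(a, b))  # True
```

Methods like YaRN refine this by rescaling the frequencies non-uniformly so short-range positional resolution is retained while long-range wavelengths are extended.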

Multimodal Capabilities

Qwen2.5-VL and Qwen2.5-Omni are flagship multimodal models:

  • Qwen2.5-VL: Native dynamic-resolution ViT (image/text/video), windowed attention (linear scaling), absolute time-encoded MRoPE for long video, and robust HTML-based document/GUI parsing. Achieves state-of-the-art on MMBench, MMStar, CC-OCR, and GUI agent benchmarks (Bai et al., 19 Feb 2025).
  • Qwen2.5-Omni: Unified perception (text, image, video, audio, speech), TMRoPE time-aligned cross-modal tokenization, and Thinker–Talker dual-track generation that allows concurrent text reasoning and low-latency speech synthesis. Outperforms similarly sized single-modal baselines on streaming speech and multimodal benchmarks (Xu et al., 26 Mar 2025).

5. Empirical Results, Specialized Derivatives, and Real-World Deployment

Core Language, Reasoning, and Alignment

Qwen2.5-72B-IT and associated MoE variants outperform most open and many proprietary baselines:

| Dataset | Qwen2.5-72B-IT | Llama-3-70B | Qwen2.5-Plus | GPT-4o-mini |
|---|---|---|---|---|
| MMLU-Pro | 71.1 | 66.4 | 72.5 | — |
| MATH | 83.1 | 68.0 | 84.7 | 70.2 |
| GSM8K | 95.8 | 95.1 | 96.0 | 93.2 |
| Arena-Hard | 81.2 | 55.7 | 81.4 | — |
| MT-Bench | 9.35 | 8.79 | 9.30 | — |

Math, Code, and Agentic Extensions

  • Qwen2.5-Math (1.5B/7B/72B): Self-synthesized math data, iterative RM-enhanced SFT, GRPO RL, bilingual (EN/CH) tool integration and SoTA on GSM8K, MATH, GPQA (Yang et al., 2024).
  • Qwen2.5-Coder: ~250K code-tuning examples, static analysis/unit tests, >90% HumanEval pass@1.
  • QwQ (32B): “Unknown” calibration for QA, boosting precision-recall on open-domain tasks.
  • Multimodal: Qwen2.5-VL-72B matches/advances over GPT-4o and Claude3.5 on all visual, document, and GUI understanding benchmarks (Bai et al., 19 Feb 2025).

Efficiency, Distillation, and Industrial Use

Resource-efficient pipelines enable deployment of ~3B-parameter Qwen2.5/DistilQwen2.5 models on commodity GPUs/NPUs with strong alignment and generation capability (Wang et al., 21 Apr 2025, Gupta, 22 Feb 2025). SQL-completion deployments achieve roughly 1.4× latency reduction at near-constant pass@1. Cloud-native distillation pipelines (KPP/DTP) facilitate bespoke domain adaptation.

6. Model Bias, Evaluation, and Responsible Deployment

A dedicated mechanistic interpretability study of Qwen2.5-Instruct models revealed scale-sensitive but persistent positional (primacy/recency) bias in financial decision contexts (Dimino et al., 25 Aug 2025). Core findings include:

  • Strong primacy bias at 1.5B/7B, reduced but not erased at 14B scales.
  • Bias arises in mid-to-late Transformer layers, concentrated in specific “universal bias heads.”
  • Bias is highly sensitive to prompt ordering and system framing; moderate system frames attenuate bias.
  • Layer- and head-wise ablations enable targeted mitigation. Responsible deployment in finance requires continuous monitoring of positional-bias metrics and periodic auditing.
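An order-shuffling audit of the kind such monitoring requires can be sketched as follows; `choose` is a hypothetical stand-in for a model prompted with the shuffled option list:

```python
import random

def primacy_rate(options, choose, trials=2000, seed=0):
    """Order-shuffling audit for primacy bias: shuffle the options,
    ask `choose` to pick one, and measure how often the first-listed
    option wins. An order-insensitive chooser lands near
    1/len(options); a rate well above that signals primacy bias."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        order = list(options)
        rng.shuffle(order)
        if choose(order) == order[0]:
            wins += 1
    return wins / trials

# A maximally primacy-biased "model" always takes the first option;
# a content-based chooser stays near chance (0.25 for four options):
print(primacy_rate("ABCD", lambda o: o[0]))  # 1.0
print(primacy_rate("ABCD", lambda o: "A"))   # near 0.25
```

The same harness works on real model calls by replacing the lambda with a function that formats the shuffled options into a prompt and parses the model's answer.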

7. Future Prospects and Technical Evolution

The Qwen2.5 series establishes a unified foundation for further scaling, multimodal extension, length extrapolation, and lightweight instruction-following. The series has demonstrated:

  • Open-source 1M-token modeling with no degradation on standard tasks, enabling retrieval, summarization, and super-long document QA at industrial speeds (Yang et al., 26 Jan 2025).
  • Extensible APIs, deployment toolkits (bfloat16, int4, GPTQ), and support for edge (≤7B) and large-scale cloud (72B/MoE) scenarios.
  • Community-driven length-extrapolation, sparser attention patterns, and cross-modality expansion.

Progress beyond Qwen2.5 will plausibly involve deeper cross-modal agent integration, context scaling to tens of millions of tokens, robust debiasing via mechanistic regularization, and expanded open-reward datasets for safer RLHF and domain adaptation.


Key references:

(Qwen et al., 2024) Qwen2.5 Technical Report
(Yang et al., 26 Jan 2025) Qwen2.5-1M Technical Report
(Bai et al., 19 Feb 2025) Qwen2.5-VL Technical Report
(Xu et al., 26 Mar 2025) Qwen2.5-Omni Technical Report
(Wang et al., 21 Apr 2025) DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight LLMs
(Dimino et al., 25 Aug 2025) Tracing Positional Bias in Financial Decision-Making: Mechanistic Insights from Qwen2.5
(Yang et al., 2024) Qwen2.5-Math Technical Report
(Gupta, 22 Feb 2025) Fine-Tuning Qwen 2.5 3B for Realistic Movie Dialogue Generation
