Qwen2.5-Plus: Sparse MoE LLM
- Qwen2.5-Plus is a proprietary LLM variant that employs sparse Mixture-of-Experts layers to boost efficiency and scalability.
- It utilizes an 80-layer decoder-only Transformer architecture with state-of-the-art supervised and reinforcement fine-tuning techniques.
- The model achieves competitive performance in mathematics, coding, and multimodal tasks while significantly reducing inference costs compared to rivals.
Qwen2.5-Plus is a proprietary high-capacity Mixture-of-Experts (MoE) LLM variant within the Qwen2.5 family, offered as a hosted API via Alibaba Cloud Model Studio. It extends the open-weight Qwen2.5 models (ranging from 0.5B to 72B parameters) by introducing sparse MoE layers, enabling efficient inference and competitive performance relative to closed-weight counterparts such as GPT-4o, while maintaining a significantly lower total inference cost. Qwen2.5-Plus incorporates advances in data scaling, training methodology, and model architecture, and serves as the foundation for several specialized domain models in mathematics, coding, and multimodal tasks (Qwen et al., 2024).
1. Architecture and MoE Mechanism
Qwen2.5-Plus adopts a decoder-only Transformer backbone, closely following the Qwen2.5-72B model, with the following main specifications:
- 80 Transformer decoder layers
- Multi-head self-attention: 64 query heads and 8 key/value heads per layer
- Rotary positional embeddings (RoPE) with pre-normalization and RMSNorm
- SwiGLU activation in feed-forward networks (FFNs)
- Maximum context length: 131,072 tokens (generation up to 8,192 tokens)
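The grouped-query attention configuration above (64 query heads sharing 8 key/value heads) reduces KV-cache memory 8× relative to full multi-head attention. A rough sketch of that factor, assuming a hypothetical per-head dimension of 128 (not stated in the source); absolute sizes depend on the true head size and any KV-cache quantization:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_el=2):
    """KV-cache size: two tensors (K and V) per layer, bf16 by default."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_el

# Assumed head_dim = 128; 80 layers, 131,072-token context.
gqa = kv_cache_bytes(80, 8, 128, 131072)   # 8 KV heads (grouped-query)
mha = kv_cache_bytes(80, 64, 128, 131072)  # hypothetical full MHA baseline

print(mha // gqa)  # → 8 (KV-cache reduction factor from GQA)
```

The reduction factor is simply the ratio of query heads to KV heads, independent of the assumed dimensions.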
The critical architectural distinction is the incorporation of sparse MoE feed-forward sublayers in place of dense FFNs. Specifically:
- Each layer’s two-layer FFN, FFN(x) = W₂·σ(W₁·x), is substituted with an MoE sublayer containing N FFN “experts,” denoted E₁, …, E_N.
- A gating network g outputs N logits per token; softmax gating probabilities p = softmax(g(x)) are used for expert selection.
- Top-k routing (k ≪ N) is performed, resulting in y = Σ_{i ∈ TopK(p)} pᵢ·Eᵢ(x).
- The sublayer combines shared experts (active for every token) with fine-grained expert segmentation, so each token activates only a small routed subset alongside the shared experts, limiting memory and compute overhead.
The result is an increase in total parameter count to approximately 120 billion (with about 32 billion parameters activated per token), while retaining a hidden dimension 4× the embedding size (≈32,768). The attention, embedding, and normalization subcomponents are identical to dense Qwen2.5-72B.
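The top-k routing described above can be sketched as follows (numpy, toy dimensions; the experts here are plain linear maps rather than SwiGLU FFNs, and this is an illustration, not the production kernel):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, w_gate, experts, k=2):
    """Sparse MoE sublayer: route each token to its top-k experts.

    x: (tokens, d_model); w_gate: (d_model, n_experts);
    experts: list of callables (1, d_model) -> (1, d_model).
    """
    probs = softmax(x @ w_gate)                # gating probabilities per token
    topk = np.argsort(probs, axis=-1)[:, -k:]  # indices of the top-k experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = probs[t, topk[t]]
        sel = sel / sel.sum()                  # renormalize over selected experts
        for w, idx in zip(sel, topk[t]):
            out[t] += w * experts[idx](x[t:t + 1])[0]
    return out

rng = np.random.default_rng(0)
d, n = 8, 4
experts = [(lambda W: (lambda h: h @ W))(rng.normal(size=(d, d)))
           for _ in range(n)]
y = moe_forward(rng.normal(size=(3, d)), rng.normal(size=(d, n)), experts, k=2)
print(y.shape)  # → (3, 8)
```

Only k of the n expert FFNs run per token, which is why the activated parameter count (~32B) stays far below the total (~120B).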
2. Pre-training Regimen
Qwen2.5-Plus is pretrained on 18 trillion tokens of high-quality, domain-balanced data—a substantial increase from the 7 trillion tokens used in Qwen2. The training corpus includes:
- Filtered web crawl (with e-commerce and social domains down-sampled; science/academia up-sampled)
- Code and math corpora (from Qwen2.5-Coder and Qwen2.5-Math)
- Synthetic examples generated by Qwen2-72B-Instruct and Qwen2-Math-72B-Instruct, rigorously filtered via reward models
A byte-level BPE tokenizer (vocab size 151,643) with language-agnostic control tokens is applied in preprocessing. The training objective is the standard autoregressive next-token prediction,
with no auxiliary distillation or contrastive losses. Context length during pre-training progresses in two stages, starting at 4,096 tokens and increasing to 32,768 tokens (the pre-training maximum). RoPE frequencies are adjusted to preserve positional fidelity over long contexts, raising the base from 10,000 to 1,000,000 via ABF (Adjusted Base Frequency).
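The ABF change above simply raises the RoPE base, which slows the per-dimension rotation frequencies so that distant positions remain distinguishable at long context. A minimal sketch, assuming the standard RoPE frequency schedule and an illustrative head dimension of 128:

```python
import numpy as np

def rope_freqs(head_dim, base):
    """Per-pair RoPE rotation frequencies: theta_i = base^(-2i/head_dim)."""
    i = np.arange(0, head_dim, 2)
    return base ** (-i / head_dim)

f_old = rope_freqs(128, 10_000)      # pre-ABF base
f_new = rope_freqs(128, 1_000_000)   # post-ABF base

# A larger base shrinks the low-end frequencies, stretching the
# positional "wavelengths" to cover much longer sequences.
print(f_new[-1] < f_old[-1])  # → True
```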
3. Supervised Fine-tuning and Reinforcement Learning
The post-training phase combines large-scale supervised fine-tuning and multi-stage reinforcement learning:
- Supervised fine-tuning (SFT): Over 1 million high-quality instruction-response pairs, encompassing long-sequence generation (up to 8,192 tokens), mathematical reasoning, code synthesis and execution validation, structured data (JSON, tabular), and cross-lingual robustness. SFT uses two epochs, a sequence length of 32,768, linear learning rate decay, weight decay of 0.1, and gradient clipping at norm 1.0.
- Offline RL (Direct Preference Optimization): 150,000 positive/negative pairs built from execution feedback and human review in math and code, optimized for one epoch with an Online Merging Optimizer.
- Online RL (Group Relative Policy Optimization): A reward model trained on a mixture of proprietary and public annotated instances (truthfulness, helpfulness, harmlessness, debiasing). The policy is optimized via GRPO, a PPO variant that replaces the learned value baseline with group-relative advantages:

  J(θ) = E[ (1/G) Σᵢ min( rᵢ(θ)·Aᵢ, clip(rᵢ(θ), 1−ε, 1+ε)·Aᵢ ) ] − β·D_KL(π_θ ‖ π_ref),

  where rᵢ(θ) = π_θ(oᵢ|q) / π_old(oᵢ|q) is the per-response importance ratio and Aᵢ = (Rᵢ − mean{R₁,…,R_G}) / std{R₁,…,R_G} is the advantage computed over a group of G sampled responses to the same prompt.
4. Empirical Performance and Cost-effectiveness
Qwen2.5-Plus demonstrates strong results on a wide set of automatic benchmarks, spanning English and Chinese comprehension, mathematics, coding, and reasoning. Comparative results on selected benchmarks:
Table: English Automatic Evaluation (%)
| Model | Instr. Following | Knowledge | Comprehension | Coding | Math | Reasoning |
|---|---|---|---|---|---|---|
| GPT-4o (2024-08-06) | 83.28 | 68.08 | 76.51 | 58.05 | 52.36 | 66.45 |
| GPT-4o (2024-11-20) | 80.06 | 65.25 | 79.07 | 60.19 | 49.74 | 67.07 |
| Claude3.5-sonnet (Oct) | 84.22 | 74.61 | 79.02 | 67.17 | 48.67 | 70.20 |
| Qwen2.5-Plus | 83.18 | 68.41 | 79.35 | 59.58 | 62.52 | 66.92 |
Table: Selected 70B+ Instruct Model Evaluation (%)
| Model | MMLU-Pro | MATH | GSM8K | HumanEval | IFEval | MTbench |
|---|---|---|---|---|---|---|
| Llama-3.1-70B | 66.4 | 68.0 | 95.1 | 80.5 | 83.6 | 8.79 |
| Llama-3.1-405B | 73.3 | 73.8 | 96.8 | 89.0 | 86.0 | 9.08 |
| Qwen2.5-72B | 71.1 | 83.1 | 95.8 | 86.6 | 84.1 | 9.35 |
| Qwen2.5-Plus | 72.5 | 84.7 | 96.0 | 87.8 | 86.3 | 9.30 |
Qwen2.5-Plus achieves near-parity with GPT-4o in both English and Chinese, often outperforming on mathematics and code-synthesis benchmarks (e.g., GSM8K: 93.0% vs. GPT-4o 87.4%; HumanEval: 59.1% vs. 58.0%). Cost analysis based on Alibaba Cloud Model Studio pricing indicates approximately double the cost-effectiveness relative to GPT-4o, with 2–3× the token throughput at comparable quality.
5. Inference, Quantization, and Deployment
MoE variants, including Qwen2.5-Plus, are exclusively available as hosted inference APIs; open-weight Qwen2.5-Instruct models up to 72B parameters are available with INT4 and FP8 quantized versions, but Qwen2.5-Plus quantization is not publicly released.
- Deployment footprint: Inference on a 4×A100 (80 GB) node requires approximately 96 GB for bf16 weights and 24 GB for the KV cache (supporting up to 128k context), with peak activation memory of ≈20 GB thanks to MoE sparse routing.
- Mixed-precision: Quality is maintained when expert computation is carried out in 16-bit precision, enabling mixed-precision acceleration.
- Public Access: No open-weight or quantized versions of the Qwen2.5-Plus MoE variant are published to date.
6. Specialized Variants and Application Domains
Qwen2.5-Plus underpins several domain-targeted and multimodal extensions:
- Qwen2.5-Math: Chain-of-thought math reasoning.
- Qwen2.5-Coder: Code generation in 40+ languages, execution-validated.
- QwQ: Extended step-by-step reasoning for hard scientific and mathematical questions.
- Qwen-VL: Vision-language modeling via multimodal extension.
Specific use cases emphasize complex multi-step math problem solving, competitive code synthesis, long-document summarization and Q&A up to 128k context, and high-precision bilingual instruction following.
7. Contextual Significance and Prospects
Qwen2.5-Plus represents an effective integration of MoE architectures into large-scale instruction-aligned LLMs, matching closed-weight SOTA models at substantially reduced compute and monetary cost. Its modular design, efficient scaling via sparse expert routing, and robust post-training protocol position it as a foundation for a new generation of LLM-powered agent, reasoning, and multimodal solutions (Qwen et al., 2024). As of its release, the absence of open weights or quantization for Qwen2.5-Plus distinguishes its deployment model from the rest of the Qwen2.5 open-weight track. A plausible implication is that continued progress in MoE architectures and their deployment could further expand the feasible compute-accuracy frontier for industrial-scale LLMs.