GLM-4.5: Sparse MoE Model for ARC Tasks

Updated 11 August 2025
  • GLM-4.5 is a large-scale, sparse Mixture-of-Experts model that uses hybrid reasoning modes to efficiently tackle agentic, reasoning, and coding tasks.
  • It employs a multi-stage training pipeline over 23T tokens, loss-free balanced MoE routing, expert-model iteration, and reinforcement learning alignment to achieve adaptive, task-appropriate performance.
  • The model achieves high ARC benchmark scores with superior parameter efficiency and features a compact GLM-4.5-Air variant for research-focused applications.

GLM-4.5 is an open-source, large-scale Mixture-of-Experts (MoE) foundation model explicitly optimized for agentic, reasoning, and coding (ARC) tasks. It incorporates a hybrid reasoning paradigm, multi-stage MoE-centric training across 23T tokens, rigorous expert model iteration, and reinforcement learning alignment. The full model features 355B total parameters with only 32B activated per forward pass, supporting both deep “thinking” and direct response modes. GLM-4.5 achieves high performance across ARC benchmarks with superior parameter efficiency relative to comparably ranked competitors and offers a compact, research-oriented variant GLM-4.5-Air (106B/12B).

1. Model Architecture and MoE Design

GLM-4.5 is formulated as a sparse Mixture-of-Experts transformer with the following defining characteristics:

  • Parameters and Activation: 355B total parameters, with only 32B activated per token. The compact variant (GLM-4.5-Air) uses 106B (12B activated).
  • MoE Layer Routing: Loss-free balance routing using sigmoid gating for expert selection throughout MoE layers. This routing aims to maintain expert load balance without an auxiliary balancing loss and is structurally distinct from previous GLM generations (a schematic sketch of such gating follows this list).
  • Width and Depth: Relative to previous models, GLM-4.5 reduces model width (smaller hidden dimension and fewer experts per layer) and increases depth (more layers). Empirical observations indicate that greater depth is strongly predictive of improved reasoning ability.
  • Attention and Positional Encoding:
    • Grouped-Query Attention (GQA) is employed for efficient context handling.
    • Partial Rotary Position Embeddings (RoPE) extend contextualization over long-range token dependencies.
    • The model features high attention head counts (e.g., 96 heads for d_hidden = 5120) and incorporates QK-Norm to stabilize attention logits.
  • Speculative Decoding and MTP: An MoE-based Multi-Token Prediction (MTP) layer facilitates efficient speculative decoding for parallel token generation during inference.
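
To make the routing concrete, the snippet below is a minimal PyTorch sketch of sigmoid-gated, bias-balanced top-k expert selection. The expert count, top-k value, bias update rate, and tensor shapes are illustrative assumptions rather than GLM-4.5's published configuration, and the bias-based balancing shown here follows the auxiliary-loss-free strategy popularized by DeepSeek-V3; it is assumed, not confirmed, to match GLM-4.5's exact scheme.

```python
import torch

# Minimal sketch of sigmoid-gated, bias-balanced top-k MoE routing.
# Expert count, top-k, and the bias update rate are illustrative
# assumptions, not GLM-4.5's published configuration.
d_hidden, n_experts, top_k = 5120, 160, 8
gate = torch.nn.Linear(d_hidden, n_experts, bias=False)
balance_bias = torch.zeros(n_experts)          # adjusted online, not via a loss term
tokens = torch.randn(16, d_hidden)             # a batch of token activations

# Sigmoid gating: each expert receives an independent affinity score in (0, 1).
scores = torch.sigmoid(gate(tokens))

# Expert selection uses scores shifted by the balancing bias; the bias only
# affects *which* experts are picked, not the weights applied to their outputs.
topk_idx = torch.topk(scores + balance_bias, k=top_k, dim=-1).indices
weights = torch.gather(scores, -1, topk_idx)
weights = weights / weights.sum(dim=-1, keepdim=True)   # normalized combine weights

# Loss-free balancing: nudge the bias down for overloaded experts and up for
# underloaded ones, so future tokens spread out without an auxiliary loss.
load = torch.zeros(n_experts).index_add_(
    0, topk_idx.flatten(), torch.ones(topk_idx.numel()))
balance_bias -= 0.001 * torch.sign(load - load.mean())

print(topk_idx.shape, weights.shape)   # (16, 8), (16, 8)
```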

Significance: This architectural design yields a high-compute, high-capacity transformer that remains tractable through dynamic sparsity. The increased layer count coupled with MoE sparsity delivers competitive or superior reasoning and agentic task performance without the cost scaling of traditional dense models.

2. Hybrid Reasoning Capability

A central innovation is the hybrid reasoning method:

  • Thinking Mode: The model can perform internal, multi-step deliberation (e.g., via extended Chain-of-Thought outputs) for complex reasoning, mathematics, and code generation.
  • Direct Response Mode: For queries that do not require step-wise logic, the model quickly generates succinct responses without context expansion.
  • Training Paradigm: Hybrid reasoning ability is acquired by balancing supervised fine-tuning data containing both fully explicit reasoning traces and direct answers, enabling the model to select the most task-appropriate reasoning form at inference (see the interface sketch after this list).
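
As a rough illustration of how hybrid reasoning might be exposed at inference time, the sketch below toggles between the two modes by pre-closing a deliberation block and strips internal reasoning from the visible answer. The <think> tag convention, prompt markers, and generate() stub are assumptions for illustration, not GLM-4.5's documented serving interface.

```python
# Minimal sketch of a hybrid-reasoning interface: the same model serves both
# a deliberative "thinking" mode and a terse direct mode. The tag format,
# flag name, and generate() stub are illustrative assumptions.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def build_prompt(user_msg: str, enable_thinking: bool) -> str:
    """Format a single-turn prompt; in direct mode, a pre-closed think block
    signals the model to answer immediately without deliberation."""
    prompt = f"<|user|>\n{user_msg}\n<|assistant|>\n"
    if not enable_thinking:
        prompt += f"{THINK_OPEN}{THINK_CLOSE}\n"
    return prompt

def extract_answer(completion: str) -> str:
    """Drop any internal deliberation and return only the visible answer."""
    if THINK_CLOSE in completion:
        return completion.split(THINK_CLOSE, 1)[1].strip()
    return completion.strip()

def generate(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a canned thinking-mode output."""
    return f"{THINK_OPEN}step 1 ... step n{THINK_CLOSE}\n42"

print(extract_answer(generate(build_prompt("What is 6 * 7?", enable_thinking=True))))
```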

Contextual Importance: This property reflects a dual-mode cognitive design. The ability to efficiently interleave or select between deep reasoning and rapid inference is particularly advantageous for LLM-as-agent applications where flexible context management and quick decision-making are prized.

3. Multi-Stage Pretraining and Long-Context Alignment

GLM-4.5 is trained across several sequential stages:

  • Massive Pretraining: The initial stage ingests 23T tokens sourced from the web, books, social media, and multilingual/code corpora. Sequence lengths are incrementally increased (from 4K to 32K and finally 128K) to instill long-context capabilities (a schematic training schedule is sketched after this list).
  • Mid-Training: Domain-specific data (e.g., programming code, mathematical reasoning, scientific literature) are up-sampled during mid-training phases.
  • Long-Context Conditioning: The architecture is explicitly aligned for handling extended contexts—supporting faithful reasoning and stateful agentic behaviors over extensive input histories.
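
A schematic view of the staged schedule is given below; only the 4K to 32K to 128K progression is taken from the description above, while the stage names and data-mixture labels are illustrative placeholders rather than the actual recipe.

```python
from dataclasses import dataclass

# Sketch of a staged pretraining schedule with progressive context extension.
# The 4K -> 32K -> 128K progression follows the description above; stage names
# and data-mixture labels are illustrative placeholders, not the real recipe.
@dataclass
class Stage:
    name: str
    seq_len: int      # training sequence length in tokens
    data_mix: str     # dominant data sources at this stage

SCHEDULE = [
    Stage("stage-1-4k",   4_096,   "web, books, social media, multilingual, code"),
    Stage("stage-2-32k",  32_768,  "same mixture, longer documents retained"),
    Stage("stage-3-128k", 131_072, "up-sampled code, math, and scientific text"),
]

for stage in SCHEDULE:
    print(f"{stage.name:>13}: seq_len={stage.seq_len:>7,}  mix={stage.data_mix}")
```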

Significance: These steps are foundational for enabling agentic trajectories and robust context tracking in agent and reasoning settings.

4. Post-Training: Expert Iteration and Reinforcement Learning

Post-training proceeds in two main stages:

  • Expert Model Iteration: Specialized expert models are trained to excel in particular domains (e.g., reasoning, agent/tool invocation, or conversational chat). This is achieved by SFT using training data with extended reasoning chains as well as realistic tool-use examples.
  • Unified Self-Distillation and RL Alignment: The specialized expert models are integrated via self-distillation, then further refined by reinforcement learning (RL) steps. Reward structures differ by task (both forms are illustrated in code after this list):

    • For agentic tasks, mean reward maximization is used:

    $L_{RL}(\theta) = \mathbb{E}_{x \sim \mathcal{D}}\left[\frac{1}{K} \sum_{i=1}^{K} \left( r(x, y_i) - \bar{r}(x) \right) \right]$

    where $\bar{r}(x)$ is the mean reward over the $K$ sampled outputs $y_i$ for prompt $x$.

    • For function-calling/tool-use tasks, a binary reward is used:

    $Reward = \begin{cases} 1, & \text{FormatCorrect}(a_t) \wedge \text{Match}(a_t, a^*_t) \\ 0, & \text{otherwise} \end{cases}$

    where $a_t$ is the model action and $a^*_t$ the ground-truth action.

  • Iterative Enhancement: Multiple cycles of self-distillation and RL improve the model’s reasoning, coding, and agentic policy decisions—enabling reliable planning, tool usage, and structured output.
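
To make the two reward forms concrete, the following sketch computes group-relative advantages against the mean baseline $\bar{r}(x)$ and a binary format-and-match reward for a tool call. The group size, reward values, and format check are illustrative assumptions.

```python
import statistics

# --- Mean-baseline (group-relative) reward for agentic tasks -----------------
# For each prompt x, K candidate outputs y_1..y_K are sampled and scored; the
# advantage of each is its reward minus the group mean r_bar(x).
def group_relative_advantages(rewards: list[float]) -> list[float]:
    r_bar = statistics.mean(rewards)
    return [r - r_bar for r in rewards]

sampled_rewards = [0.9, 0.4, 0.7, 0.1]              # illustrative K=4 rollout scores
print(group_relative_advantages(sampled_rewards))   # [0.375, -0.125, 0.175, -0.425]

# --- Binary reward for function-calling / tool-use ---------------------------
# Reward is 1 only if the emitted action both parses into the expected format
# and matches the ground-truth call; otherwise 0.
def format_correct(action: dict) -> bool:
    return isinstance(action.get("name"), str) and isinstance(action.get("args"), dict)

def tool_call_reward(action: dict, ground_truth: dict) -> int:
    return int(format_correct(action) and action == ground_truth)

a_t      = {"name": "search", "args": {"query": "GLM-4.5 MoE"}}
a_t_star = {"name": "search", "args": {"query": "GLM-4.5 MoE"}}
print(tool_call_reward(a_t, a_t_star))              # 1
```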

Significance: RL and self-distillation over expert-bootstrapped outputs directly contribute to the model’s ability to self-improve in agentic and multi-modal workflows.

5. ARC Benchmark Performance and Parameter Efficiency

Comprehensive evaluation establishes GLM-4.5 near the state-of-the-art:

| Benchmark          | Score | Task Domain |
|--------------------|-------|-------------|
| TAU-Bench          | 70.1% | Agentic     |
| AIME 24            | 91.0% | Reasoning   |
| SWE-bench Verified | 64.2% | Coding      |
  • Ranking: 3rd overall on aggregate ARC task performance; 2nd place on agentic benchmarks among models evaluated.
  • Parameter Efficiency: Outperforms or competes with much larger models (e.g., DeepSeek-R1 at 671B and Kimi K2 at 1043B) with significantly fewer activated parameters (32B).
  • Compact Variant: GLM-4.5-Air (106B/12B) supports resource-constrained research use cases with minimal loss in performance, extending accessibility.

Significance: These numerical results demonstrate that MoE-based hybrid reasoning models can achieve high performance with reduced inference and storage cost, even at large parameter scales.

6. Implications for Agentic AI, Reasoning, and Tool Use

GLM-4.5’s design and training pipeline enable:

  • Agentic Behavior: Integrating with external APIs (web search, code execution, tool use), the model plans and executes complex multi-step trajectories (a minimal plan-act loop is sketched after this list).
  • Long-Context and Multi-Step Reasoning: The architecture supports agent operations that maintain state, plan multi-turn strategies, and adaptively select response depth.
  • Parameter Efficiency for Research: GLM-4.5-Air (“editor’s term”: compact MoE variant) enables cost-effective experimentation in extended context, agentic simulation, and tool-use benchmarks.
  • Research Testbed: Iterative RL and expert unification provide a platform for studying hybrid reasoning optimization, agentic policies, and the practical impacts of expert–height/width trade-offs.
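
A minimal plan-act loop, written with assumed tool names and a stubbed model call, illustrates the kind of agentic trajectory described above; GLM-4.5's actual tool-calling protocol is not reproduced here.

```python
# Minimal sketch of an agentic plan-act loop: the model alternates between
# emitting tool calls and receiving observations until it produces a final
# answer. Tool names, the stop convention, and call_model() are hypothetical.
TOOLS = {
    "web_search": lambda q: f"(stub) top results for {q!r}",
    "run_code":   lambda src: f"(stub) executed {len(src)} chars of code",
}

def call_model(history: list[dict]) -> dict:
    """Placeholder for an LLM call; returns either a tool call or a final answer."""
    if len(history) < 3:                      # pretend the model wants tool calls first
        return {"tool": "web_search", "input": "GLM-4.5 ARC benchmarks"}
    return {"final": "GLM-4.5 ranks near the top on agentic benchmarks."}

def run_agent(task: str, max_steps: int = 8) -> str:
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = call_model(history)
        if "final" in action:                 # model chose to answer directly
            return action["final"]
        observation = TOOLS[action["tool"]](action["input"])
        history.append({"role": "tool", "content": observation})
    return "step budget exhausted"

print(run_agent("Summarize GLM-4.5's agentic performance."))
```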

This suggests the GLM-4.5 approach is poised to advance large-scale agentic reasoning—both in academic research focused on efficiency and flexibility, and in applied settings requiring high reliability with tractable resource consumption.

7. Future Directions and Open Research

Potential future advances indicated by the GLM-4.5 framework include:

  • Further refinement of hybrid reasoning: dynamically determining “thinking” versus “direct” response mode conditioned on task structure and user intent.
  • Longer context adaptation: progressively stretching sequence length and optimizing positional encoding for agentic, multi-modal, and long-form writing problems.
  • Next-generation agentic RL: integrating more realistic environment feedback and toolchains to model multi-agent systems, planning under uncertainty, or continually learning behaviors.

A plausible implication is that advances realized within GLM-4.5 may generalize to broader classes of MoE-based, hybrid-reasoning architectures, setting new directions for research in parameter-efficient foundation models, RL-enhanced reasoning, and scalable agentic AI.
