GLM-4.5: Sparse MoE Model for ARC Tasks
- GLM-4.5 is a large-scale, sparse Mixture-of-Experts model that uses hybrid reasoning modes to efficiently tackle agentic, reasoning, and coding tasks.
- It employs a multi-stage training pipeline over 23T tokens, loss-free balance MoE routing, expert model iteration, and reinforcement learning alignment to adapt its behavior to each task.
- The model achieves high ARC benchmark scores with superior parameter efficiency and features a compact GLM-4.5-Air variant for research-focused applications.
GLM-4.5 denotes an open-source, large-scale Mixture-of-Experts (MoE) foundation model architecture explicitly optimized for agentic, reasoning, and coding (ARC) tasks. It incorporates a hybrid reasoning paradigm, multi-stage MoE-centric training across 23T tokens, rigorous expert model iteration, and reinforcement learning alignment. The full model features 355B total parameters with only 32B activated per forward pass, supporting both deep “thinking” and direct response modes. GLM-4.5 achieves high performance across ARC benchmarks with superior parameter efficiency relative to comparably ranked competitors and offers a compact, research-oriented variant GLM-4.5-Air (106B/12B).
1. Model Architecture and MoE Design
GLM-4.5 is formulated as a sparse Mixture-of-Experts transformer with the following defining characteristics:
- Parameters and Activation: 355B total parameters, with only 32B activated per token. The compact variant (GLM-4.5-Air) uses 106B (12B activated).
- MoE Layer Routing: Loss-free balance routing with sigmoid gating for expert selection throughout the MoE layers; expert load is kept balanced without an auxiliary loss, a structural departure from previous GLM generations (see the sketch after this list).
- Width and Depth: Relative to previous models, GLM-4.5 trades width for depth: a smaller hidden dimension and fewer routed experts per layer, but more layers. Empirically, the added depth is strongly predictive of improved reasoning ability.
- Attention and Positional Encoding:
- Grouped-Query Attention (GQA) shares key-value heads across query groups, shrinking the KV cache for efficient long-context inference.
- Partial Rotary Position Embeddings (RoPE applied to a subset of each head's dimensions) support long-range token dependencies and context-length extension.
- The model features high attention head counts (e.g., 96 heads for d_hidden = 5120) and incorporates QK-Norm to stabilize attention logits.
- Speculative Decoding and MTP: An MoE-based Multi-Token Prediction (MTP) layer drafts several future tokens per step, enabling speculative decoding in which the full model verifies the draft in parallel (a minimal acceptance loop is sketched at the end of this section).
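To make the routing concrete, the sketch below implements sigmoid-gated top-k expert selection with a load-balancing bias in PyTorch. It is a minimal illustration of loss-free balance routing as published for other MoE models; the function names, shapes, bias-update rule, and `top_k=8` default are assumptions, not GLM-4.5's actual code.

```python
import torch

def route_tokens(hidden, gate_weight, expert_bias, top_k=8):
    """Sigmoid-gated top-k routing with a loss-free balancing bias.

    The bias shifts only which experts are *selected*; the combine
    weights come from the unbiased sigmoid scores, so no auxiliary
    balancing loss is needed.
    """
    # hidden: [tokens, d_model]; gate_weight: [experts, d_model]
    scores = torch.sigmoid(hidden @ gate_weight.T)            # [tokens, experts]
    _, expert_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    combine = torch.gather(scores, -1, expert_idx)            # unbiased gates
    combine = combine / combine.sum(dim=-1, keepdim=True)     # normalize
    return expert_idx, combine

def update_bias(expert_bias, tokens_per_expert, step=1e-3):
    """Online, gradient-free balance update: raise the bias of
    under-loaded experts and lower it for over-loaded ones."""
    load_error = tokens_per_expert.float().mean() - tokens_per_expert.float()
    return expert_bias + step * torch.sign(load_error)
```

Because the bias never touches the combine weights, balance is enforced without distorting the gradient signal that a conventional auxiliary load-balancing loss would introduce.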
Significance: This architectural design yields a high-compute, high-capacity transformer that remains tractable through dynamic sparsity. The increased layer count coupled with MoE sparsity delivers competitive or superior reasoning and agentic task performance without the cost scaling of traditional dense models.
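The speedup from MTP can be seen in a few lines: the MTP head cheaply drafts a block of future tokens, the full model scores the whole block in one forward pass, and the longest prefix agreeing with the full model's own predictions is kept. The greedy acceptance rule below is a generic speculative-decoding sketch, not GLM-4.5's exact verifier.

```python
def verify_draft(target_tokens, draft_tokens):
    """Greedy speculative verification: accept drafted tokens while they
    match the target model's argmax predictions; on the first mismatch,
    substitute the target's token and stop."""
    accepted = []
    for tgt, drf in zip(target_tokens, draft_tokens):
        if tgt != drf:
            accepted.append(tgt)   # target's correction ends the block
            break
        accepted.append(drf)
    return accepted

# Three of four drafted tokens match, so four tokens are emitted for
# the price of a single full-model forward pass.
print(verify_draft([11, 22, 33, 44], [11, 22, 33, 99]))  # [11, 22, 33, 44]
```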
2. Hybrid Reasoning Capability
A central innovation is the hybrid reasoning method:
- Thinking Mode: The model can perform internal, multi-step deliberation (e.g., via extended Chain-of-Thought outputs) for complex reasoning, mathematics, and code generation.
- Direct Response Mode: For queries that do not require step-wise logic, the model immediately generates succinct answers without emitting a deliberation trace.
- Training Paradigm: Hybrid reasoning is acquired by balancing supervised fine-tuning data that contains both fully explicit reasoning traces and direct answers, so the model learns to select the task-appropriate reasoning form at inference (a hypothetical mode-selection call is sketched below).
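As a concrete illustration, the snippet below shows how a caller might pick the response mode per request. The `enable_thinking` field, payload shape, and token budgets are hypothetical stand-ins; a real GLM-4.5 deployment exposes its own switch for this.

```python
def build_request(prompt: str, needs_deliberation: bool) -> dict:
    """Route hard queries to thinking mode and easy ones to direct mode."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # Thinking mode emits an extended chain of thought before the
        # final answer; direct mode skips it for latency-sensitive calls.
        "enable_thinking": needs_deliberation,       # hypothetical flag
        "max_tokens": 8192 if needs_deliberation else 512,
    }

print(build_request("Prove that sqrt(2) is irrational.", True))
print(build_request("What is the capital of France?", False))
```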
Contextual Importance: This reflects a dual-process cognitive paradigm. The ability to interleave or select between deep deliberation and rapid inference is particularly advantageous for LLM-as-agent applications, where flexible context management and fast decision-making are prized.
3. Multi-Stage Pretraining and Long-Context Alignment
GLM-4.5 is trained across several sequential stages:
- Massive Pretraining: The initial stage ingests 23T tokens drawn from web text, books, social media, and multilingual and code corpora. Sequence lengths are increased in stages, from 4K to 32K and finally 128K, to instill long-context capability (an illustrative staging follows this list).
- Mid-Training: Domain-specific data (e.g., programming code, mathematical reasoning, scientific literature) are up-sampled during mid-training phases.
- Long-Context Conditioning: The architecture is explicitly aligned for handling extended contexts—supporting faithful reasoning and stateful agentic behaviors over extensive input histories.
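An illustrative staging of this curriculum is sketched below; only the 4K/32K/128K sequence lengths come from the text, while the stage names, token packing, and RoPE bases are placeholder assumptions.

```python
# Hypothetical sequence-length curriculum matching the 4K -> 32K -> 128K
# progression described above; RoPE bases are invented placeholders.
PRETRAIN_STAGES = [
    {"name": "base pretraining", "seq_len": 4_096,   "rope_theta": 10_000},
    {"name": "mid-training",     "seq_len": 32_768,  "rope_theta": 1_000_000},
    {"name": "long-context",     "seq_len": 131_072, "rope_theta": 1_000_000},
]

for stage in PRETRAIN_STAGES:
    print(f"{stage['name']}: pack documents into {stage['seq_len']:,}-token "
          f"sequences (RoPE base {stage['rope_theta']:,})")
```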
Significance: These steps are foundational for enabling agentic trajectories and robust context tracking in agent and reasoning settings.
4. Post-Training: Expert Iteration and Reinforcement Learning
Post-training proceeds in two main stages:
- Expert Model Iteration: Separate expert models are trained to specialize (e.g., in reasoning, agent/tool invocation, or conversational chat) via SFT on data containing extended reasoning chains and realistic tool-use examples.
- Unified Self-Distillation and RL Alignment: The specialized expert models are integrated via self-distillation, then further refined by reinforcement learning (RL) steps. Reward structures are diverse:
- For agentic tasks, mean-reward maximization over sampled outputs is used:

$\bar{R} = \frac{1}{N}\sum_{i=1}^{N} R(y_i),$

with $\bar{R}$ the mean reward across the $N$ sampled outputs $y_i$.
- For function-calling/tool-use tasks, a binary reward is used:

$\text{Reward} = \begin{cases} 1, & \text{FormatCorrect}(a_t) \wedge \text{Match}(a_t, a^*_t) \\ 0, & \text{otherwise} \end{cases}$

where $a_t$ is the model's emitted action and $a^*_t$ the ground-truth action (a minimal verifier sketch follows this list).
- Iterative Enhancement: Multiple cycles of self-distillation and RL improve the model’s reasoning, coding, and agentic policy decisions—enabling reliable planning, tool usage, and structured output.
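The binary reward is simple to implement. The sketch below assumes tool calls are emitted as JSON, an illustrative simplification; the real verifier's format and matching checks are richer.

```python
import json

def tool_call_reward(action: str, ground_truth: dict) -> int:
    """Reward = 1 iff the emitted call parses (FormatCorrect) and equals
    the reference call (Match); otherwise 0."""
    try:
        call = json.loads(action)          # FormatCorrect(a_t)
    except json.JSONDecodeError:
        return 0
    return int(call == ground_truth)       # Match(a_t, a*_t)

gt = {"name": "search", "arguments": {"query": "GLM-4.5"}}
print(tool_call_reward(json.dumps(gt), gt))    # 1: well-formed and matching
print(tool_call_reward("search(GLM-4.5", gt))  # 0: malformed call
```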
Significance: RL and self-distillation over expert-bootstrapped outputs directly contribute to the model’s ability to self-improve on agentic and multi-step workflows.
5. ARC Benchmark Performance and Parameter Efficiency
Comprehensive evaluation establishes GLM-4.5 near the state-of-the-art:
| Benchmark | Score | Task Domain |
|---|---|---|
| TAU-Bench | 70.1% | Agentic |
| AIME 24 | 91.0% | Reasoning |
| SWE-bench Verified | 64.2% | Coding |
- Ranking: 3rd overall on aggregate ARC task performance; 2nd place on agentic benchmarks among models evaluated.
- Parameter Efficiency: Outperforms or matches much larger models (e.g., DeepSeek-R1 at 671B and Kimi K2 at 1043B total parameters) while activating only 32B parameters per token.
- Compact Variant: GLM-4.5-Air (106B/12B) supports resource-constrained research use cases with minimal loss in performance, extending accessibility.
Significance: These numerical results demonstrate that MoE-based hybrid reasoning models can achieve high performance with reduced inference and storage cost, even at large parameter scales.
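For concreteness, the sparsity ratios follow directly from the quoted figures:

$\frac{32\text{B activated}}{355\text{B total}} \approx 9.0\%, \qquad \frac{12\text{B activated}}{106\text{B total}} \approx 11.3\%$

That is, roughly a tenth of the parameters participate in any given forward pass, whereas a dense model of equal capacity would activate all of them.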
6. Implications for Agentic AI, Reasoning, and Tool Use
GLM-4.5’s design and training pipeline enable:
- Agentic Behavior: Integrating with external APIs (web search, code execution, tool use), the model plans and executes complex multi-step trajectories (a schematic loop is sketched after this list).
- Long-Context and Multi-Step Reasoning: The architecture supports agent operations that maintain state, plan multi-turn strategies, and adaptively select response depth.
- Parameter Efficiency for Research: GLM-4.5-Air (editor’s term: the “compact MoE variant”) enables cost-effective experimentation on extended-context, agentic-simulation, and tool-use benchmarks.
- Research Testbed: Iterative RL and expert unification provide a platform for studying hybrid reasoning optimization, agentic policies, and the practical impacts of expert–height/width trade-offs.
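A schematic of the agentic loop referenced above: the model alternates between tool calls and observations until it produces a final answer. The `chat` callable, message schema, and tool registry are hypothetical stand-ins for a real endpoint and real tools.

```python
def run_agent(chat, tools, user_goal, max_turns=8):
    """Plan-act-observe loop: each turn the model either calls a tool
    (which is executed and fed back) or returns a final answer."""
    history = [{"role": "user", "content": user_goal}]
    for _ in range(max_turns):
        reply = chat(history)                       # model picks next step
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]                 # final answer
        result = tools[call["name"]](**call["arguments"])
        history.append({"role": "assistant", "tool_call": call})
        history.append({"role": "tool", "name": call["name"],
                        "content": result})
    return "Stopped: turn limit reached."
```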
This suggests the GLM-4.5 approach is poised to advance large-scale agentic reasoning—both in academic research focused on efficiency and flexibility, and in applied settings requiring high reliability with tractable resource consumption.
7. Future Directions and Open Research
Potential future advances indicated by the GLM-4.5 framework include:
- Further refinement of hybrid reasoning: dynamically determining “thinking” versus “direct” response mode conditioned on task structure and user intent.
- Longer-context adaptation: progressively extending sequence length and optimizing positional encoding for agentic, multi-modal, and long-form writing workloads.
- Next-generation agentic RL: integrating more realistic environment feedback and toolchains to model multi-agent systems, planning under uncertainty, and continual learning.
A plausible implication is that advances realized within GLM-4.5 may generalize to broader classes of MoE-based, hybrid-reasoning architectures, setting new directions for research in parameter-efficient foundation models, RL-enhanced reasoning, and scalable agentic AI.