GLM-4.5-Air: Compact MoE LLM for ARC Tasks

Updated 11 August 2025
  • GLM-4.5-Air is a compact, open-source mixture-of-experts large language model characterized by selective expert activation, innovative transformer design, and multi-stage training.
  • It delivers competitive performance on reasoning, coding, and agentic tasks by employing dynamic expert routing and hybrid inference modes.
  • Its flexible training pipeline and efficient architecture make it ideal for real-world applications like coding assistance, long-context summarization, and tool-directed operations.

GLM-4.5-Air is a compact, open-source Mixture-of-Experts (MoE) LLM developed as part of the GLM-4.5 series with emphasis on agentic, reasoning, and coding (ARC) tasks. Designed to deliver substantial reasoning and agentic ability with high parameter efficiency, GLM-4.5-Air implements key innovations in MoE transformer architecture, progressive multi-stage training, and task-specific post-training. It is optimized to serve both as a high-performance research benchmark and a practical agentic LLM for real-world applications.

1. Model Architecture and MoE Design

GLM-4.5-Air employs a compact transformer-based MoE architecture comprising 106 billion total parameters with 12 billion activated parameters per forward pass. The model integrates 45 MoE transformer layers, each containing 128 expert modules. Only 8 experts are routed per token, selected via learned sigmoid gating mechanisms with loss-free balance routing to maintain uniform expert utilization. Unlike the full GLM-4.5 variant (355B parameters, 32B activated), GLM-4.5-Air reduces hidden dimensions and dense layers (only a single dense layer versus three), but maintains a deep layer stack to maximize reasoning capability.
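For orientation, these headline figures imply that only about 11% of the parameters are active for any given token. A minimal sketch of the shape reported above (field names are illustrative, not the official configuration schema):

```python
from dataclasses import dataclass

@dataclass
class GLM45AirShape:
    """Headline figures for GLM-4.5-Air as reported above.
    Field names are illustrative, not the official config schema."""
    total_params: int = 106_000_000_000   # 106B total parameters
    active_params: int = 12_000_000_000   # 12B activated per forward pass
    moe_layers: int = 45                  # MoE transformer layers
    experts_per_layer: int = 128          # expert modules per MoE layer
    experts_per_token: int = 8            # experts routed per token

shape = GLM45AirShape()
print(f"Active fraction: {shape.active_params / shape.total_params:.1%}")  # ~11.3%
print(f"Expert sparsity: {shape.experts_per_token}/{shape.experts_per_layer} routed per token")
```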

Key architectural features include:

  • Grouped-query attention with partial rotary position embeddings (RoPE), enhancing long-context handling.
  • Expanded attention heads: 96 heads with head dimension 128, empirically shown to boost reasoning accuracy.
  • An MoE Multi-Token Prediction (MTP) output layer that supports speculative decoding.
  • Flexible operation modes: a hybrid "thinking" (deliberative) mode and a "direct response" (fast inference) mode to suit different task demands.
  • Mixture-of-experts computation is abstractly represented as:

y = \sum_{i=1}^{E} G(x)_i f_i(x)

where x is the input, f_i are the expert transformations, G(x)_i are the gating weights (nonzero only for the 8 routed experts), and E = 128.
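A minimal PyTorch sketch of this top-8 sigmoid-gated computation (the module layout and feed-forward shape are illustrative, and the loss-free balance routing described above is omitted for brevity):

```python
import torch
import torch.nn as nn

class SigmoidTopKMoE(nn.Module):
    """Illustrative top-k MoE layer: sigmoid gate scores route each token
    to k of E experts (loss-free balance routing omitted for brevity)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 128, k: int = 8):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = torch.sigmoid(self.gate(x))              # G(x): (tokens, E)
        weights, indices = scores.topk(self.k, dim=-1)    # keep the k routed experts
        out = torch.zeros_like(x)
        # y = sum_i G(x)_i * f_i(x), with G(x)_i nonzero only for routed experts
        for slot in range(self.k):
            idx, w = indices[:, slot], weights[:, slot].unsqueeze(-1)
            for e in idx.unique().tolist():
                mask = idx == e
                out[mask] += w[mask] * self.experts[e](x[mask])
        return out
```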

2. Multi-Stage Training and Post-Training Procedures

GLM-4.5-Air training is characterized by a tiered strategy:

  • Pre-training utilizes a 23-trillion-token corpus with a maximum sequence length of 4,096 tokens. The objective is standard language modeling across a wide range of domains, including general web, code, math, and science content.
  • Mid-training targets reasoning and agentic skills by introducing curated domain and instruction data, expanding sequence lengths to 32K and eventually 128K tokens using best-fit data packing (a simplified packing sketch follows this list). This phase is critical for extended-context reasoning and denser instruction chains.
  • Post-training includes:
    • Supervised Fine-Tuning (SFT): Long chain-of-thought completions, agentic dialog templates, and explicit function call data.
    • Domain-Specific Reinforcement Learning (RL): Fine-tuning on reasoning/coding (AIME, SWE-bench) and tool-using tasks (function/terminal calls, web browsing).
    • Expert Iteration and Self-Distillation: Aggregates skills from various fine-tuned models into a unified agent capable of both slow, deliberative "thought" and immediate direct responses.
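The best-fit data packing used in mid-training can be sketched as a bin-packing pass that fills fixed-length training sequences with whole documents, minimizing wasted context. A simplified best-fit-decreasing sketch (the actual pipeline's tie-breaking and truncation rules are not specified in the source):

```python
def best_fit_pack(doc_lengths, seq_len=32_768):
    """Pack documents (by token length) into fixed-capacity sequences.

    Each document goes into the open sequence whose remaining space fits it
    most tightly; a new sequence is opened when none fits.
    Returns a list of [remaining_space, packed_lengths] bins.
    """
    bins = []
    for length in sorted(doc_lengths, reverse=True):   # best-fit decreasing
        length = min(length, seq_len)                  # clip oversized documents
        best = None
        for b in bins:
            if length <= b[0] and (best is None or b[0] < best[0]):
                best = b
        if best is None:                               # open a new sequence
            best = [seq_len, []]
            bins.append(best)
        best[0] -= length
        best[1].append(length)
    return bins

packed = best_fit_pack([30_000, 28_000, 5_000, 2_500, 1_000])
print(len(packed))  # 3 sequences for this toy mix of document lengths
```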

The training pipeline maximizes coverage over reasoning-intensive and agentic patterns via dynamic expert routing and extended context learning.

3. Performance on Reasoning, Coding, and Agentic Tasks

GLM-4.5-Air demonstrates competitive performance relative to much larger models across ARC benchmarks:

  • TAU-bench (agentic tasks): Scores 77.9% (TAU-Retail) and 60.8% (TAU-Airline), confirming strong agentic behavior and external tool handling.
  • AIME 24 (mathematical reasoning): Achieves 89.4% (vs. 91.0% for GLM-4.5). This demonstrates robust quantitative reasoning.
  • SWE-bench Verified (coding): Scores 57.6%. Although below the full 355B GLM-4.5 (64.2%), it surpasses many larger-scale open and proprietary models in code generation/modification tasks.
  • Aggregate ARC Ranking: GLM-4.5-Air is ranked 6th overall across combined agentic, reasoning, and coding evaluations—remarkable given the parameter budget.
  • Performance is consistently above that of other 100B-scale open models, matching or exceeding much larger baselines on common agentic and reasoning benchmarks.

4. Applications and Use Cases

GLM-4.5-Air is optimized for scenarios that require both multi-step reasoning and complex agentic actions:

  • Hybrid Reasoning System: Supports both "chain-of-thought" generation and rapid direct answer completion. The model can dynamically alternate between slow, multi-turn deliberation and expedited completion depending on prompt context and instruction format (a request-level sketch follows this list).
  • Agentic Task Execution: Excels at explicit function calls, code execution, and web browsing; includes explicit output format constraints and penalties to ensure process alignment (e.g., correct tool call syntax).
  • Coding Assistance: Performs end-to-end GitHub issue resolution and codebase modifications, making it well suited for integration into continuous integration/continuous deployment (CI/CD) workflows and advanced code reasoning assistants.
  • Long Context Tasks: Demonstrates high fidelity in summarizing lengthy documents, multi-domain translations, and contextual chain-of-thought explanations with sequence lengths up to 128K tokens (during RL-enhanced stages).
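As an illustration of how the hybrid modes are typically exposed to applications, here is a hedged sketch of a chat-completion request that toggles deliberative "thinking" versus direct response. The endpoint URL, model id, and `thinking` field are hypothetical placeholders, not the documented GLM-4.5-Air API:

```python
import json
import urllib.request

def chat(prompt: str, thinking: bool, url="http://localhost:8000/v1/chat/completions"):
    """Send one chat turn; `thinking` is a hypothetical switch between
    deliberative chain-of-thought and fast direct-response inference."""
    payload = {
        "model": "glm-4.5-air",           # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "thinking": thinking,             # hypothetical hybrid-mode flag
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Slow, multi-step deliberation for a hard reasoning task:
# chat("Prove that the sum of two odd numbers is even.", thinking=True)
# Fast direct completion for a lookup-style query:
# chat("What is the capital of France?", thinking=False)
```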

5. Availability, Reproducibility, and Integration Resources

  • Open Access: Model weights, codebase, and detailed documentation are available through multiple platforms (e.g., Z.ai, BigModel.cn, Hugging Face at https://huggingface.co/zai-org/GLM-4.5); a minimal loading sketch follows this list.
  • Evaluation Toolkit: A standardized evaluation toolkit for ARC benchmarks and custom agentic or reasoning tasks is open-sourced (https://github.com/zai-org/glm-simple-evals).
  • Usage Documentation: Release includes turnkey inference pipelines, guidance for extended context operation, environment setup scripts, and fine-tuning protocols for custom domains and tasks.
  • Supporting Materials: Research artifacts and replication instructions are included to promote transparent benchmarking and community-driven research.
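A minimal loading sketch with Hugging Face `transformers` (the `zai-org/GLM-4.5-Air` repository id is inferred from the GLM-4.5 link above; check the model card for the exact id and recommended settings):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.5-Air"  # assumed repo id; see the Hugging Face link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # shard across available GPUs (requires accelerate);
)                         # trust_remote_code=True may be needed on older transformers

inputs = tokenizer(
    "Write a Python function that reverses a string.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```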

6. Comparative Analysis and Broader Significance

Relative to both the predecessor GLM-4.5 (355B) and contemporary 100B/175B models, GLM-4.5-Air achieves substantial efficiency and flexibility:

  • Efficiency: The one-dense-layer design, selective expert activation, and lightweight routing enable inference costs well below those of monolithic dense-model counterparts. This makes the model suitable for real-time and resource-constrained deployments.
  • Performance: Despite parameter compression, GLM-4.5-Air closely tracks the full GLM-4.5 in reasoning (AIME), coding (SWE-bench), and agentic tasks (TAU-bench). It surpasses comparable 100B+ open models and even rivals larger proprietary LLMs in several domains.
  • Agentic Capabilities: The hybrid reasoning mode and highly structured post-training make the model adept at multi-step planning, tool use, and context-sensitive agentic decision-making, a property uncommon in models at this parameter scale.

This model represents a significant advance in efficient, multitask LLMs and provides a reproducible and extensible foundation for research in agentic and reasoning-intensive natural language processing (Team et al., 8 Aug 2025).
