GLM-4.5-Air: Compact MoE LLM for ARC Tasks
- GLM-4.5-Air is a compact, open-source mixture-of-experts large language model combining selective expert activation, a deep but parameter-efficient transformer design, and multi-stage training.
- It delivers competitive performance on reasoning, coding, and agentic tasks through dynamic expert routing and hybrid "thinking"/"direct response" inference modes.
- Its flexible training pipeline and efficient architecture suit real-world applications such as coding assistance, long-context summarization, and tool-driven agentic workflows.
GLM-4.5-Air is a compact, open-source Mixture-of-Experts (MoE) LLM developed as part of the GLM-4.5 series with emphasis on agentic, reasoning, and coding (ARC) tasks. Designed to deliver substantial reasoning and agentic ability with high parameter efficiency, GLM-4.5-Air implements key innovations in MoE transformer architecture, progressive multi-stage training, and task-specific post-training. It is optimized to serve both as a high-performance research benchmark and a practical agentic LLM for real-world applications.
1. Model Architecture and MoE Design
GLM-4.5-Air employs a compact transformer-based MoE architecture comprising 106 billion total parameters with 12 billion activated parameters per forward pass. The model integrates 45 MoE transformer layers, each containing 128 expert modules. Only 8 experts are routed per token, selected via learned sigmoid gating mechanisms with loss-free balance routing to maintain uniform expert utilization. Unlike the full GLM-4.5 variant (355B parameters, 32B activated), GLM-4.5-Air reduces hidden dimensions and dense layers (only a single dense layer versus three), but maintains a deep layer stack to maximize reasoning capability.
Key architectural features include:
- Grouped-Query Attention with partial rotary position embeddings (RoPE), improving long-context handling.
- Expanded Attention Heads: 96 heads with head dimension 128, empirically shown to boost reasoning accuracy.
- An MoE Multi-Token Prediction (MTP) output layer, which supports speculative decoding at inference time.
- Flexible operation modes: a hybrid "thinking" (deliberative) mode and a "direct response" (fast inference) mode to suit different task demands.
- Mixture-of-experts computation is abstractly represented as $y = \sum_{i=1}^{128} g_i(x)\,E_i(x)$, where $x$ is the input, $E_i$ are the expert transformations, and $g_i(x)$ are gating weights that are nonzero only for the 8 routed experts and satisfy $\sum_i g_i(x) = 1$. A minimal routing sketch follows this list.
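To make the routing concrete, here is a minimal PyTorch sketch of top-8 sigmoid gating over 128 experts, matching the formula above. The toy hidden sizes, the expert MLP shape, and the naive dispatch loop are illustrative simplifications, and the paper's loss-free balance bias term is omitted.

```python
import torch
import torch.nn as nn

class TopKSigmoidMoE(nn.Module):
    """Toy MoE layer: sigmoid router scores, top-8 of 128 experts per token."""

    def __init__(self, d_model=256, d_ff=512, n_experts=128, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        scores = torch.sigmoid(self.router(x))             # sigmoid gating, not softmax
        top_vals, top_idx = scores.topk(self.top_k, -1)    # keep 8 experts per token
        gates = top_vals / top_vals.sum(-1, keepdim=True)  # renormalize kept gates to 1
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):          # naive dispatch; real kernels batch this
            tok, slot = (top_idx == e).nonzero(as_tuple=True)
            if tok.numel():
                out[tok] += gates[tok, slot].unsqueeze(-1) * expert(x[tok])
        return out

y = TopKSigmoidMoE()(torch.randn(4, 256))                  # 4 tokens through the layer
```

Because only 8 of 128 experts run per token, per-token compute scales with the activated parameter count rather than the 106B total, which is the source of the model's efficiency.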
2. Multi-Stage Training and Post-Training Procedures
GLM-4.5-Air training is characterized by a tiered strategy:
- Pre-training utilizes a 23 trillion token corpus with a maximum sequence length of 4,096. The objective is language modeling across a wide range of domains, including general web, code, math, and science content.
- Mid-training targets reasoning and agentic skills by introducing curated domain and instruction data, expanding sequence lengths to 32K and eventually 128K tokens using best-fit data packing (see the packing sketch after this list). This phase is critical for extended-context reasoning and denser instruction chains.
- Post-training includes:
- Supervised Fine-Tuning (SFT): Long chain-of-thought completions, agentic dialog templates, and explicit function call data.
- Domain-Specific Reinforcement Learning (RL): Fine-tuning on reasoning/coding (AIME, SWE-bench) and tool-using tasks (function/terminal calls, web browsing).
- Expert Iteration and Self-Distillation: Aggregates skills from various fine-tuned models into a unified agent capable of both slow, deliberative "thought" and immediate direct responses.
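As referenced in the mid-training item above, here is a sketch of best-fit sequence packing: documents are sorted longest-first, and each is placed in the training window with the least remaining room that still fits it. The window size and the greedy best-fit-decreasing heuristic are illustrative assumptions; the paper's exact packing procedure may differ.

```python
from bisect import insort

def best_fit_pack(doc_lengths, window=32_768):
    """Assign each document to a window via best-fit decreasing bin packing."""
    bins = []            # (remaining_space, bin_id), kept sorted ascending by space
    assignment = {}
    n_bins = 0
    for doc, length in sorted(enumerate(doc_lengths), key=lambda kv: -kv[1]):
        # first adequate bin in ascending order = tightest fit
        fit = next((i for i, (space, _) in enumerate(bins) if space >= length), None)
        if fit is None:  # nothing fits: open a new window
            assignment[doc] = n_bins
            insort(bins, (window - length, n_bins))
            n_bins += 1
        else:
            space, bid = bins.pop(fit)
            assignment[doc] = bid
            insort(bins, (space - length, bid))
    return assignment, n_bins

# e.g., pack four documents of varying token lengths into 32K-token windows
print(best_fit_pack([30_000, 10_000, 2_500, 20_000]))  # -> 2 windows, little waste
```

Tight packing matters at 32K-128K context because padding waste grows with window size; best-fit keeps windows nearly full without splitting documents.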
Combined with the model's dynamic expert routing, this pipeline maximizes coverage of reasoning-intensive and agentic patterns while progressively extending the usable context length.
3. Performance on Reasoning, Coding, and Agentic Tasks
GLM-4.5-Air demonstrates competitive performance relative to much larger models across ARC benchmarks:
- TAU-Bench (agentic tasks): Scores 77.9% (TAU-Retail) and 60.8% (TAU-Airline), confirming strong agent behavior and external tool handling capability.
- AIME 24 (mathematical reasoning): Achieves 89.4% (vs. 91.0% for GLM-4.5). This demonstrates robust quantitative reasoning.
- SWE-bench Verified (coding): Scores 57.6%. Although below the full 355B GLM-4.5 (64.2%), it surpasses many larger-scale open and proprietary models in code generation/modification tasks.
- Aggregate ARC Ranking: GLM-4.5-Air is ranked 6th overall across combined agentic, reasoning, and coding evaluations—remarkable given the parameter budget.
- Performance is consistently above that of other ~100B-scale open models and matches or exceeds strong baselines on common agentic and reasoning benchmarks.
4. Applications and Use Cases
GLM-4.5-Air is optimized for scenarios that require both multi-step reasoning and complex agentic actions:
- Hybrid Reasoning System: Supports both "chain-of-thought" generation and rapid direct answer completion. The model can dynamically alternate between slow, multi-turn deliberation and expedited completion depending on prompt context and instruction format.
- Agentic Task Execution: Excels at explicit function calls, code execution, and web browsing; training includes explicit output-format constraints and penalties to ensure process alignment (e.g., correct tool-call syntax). See the client sketch after this list.
- Coding Assistance: Performs end-to-end GitHub issue resolution and codebase modifications, making it well suited for integration into continuous integration/continuous deployment (CI/CD) workflows and advanced code reasoning assistants.
- Long Context Tasks: Demonstrates high fidelity in summarizing lengthy documents, multi-domain translations, and contextual chain-of-thought explanations with sequence lengths up to 128K tokens (during RL-enhanced stages).
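A hypothetical client-side sketch of the agentic interface, assuming an OpenAI-compatible server (as commonly provided by engines such as vLLM). The endpoint URL, model id, the `get_weather` tool, and the `enable_thinking` toggle are illustrative assumptions, not documented values from the release.

```python
# Hypothetical tool-calling sketch against an OpenAI-compatible endpoint.
# The URL, model id, get_weather tool, and thinking toggle are all assumed
# names for illustration; consult the release docs for the real interface.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="zai-org/GLM-4.5-Air",
    messages=[{"role": "user", "content": "What's the weather in Prague?"}],
    tools=tools,
    # Hybrid-mode deployments typically expose a switch between deliberative
    # and direct responses; the flag below is an assumed, not official, name.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

# A process-aligned agentic model should emit a structured tool call here
# rather than free text describing the weather.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```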
5. Availability, Reproducibility, and Integration Resources
- Open Access: Model weights, codebase, and detailed documentation are available through multiple platforms (e.g., Z.ai, BigModel.cn, Hugging Face at https://huggingface.co/zai-org/GLM-4.5).
- Evaluation Toolkit: A standardized evaluation toolkit for ARC benchmarks and custom agentic or reasoning tasks is open-sourced (https://github.com/zai-org/glm-simple-evals).
- Usage Documentation: The release includes turnkey inference pipelines, guidance for extended-context operation, environment setup scripts, and fine-tuning protocols for custom domains and tasks; a minimal loading sketch follows this list.
- Supporting Materials: Research artifacts and replication instructions are included to promote transparent benchmarking and community-driven research.
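For completeness, a minimal local-inference sketch with Hugging Face transformers. The `zai-org/GLM-4.5-Air` repository id is inferred from the organization linked above, and the generation settings are illustrative; the official usage documentation is authoritative.

```python
# Minimal local-inference sketch; repo id and settings are assumptions,
# not verified values from the release documentation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.5-Air"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

messages = [{"role": "user",
             "content": "Summarize the GLM-4.5-Air design in two sentences."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```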
6. Comparative Analysis and Broader Significance
Relative to both the predecessor GLM-4.5 (355B) and contemporary 100B/175B models, GLM-4.5-Air achieves substantial efficiency and flexibility:
- Efficiency: The one-dense-layer design, selective expert activation, and lightweight routing keep inference costs well below those of monolithic dense counterparts, making the model suitable for real-time and resource-constrained deployments (a back-of-envelope comparison follows this list).
- Performance: Despite parameter compression, GLM-4.5-Air closely tracks the full GLM-4.5 in reasoning (AIME), coding (SWE-bench), and agentic tasks (TAU-bench). It surpasses comparable 100B+ open models and even rivals larger proprietary LLMs in several domains.
- Agentic Capabilities: The hybrid reasoning mode and highly structured post-training make the model adept at multi-step planning, tool use, and context-sensitive agentic decision-making, a property uncommon in models at this parameter scale.
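The efficiency claim can be made concrete with the standard rule of thumb that a decoder forward pass costs roughly 2 FLOPs per active parameter per token; the approximation itself, not the parameter counts, is the assumption here.

```python
# Back-of-envelope compute comparison using the common ~2 * N_active
# FLOPs-per-token approximation for decoder forward passes.
def flops_per_token(active_params):
    return 2 * active_params  # forward-pass multiply-accumulates, roughly

air  = flops_per_token(12e9)    # GLM-4.5-Air: 12B activated parameters
full = flops_per_token(32e9)    # GLM-4.5:     32B activated parameters

print(f"Air forward cost per token : {air:.1e} FLOPs")
print(f"Full forward cost per token: {full:.1e} FLOPs")
print(f"Relative cost: {air / full:.0%}")            # ~38% of GLM-4.5 per token
print(f"Activated fraction of total: {12/106:.1%}")  # ~11% of the 106B parameters
```

By this estimate, each generated token costs roughly 38% of GLM-4.5's per-token compute while touching only about 11% of the total parameter budget, which is what makes the small benchmark gaps reported above notable.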
This model represents a significant advance in efficient, multitask LLMs and provides a reproducible and extensible foundation for research in agentic and reasoning-intensive natural language processing (Team et al., 8 Aug 2025).