GLM-4.5-Air: Compact MoE LLM for ARC Tasks

Updated 11 August 2025
  • GLM-4.5-Air is a compact, open-source mixture-of-experts large language model characterized by selective expert activation, innovative transformer design, and multi-stage training.
  • It delivers competitive performance on reasoning, coding, and agentic tasks by employing dynamic expert routing and hybrid inference modes.
  • Its flexible training pipeline and efficient architecture make it ideal for real-world applications like coding assistance, long-context summarization, and tool-directed operations.

GLM-4.5-Air is a compact, open-source Mixture-of-Experts (MoE) LLM developed as part of the GLM-4.5 series with emphasis on agentic, reasoning, and coding (ARC) tasks. Designed to deliver substantial reasoning and agentic ability with high parameter efficiency, GLM-4.5-Air implements key innovations in MoE transformer architecture, progressive multi-stage training, and task-specific post-training. It is optimized to serve both as a high-performance research benchmark and a practical agentic LLM for real-world applications.

1. Model Architecture and MoE Design

GLM-4.5-Air employs a compact transformer-based MoE architecture comprising 106 billion total parameters with 12 billion activated parameters per forward pass. The model integrates 45 MoE transformer layers, each containing 128 expert modules. Only 8 experts are routed per token, selected via learned sigmoid gating mechanisms with loss-free balance routing to maintain uniform expert utilization. Unlike the full GLM-4.5 variant (355B parameters, 32B activated), GLM-4.5-Air reduces hidden dimensions and dense layers (only a single dense layer versus three), but maintains a deep layer stack to maximize reasoning capability.
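For orientation, these headline figures imply that only about 11% of the parameters are active for any given token. A minimal sketch of the shape reported above (field names are illustrative, not the official configuration schema):

```python
from dataclasses import dataclass

@dataclass
class GLM45AirShape:
    """Headline figures for GLM-4.5-Air as reported above.
    Field names are illustrative, not the official config schema."""
    total_params: int = 106_000_000_000   # 106B total parameters
    active_params: int = 12_000_000_000   # 12B activated per forward pass
    moe_layers: int = 45                  # MoE transformer layers
    experts_per_layer: int = 128          # expert modules per MoE layer
    experts_per_token: int = 8            # experts routed per token

shape = GLM45AirShape()
print(f"Active fraction: {shape.active_params / shape.total_params:.1%}")  # ~11.3%
print(f"Expert sparsity: {shape.experts_per_token}/{shape.experts_per_layer} routed per token")
```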

Key architectural features include:

  • Grouped-query attention with partial rotary position embeddings (RoPE), enhancing long-context handling.
  • Expanded attention heads: 96 heads with head dimension 128, empirically shown to boost reasoning accuracy.
  • An MoE Multi-Token Prediction (MTP) output layer that supports speculative decoding.
  • Flexible operation modes: a hybrid "thinking" (deliberative) mode and a "direct response" (fast inference) mode to suit different task demands.
  • Mixture-of-experts computation is abstractly represented as:

y = \sum_{i=1}^{E} G(x)_i f_i(x)

where x is the input, f_i are the expert transformations, G(x)_i are the gating weights (nonzero only for the 8 routed experts), and E = 128.
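A minimal PyTorch sketch of this top-8 sigmoid-gated computation (the module layout and feed-forward shape are illustrative, and the loss-free balance routing described above is omitted for brevity):

```python
import torch
import torch.nn as nn

class SigmoidTopKMoE(nn.Module):
    """Illustrative top-k MoE layer: sigmoid gate scores route each token
    to k of E experts (loss-free balance routing omitted for brevity)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 128, k: int = 8):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        scores = torch.sigmoid(self.gate(x))              # G(x): (tokens, E)
        weights, indices = scores.topk(self.k, dim=-1)    # keep the k routed experts
        out = torch.zeros_like(x)
        # y = sum_i G(x)_i * f_i(x), with G(x)_i nonzero only for routed experts
        for slot in range(self.k):
            idx, w = indices[:, slot], weights[:, slot].unsqueeze(-1)
            for e in idx.unique().tolist():
                mask = idx == e
                out[mask] += w[mask] * self.experts[e](x[mask])
        return out
```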

2. Multi-Stage Training and Post-Training Procedures

GLM-4.5-Air training is characterized by a tiered strategy:

  • Pre-training utilizes a 23-trillion-token corpus with a maximum sequence length of 4,096 tokens. The objective is standard language modeling across a wide range of domains, including general web, code, math, and science content.
  • Mid-training targets reasoning and agentic skills by introducing curated domain and instruction data, expanding sequence lengths to 32K and eventually 128K tokens using best-fit data packing (a simplified packing sketch follows this list). This phase is critical for extended-context reasoning and denser instruction chains.
  • Post-training includes:
    • Supervised Fine-Tuning (SFT): Long chain-of-thought completions, agentic dialog templates, and explicit function call data.
    • Domain-Specific Reinforcement Learning (RL): Fine-tuning on reasoning/coding (AIME, SWE-bench) and tool-using tasks (function/terminal calls, web browsing).
    • Expert Iteration and Self-Distillation: Aggregates skills from various fine-tuned models into a unified agent capable of both slow, deliberative "thought" and immediate direct responses.
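The best-fit data packing used in mid-training can be sketched as a bin-packing pass that fills fixed-length training sequences with whole documents, minimizing wasted context. A simplified best-fit-decreasing sketch (the actual pipeline's tie-breaking and truncation rules are not specified in the source):

```python
def best_fit_pack(doc_lengths, seq_len=32_768):
    """Pack documents (by token length) into fixed-capacity sequences.

    Each document goes into the open sequence whose remaining space fits it
    most tightly; a new sequence is opened when none fits.
    Returns a list of [remaining_space, packed_lengths] bins.
    """
    bins = []
    for length in sorted(doc_lengths, reverse=True):   # best-fit decreasing
        length = min(length, seq_len)                  # clip oversized documents
        best = None
        for b in bins:
            if length <= b[0] and (best is None or b[0] < best[0]):
                best = b
        if best is None:                               # open a new sequence
            best = [seq_len, []]
            bins.append(best)
        best[0] -= length
        best[1].append(length)
    return bins

packed = best_fit_pack([30_000, 28_000, 5_000, 2_500, 1_000])
print(len(packed))  # 3 sequences for this toy mix of document lengths
```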

The training pipeline maximizes coverage over reasoning-intensive and agentic patterns via dynamic expert routing and extended context learning.

3. Performance on Reasoning, Coding, and Agentic Tasks

GLM-4.5-Air demonstrates competitive performance relative to much larger models across ARC benchmarks:

  • TAU-bench (agentic tasks): Scores 77.9% (TAU-Retail) and 60.8% (TAU-Airline), confirming strong agentic behavior and external tool handling.
  • AIME 24 (mathematical reasoning): Achieves 89.4% (vs. 91.0% for GLM-4.5). This demonstrates robust quantitative reasoning.
  • SWE-bench Verified (coding): Scores 57.6%. Although below the full 355B GLM-4.5 (64.2%), it surpasses many larger-scale open and proprietary models in code generation/modification tasks.
  • Aggregate ARC Ranking: GLM-4.5-Air is ranked 6th overall across combined agentic, reasoning, and coding evaluations—remarkable given the parameter budget.
  • Performance is consistently above that of other 100B-scale open models, matching or exceeding much larger baselines on common agentic and reasoning benchmarks.

4. Applications and Use Cases

GLM-4.5-Air is optimized for scenarios that require both multi-step reasoning and complex agentic actions:

  • Hybrid Reasoning System: Supports both "chain-of-thought" generation and rapid direct answer completion. The model can dynamically alternate between slow, multi-turn deliberation and expedited completion depending on prompt context and instruction format (a request-level sketch follows this list).
  • Agentic Task Execution: Excels at explicit function calls, code execution, and web browsing; includes explicit output format constraints and penalties to ensure process alignment (e.g., correct tool call syntax).
  • Coding Assistance: Performs end-to-end GitHub issue resolution and codebase modifications, making it well suited for integration into continuous integration/continuous deployment (CI/CD) workflows and advanced code reasoning assistants.
  • Long Context Tasks: Demonstrates high fidelity in summarizing lengthy documents, multi-domain translations, and contextual chain-of-thought explanations with sequence lengths up to 128K tokens (during RL-enhanced stages).
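As an illustration of how the hybrid modes are typically exposed to applications, here is a hedged sketch of a chat-completion request that toggles deliberative "thinking" versus direct response. The endpoint URL, model id, and `thinking` field are hypothetical placeholders, not the documented GLM-4.5-Air API:

```python
import json
import urllib.request

def chat(prompt: str, thinking: bool, url="http://localhost:8000/v1/chat/completions"):
    """Send one chat turn; `thinking` is a hypothetical switch between
    deliberative chain-of-thought and fast direct-response inference."""
    payload = {
        "model": "glm-4.5-air",           # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
        "thinking": thinking,             # hypothetical hybrid-mode flag
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Slow, multi-step deliberation for a hard reasoning task:
# chat("Prove that the sum of two odd numbers is even.", thinking=True)
# Fast direct completion for a lookup-style query:
# chat("What is the capital of France?", thinking=False)
```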

5. Availability, Reproducibility, and Integration Resources

  • Open Access: Model weights, codebase, and detailed documentation are available through multiple platforms (e.g., Z.ai, BigModel.cn, Hugging Face at https://huggingface.co/zai-org/GLM-4.5); a minimal loading sketch follows this list.
  • Evaluation Toolkit: A standardized evaluation toolkit for ARC benchmarks and custom agentic or reasoning tasks is open-sourced (https://github.com/zai-org/glm-simple-evals).
  • Usage Documentation: Release includes turnkey inference pipelines, guidance for extended context operation, environment setup scripts, and fine-tuning protocols for custom domains and tasks.
  • Supporting Materials: Research artifacts and replication instructions are included to promote transparent benchmarking and community-driven research.
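A minimal loading sketch with Hugging Face `transformers` (the `zai-org/GLM-4.5-Air` repository id is inferred from the GLM-4.5 link above; check the model card for the exact id and recommended settings):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zai-org/GLM-4.5-Air"  # assumed repo id; see the Hugging Face link above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # load in the checkpoint's native precision
    device_map="auto",    # shard across available GPUs (requires accelerate);
)                         # trust_remote_code=True may be needed on older transformers

inputs = tokenizer(
    "Write a Python function that reverses a string.", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```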

6. Comparative Analysis and Broader Significance

Relative to both the predecessor GLM-4.5 (355B) and contemporary 100B/175B models, GLM-4.5-Air achieves substantial efficiency and flexibility:

  • Efficiency: The one-dense-layer design, selective expert activation, and lightweight routing enable inference costs well below those of monolithic dense-model counterparts. This makes the model suitable for real-time and resource-constrained deployments.
  • Performance: Despite parameter compression, GLM-4.5-Air closely tracks the full GLM-4.5 in reasoning (AIME), coding (SWE-bench), and agentic tasks (TAU-bench). It surpasses comparable 100B+ open models and even rivals larger proprietary LLMs in several domains.
  • Agentic Capabilities: The hybrid reasoning mode and highly structured post-training make the model adept at multi-step planning, tool use, and context-sensitive agentic decision-making, a property uncommon in models at this parameter scale.

This model represents a significant advance in efficient, multitask LLMs and provides a reproducible and extensible foundation for research in agentic and reasoning-intensive natural language processing (Team et al., 8 Aug 2025).
