GLM-4 All Tools: Autonomous Multimodal LLM

Updated 2 March 2026
  • GLM-4 All Tools is an advanced large language model designed for autonomous tool invocation and extended context processing.
  • It employs a Transformer encoder-decoder with innovations like 2D-RoPE, group-query attention, and native XML-based tool integration.
  • Benchmark evaluations demonstrate state-of-the-art performance in multilingual, multimodal, and agentic workflows across diverse tasks.

GLM-4 All Tools Model refers to an advanced LLM system in the ChatGLM family, designed for autonomous tool use, long-context understanding, and strong multilingual and multimodal task performance. Building on the architectural innovations and training methodologies of the GLM-4 series, the All Tools variant is explicitly aligned to invoke external functionality—such as web browsing, code execution, and user-defined APIs—in a manner closely integrated with its token generation. The GLM-4 All Tools system is open-sourced by Zhipu AI and THUDM and is positioned as a leading open LLM in benchmarks spanning reasoning, instruction following, code generation, and agentic workflows (GLM et al., 2024, Team et al., 1 Jul 2025).

1. Model Architecture and Pretraining

GLM-4 All Tools is based on a Transformer encoder–decoder backbone with several significant modifications relative to canonical GPT architectures: all bias terms are removed except those in the Q/K/V projections of the attention layers, LayerNorm is replaced with RMSNorm, the FFN stack uses SwiGLU activation, and rotary positional embeddings (RoPE) are extended to two dimensions for long-context processing. Group-Query Attention (GQA) replaces Multi-Head Attention (MHA) to reduce key–value (KV) cache size. Because GQA uses fewer parameters than MHA, the FFN dimension is increased to $\tfrac{10}{3}\times$ the hidden size, which keeps the model within the fixed ~130B parameter regime (GLM et al., 2024).
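The KV-cache saving from GQA and the enlarged FFN width can be made concrete with a little arithmetic. The sketch below is illustrative only: the layer count, head counts, head dimension, and 16-bit cache precision are hypothetical values chosen for the example, not the published GLM-4 configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one tensor for keys, one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def ffn_dim(hidden_size):
    """FFN inner dimension at 10/3 of the hidden size (rounded down)."""
    return hidden_size * 10 // 3


# Hypothetical configuration, chosen only for illustration.
LAYERS, HEADS, KV_GROUPS, HEAD_DIM, SEQ_LEN = 40, 32, 2, 128, 131072

mha = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, SEQ_LEN)      # one KV head per query head
gqa = kv_cache_bytes(LAYERS, KV_GROUPS, HEAD_DIM, SEQ_LEN)  # KV heads shared per group
print(f"MHA cache: {mha / 2**30:.1f} GiB, GQA cache: {gqa / 2**30:.1f} GiB")
print(f"reduction factor: {mha // gqa}x, FFN dim: {ffn_dim(HEADS * HEAD_DIM)}")
```

The cache shrinks by the ratio of query heads to KV groups, which is why GQA matters most at long context lengths where the cache dominates memory.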

Pretraining is conducted on approximately 10 trillion tokens, with the dominant languages being Chinese and English and supporting corpora from 24 additional languages. Data sources include filtered and deduplicated web pages, Wikipedia, books, code repositories, and academic literature. GLM-4 All Tools employs a byte-level BPE vocabulary (merged with cl100k_base, 150K tokens), with a context window of 128K tokens (and up to 1M tokens in experimental settings) (GLM et al., 2024, Team et al., 1 Jul 2025).

The pretraining objective is an autoregressive blank-infilling task as defined in GLM-130B:

$$\mathcal{L}_{\mathrm{pre}} = -\mathbb{E}_{(x,y)\sim\mathcal{D}} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$$
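For a single training example, the inner sum of this objective is the negative log-likelihood of the target span. A minimal sketch, assuming the per-token probabilities of the correct tokens are already available (the values below are hypothetical):

```python
import math


def blank_infilling_nll(target_token_probs):
    """Sum of -log p(y_t | y_<t, x) over the T tokens of one masked span."""
    return -sum(math.log(p) for p in target_token_probs)


# Hypothetical probabilities the model assigns to each correct target token.
probs = [0.9, 0.5, 0.25]
loss = blank_infilling_nll(probs)
print(f"span loss: {loss:.4f}")
```

Averaging this quantity over sampled (x, y) pairs gives a Monte Carlo estimate of the expectation in the formula.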

2. Post-Training Alignment and Reinforcement Learning

GLM-4 All Tools alignment is performed in a multi-stage protocol. First, supervised fine-tuning (SFT) is applied over a corpus of human-authored prompt–response pairs encompassing safety, factuality, relevance, and helpfulness. This SFT teaches instruction following and multi-turn dialogue in both Chinese and English (GLM et al., 2024).

Subsequently, Reinforcement Learning from Human Feedback (RLHF) is introduced. Annotators score response pairs across relevant dimensions; a reward model predicts these scores, and the final policy $\pi_\theta$ is optimized using Proximal Policy Optimization (PPO) with a KL penalty anchoring it to a reference model:

$$\mathcal{L}_{\mathrm{PPO}} = \mathbb{E}_{\tau\sim\pi_\theta} \left[ r_\phi(\tau) - \beta\,\mathrm{KL}\left[\pi_\theta(\cdot \mid x) \| \pi_{\mathrm{ref}}(\cdot \mid x)\right] \right]$$
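The quantity inside the expectation can be evaluated directly for small discrete distributions. The sketch below is not a PPO implementation; it only computes the KL-penalized reward for one sample, with hypothetical distributions and a hypothetical $\beta$.

```python
import math


def kl_divergence(p, q):
    """KL[p || q] for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_penalized_reward(reward, policy_probs, ref_probs, beta):
    """r_phi(tau) - beta * KL[pi_theta || pi_ref] for a single sample."""
    return reward - beta * kl_divergence(policy_probs, ref_probs)


# Hypothetical values for illustration only.
policy = [0.7, 0.2, 0.1]
reference = [0.5, 0.3, 0.2]
print(kl_penalized_reward(reward=1.0, policy_probs=policy, ref_probs=reference, beta=0.1))
```

A larger $\beta$ pulls the policy back toward the reference model; when the two distributions coincide the penalty vanishes and the raw reward is returned.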

GLM-4.6V, the direct "All Tools" multimodal extension, applies a reasoning-centric reinforcement learning framework with Curriculum Sampling (RLCS). Offline and online difficulty grading stratifies samples, favoring mid-difficulty instances. Dynamic sampling expansion addresses class imbalances with

$$\mathrm{expansion\_ratio}_t = \frac{1}{1 - \mathrm{not\_valid\_sample\_rate}_{t-1}},$$

with exponential moving average smoothing. The RL objective omits KL/entropy regularization, instead focusing on GRPO (Group Relative Policy Optimization):

$$L_{\text{GRPO}}(\theta) = -\mathbb{E}_{\tau\sim\pi_\theta} \left[ r(\tau) \prod_{t} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{old}}(a_t \mid s_t)} \right].$$

(Team et al., 1 Jul 2025)
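Both formulas in this section reduce to short computations. The following sketch is a reading of the equations as written here, not the authors' training code: the smoothing factor `alpha`, the function names, and the sample numbers are all assumptions made for illustration.

```python
import math


def expansion_ratio(not_valid_rate_prev):
    """1 / (1 - not_valid_sample_rate_{t-1}): oversampling factor for step t."""
    if not 0.0 <= not_valid_rate_prev < 1.0:
        raise ValueError("rate must lie in [0, 1)")
    return 1.0 / (1.0 - not_valid_rate_prev)


def ema_update(previous, observed, alpha=0.1):
    """Exponential moving average; alpha is a hypothetical smoothing factor."""
    return (1.0 - alpha) * previous + alpha * observed


def sequence_ratio(new_logprobs, old_logprobs):
    """Product over t of pi_theta / pi_old, computed in log space for stability."""
    return math.exp(sum(n - o for n, o in zip(new_logprobs, old_logprobs)))


def grpo_style_loss(reward, new_logprobs, old_logprobs):
    """Single-trajectory estimate of -r(tau) * prod_t ratio_t."""
    return -reward * sequence_ratio(new_logprobs, old_logprobs)


# Hypothetical numbers for illustration.
rate = ema_update(previous=0.20, observed=0.40)  # smoothed invalid-sample rate
print(expansion_ratio(rate))
print(grpo_style_loss(1.0, [-0.5, -1.0], [-0.6, -1.1]))
```

If half of the previous step's samples were unusable, the ratio is 2, so twice as many rollouts are drawn to keep the effective batch size steady.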

3. Tool Integration and Invocation Protocol

GLM-4 All Tools implements a native tool use protocol, with explicit alignment to decide autonomously when and which tool(s) to invoke—choices include a web browser, Python interpreter, text-to-image model, and user-defined functions. The mechanism is as follows:

  • The model examines the user’s query $x$ and its internal plan $\pi$, computing

$$p(\mathrm{tool}\mid\mathrm{context}) = \mathrm{softmax}(f_\theta(\mathrm{context}))$$

over discrete tool options.

  • During generation, emission of a special <CALL_TOOL:tool_i> token triggers the model to pause, invoke API$_{\text{tool}_i}$(context), and append the returned result to its context before continuing.
  • This process allows multi-stage workflows, such as web-augmented question answering or code-execution-based computation (GLM et al., 2024).
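The pause-invoke-append cycle described in these steps can be sketched as a small driver loop. Everything below is a stand-in: `fake_model`, the `TOOLS` registry, and the bracketed result format are hypothetical, and only the `<CALL_TOOL:tool_i>` marker comes from the description above.

```python
import re

CALL_PATTERN = re.compile(r"<CALL_TOOL:(\w+)>")

# Hypothetical tool registry; real tools would call a browser, interpreter, etc.
TOOLS = {
    "python": lambda context: "4",
    "browser": lambda context: "stub search result",
}


def run_with_tools(generate_step, context, max_steps=8):
    """Alternate between generation and tool calls until no call is emitted."""
    for _ in range(max_steps):
        chunk = generate_step(context)
        context += chunk
        match = CALL_PATTERN.search(chunk)
        if match is None:
            return context  # model finished without requesting a tool
        tool_name = match.group(1)
        result = TOOLS[tool_name](context)  # raises KeyError for unknown tools
        context += f"\n[{tool_name} result] {result}\n"
    return context


def fake_model(context):
    """Stand-in for the LLM: asks for Python once, then answers."""
    if "[python result]" not in context:
        return "I should compute this. <CALL_TOOL:python>"
    return "The answer is 4."


transcript = run_with_tools(fake_model, "User: what is 2 + 2?\n")
print(transcript)
```

The `max_steps` bound is a design choice of this sketch, guarding against a model that keeps requesting tools indefinitely.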

In GLM-4.6V, the protocol is formalized by XML-style tags within the context, with each call wrapped in a <tool_call> block. A dedicated tool manager processes the <tool_call> blocks during inference, making API calls and injecting results as contextual feedback. No tokenizer modifications or hacks are required; the model learns to emit well-formed XML tool calls natively (Team et al., 1 Jul 2025).
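A tool manager of this kind might look roughly like the sketch below. The payload layout inside `<tool_call>` is not specified here, so the JSON body with `name` and `arguments` keys, the `<tool_response>` wrapper, and the `REGISTRY` of functions are all assumptions made for the example.

```python
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def add(a, b):
    return a + b


# Hypothetical registry of user-defined functions.
REGISTRY = {"add": add}


def handle_tool_calls(model_output):
    """Execute each <tool_call> block and return the results as feedback text."""
    feedback = []
    for payload in TOOL_CALL_RE.findall(model_output):
        call = json.loads(payload)  # assumed JSON payload, not a documented format
        result = REGISTRY[call["name"]](**call["arguments"])
        feedback.append(f"<tool_response>{json.dumps(result)}</tool_response>")
    return "\n".join(feedback)


output = 'Let me add these. <tool_call>{"name": "add", "arguments": {"a": 2, "b": 3}}</tool_call>'
print(handle_tool_calls(output))
```

Because the calls are ordinary text, the manager needs only pattern matching on the generated string, which is consistent with the point that no tokenizer changes are required.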

4. Long-Context Handling and Sequence Modeling

Extended context support is implemented via 2D-RoPE positional encodings that generalize rotary embeddings to row and column granularity, allowing the model to process sequences up to 128K (and experimentally 1M) tokens. Second-stage continual training ramps up supported context lengths (from 8K to 32K and finally 131,072 tokens for GLM-4.6V) and expands context-parallel processing width by 4× to manage long-sequence memory and computation.
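One common way to build a two-dimensional rotary embedding is to split each vector in half and rotate one half by the row index and the other by the column index. The sketch below follows that layout as an assumption; the exact GLM-4 formulation may differ, and the vectors and base frequency are illustrative.

```python
import math


def rotate_pairs(vec, position, base=10000.0):
    """Standard 1D RoPE: rotate consecutive pairs by position-dependent angles."""
    out = []
    half = len(vec) // 2
    for i in range(half):
        theta = position * base ** (-i / half)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


def rope_2d(vec, row, col):
    """Rotate the first half of vec by the row index, the second half by the column index."""
    mid = len(vec) // 2
    return rotate_pairs(vec[:mid], row) + rotate_pairs(vec[mid:], col)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 0.8, 0.2, -0.7, 0.4, 0.9]
k = [0.6, -0.1, 0.7, 0.3, -0.5, 0.2, 0.1, -0.4]

# Both pairs of positions share the same (row, col) offset of (2, 3).
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 13, 25), rope_2d(k, 11, 22))
print(score_a, score_b)
```

The two scores agree up to floating-point error because rotations compose: the attention score depends only on the relative offset between query and key, which is the property that lets rotary schemes extrapolate across long contexts.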

FlashAttention is used for efficient attention computation, and GQA reduces the overall KV cache and memory footprint. These design choices enable the model to maintain entire histories of tool invocations, code revisions, GUI states, or multi-hop CoT chains in a single rollout (GLM et al., 2024, Team et al., 1 Jul 2025).

5. Empirical Performance and Benchmarks

GLM-4 All Tools has been evaluated on a broad set of academic and end-to-end agentic benchmarks, consistently demonstrating state-of-the-art or competitive results among open LLMs. In direct comparison to closed-source models such as GPT-4, GPT-4 Turbo, and Gemini-2.5-Flash, the model frequently matches or surpasses baselines on alignment, reasoning, function-calling, and code execution tasks (GLM et al., 2024, Team et al., 1 Jul 2025).

Tool-Based Task Results (GLM-4.6V):

| Task | GLM-4.6V (106B) | Qwen2.5-VL-72B | Gemini-2.5-Flash |
|---|---|---|---|
| Coding (Design2Code, UI ≥80%) | 88.6% | 41.9% | 34.1% |
| OSWorld (100 steps, GUI Agent) | 37.2% | 8.8% | - |
| WebQuest SingleQA | 79.5% | 60.5% | - |
| OCRBench (Text Extraction) | 86.5% | 85.1% | - |
| ChartQAPro (Chart Reasoning) | 65.5% | 46.7% | - |

General Academic Benchmarks (GLM-4):

| Model | MMLU | GSM8K | MATH | BBH | GPQA | HumanEval |
|---|---|---|---|---|---|---|
| GPT-4 Turbo (2024-04-09) | 86.7 | 95.6 | 73.4 | 88.2 | 49.3 | 88.2 |
| GLM-4 (0520) | 83.3 | 93.3 | 61.3 | 84.7 | 39.9 | 78.5 |

On end-to-end tool-use tasks, results are close to or ahead of GPT-4. On code-based math solved with the Python tool (e.g., GSM8K), GLM-4 All Tools (91.59%) performs comparably to GPT-4 (92.72%). In browser-based information seeking, it surpasses GPT-4 All Tools (78.08% vs. 67.12%) (GLM et al., 2024).

6. Practical Demonstrations and Open-Source Access

The agentic chain-of-thought and tool-use workflow is illustrated by complex multistep plans, such as "search for the global population from 2000 to 2023, then calculate the average annual growth rate." GLM-4 All Tools executes: (1) web browser retrieval, (2) Python computation, (3) answer synthesis—often matching or surpassing GPT-4 All Tools in such workflows (GLM et al., 2024).
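Step (2) of such a plan is a short compound-growth calculation. The sketch below shows the kind of code the Python tool might run; the population figures are rounded placeholders in billions, not retrieved data, and the function name is hypothetical.

```python
def average_annual_growth_rate(start_value, end_value, years):
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Placeholder figures (in billions), NOT retrieved data; a real run would
# take these values from the browser step.
population_2000 = 6.1
population_2023 = 8.0

rate = average_annual_growth_rate(population_2000, population_2023, 2023 - 2000)
print(f"average annual growth: {rate:.2%}")
```

The compound form is used rather than a simple average of yearly differences because population growth accumulates multiplicatively over the 23-year span.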

Open-source releases from Zhipu AI and THUDM include GLM-4-9B for standard LLM usage, GLM-4V-9B for vision tasks, WebGLM for web-augmented agents, and CodeGeeX for code generation, collectively reaching over 10 million downloads in 2023 (GLM et al., 2024).

7. Significance and Research Context

GLM-4 All Tools and its multimodal extensions (e.g., GLM-4.6V) contribute a unified architecture combining large-scale language and vision modeling, explicit agentic tool invocation, and extended context scaling to the open-source LLM ecosystem. The native XML-driven function-call protocol, curriculum-centric RL alignment, and direct competitive results on tool-augmented benchmarks position the GLM-4 All Tools lineage as a reference agentic LLM system for research and applied use in web, code, GUI, and multimodal environments (GLM et al., 2024, Team et al., 1 Jul 2025).

References: (GLM et al., 2024, Team et al., 1 Jul 2025)
