
AutoGLM: Autonomous GUI Control via RL

Updated 29 March 2026
  • AutoGLM is a family of large foundation agents that autonomously control digital devices through GUIs by integrating multi-turn, multi-task reinforcement learning and multimodal planning-grounding architectures.
  • Its modular three-stage architecture separates observation, high-level planning, and low-level grounding, achieving success rates up to 94.5% on complex real-world tasks.
  • A self-evolving curriculum RL framework combined with innovations such as cross-policy sampling and task-level advantage normalization drives scalable and robust GUI automation.

AutoGLM is a family of large foundation agent systems designed for autonomous control of digital devices through Graphical User Interfaces (GUIs). It integrates advances in multi-turn, multi-task reinforcement learning with multimodal planning-grounding architectures. Building on frameworks such as AgentRL, AutoGLM achieves high success rates on complex real-world GUI tasks, including web browsing and Android device control. Its modular, scalable, curriculum-driven approach decouples high-level planning from low-level grounding and relies on asynchronous actor-critic RL at scale (Liu et al., 2024, Zhang et al., 5 Oct 2025).

1. System Architecture and Workflow

AutoGLM employs a modular three-stage architecture, crucially separating observation, high-level planning, and low-level action execution for robust GUI control (Liu et al., 2024). The pipeline is as follows:

  • Observation and Parsing: Given a raw screenshot ($I_t$), DOM/text tree ($U_t$), and past action history ($H_t$), a lightweight parser extracts GUI elements $E_t = \{e_i\}$, including id, role, text, and bounding-box.
  • Planning Module ($\pi^P$): A large (multi)modal LLM, derived from ChatGLM, processes an encoding of $E_t$ and the user instruction, outputting an intermediate command $\hat{a}_t$ in a restricted domain-specific language (DSL) such as “CLICK id=23”.
  • Grounding Module ($\pi^G$): A vision-LLM maps the symbolic intermediate command $\hat{a}_t$ to a concrete pixel-space event or key-event, using the parsed GUI elements for spatial resolution.
  • Interaction and Feedback: The mapped action is executed in a simulated or real environment; the resulting next state and reward feed an online RL trainer.

This pipeline allows for independent optimization and improvement of planning and grounding modules.
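The hand-off from parsed elements to a planner command to a pixel event can be illustrated with a small sketch. It is a hypothetical illustration, not AutoGLM's code: the `GuiElement` fields follow the attributes listed above (id, role, text, bounding box), while the `parse_command` and `ground` helpers, the regular expression, the bounding-box layout, and the choice to click the centre of the box are assumptions. Only the `CLICK id=23` command form comes from the text.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class GuiElement:
    """One parsed GUI element e_i from E_t (fields as listed in the text)."""
    id: int
    role: str
    text: str
    bbox: tuple[int, int, int, int]  # (left, top, right, bottom); this layout is an assumption


# Hypothetical DSL grammar: an upper-case verb followed by "id=<int>".
_COMMAND = re.compile(r"^(?P<verb>[A-Z]+)\s+id=(?P<id>\d+)$")


def parse_command(command: str) -> tuple[str, int]:
    """Split a planner command such as 'CLICK id=23' into (verb, element id)."""
    match = _COMMAND.match(command.strip())
    if match is None:
        raise ValueError(f"not a valid command: {command!r}")
    return match["verb"], int(match["id"])


def ground(command: str, elements: list[GuiElement]) -> tuple[str, int, int]:
    """Resolve a symbolic command to a pixel-space event at the bbox centre."""
    verb, target_id = parse_command(command)
    by_id = {element.id: element for element in elements}
    if target_id not in by_id:
        raise KeyError(f"element id={target_id} is not on screen")
    left, top, right, bottom = by_id[target_id].bbox
    return verb, (left + right) // 2, (top + bottom) // 2


elements = [
    GuiElement(id=22, role="textbox", text="Search", bbox=(10, 10, 210, 40)),
    GuiElement(id=23, role="button", text="Submit", bbox=(220, 10, 300, 40)),
]
event = ground("CLICK id=23", elements)  # ("CLICK", 260, 25)
```

Because the planner only names an element id, the grounding step can be replaced (for example by a learned vision model) without changing the command format.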

2. Underlying RL Formalism and Infrastructure

AutoGLM formalizes GUI interaction as a Markov decision process $M = (S, A, P, R, \gamma)$:

  • $S$: States $s_t = (I_t, U_t, H_t)$ represent all relevant GUI information.
  • $A$: Actions are parameterized primitives such as $\mathrm{Click}(\mathrm{bbox})$, $\mathrm{Type}(\mathrm{text})$, $\mathrm{Scroll}(\mathrm{dir})$.
  • $P$: Transitions determined by the GUI environment.
  • $R$: Sparse outcome-based terminal reward, $R_{\mathrm{out}}(s_t, a_t) = 1$ if the goal is satisfied at the final state, $0$ otherwise, with optional process-supervised reward models.
  • Objective: Policy optimization via actor-critic RL, maximizing $\mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \gamma^t R(s_t, a_t)\right]$ with a KL constraint to prevent catastrophic policy drift (Liu et al., 2024).
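With a sparse terminal reward, the discounted return of a single trajectory reduces to $\gamma^T$ on success and $0$ on failure, so the objective is the expected value of that quantity under the policy. The sketch below checks this numerically. The function name and the example values of $\gamma$ and $T$ are illustrative assumptions, not settings reported for AutoGLM.

```python
def discounted_return(rewards: list[float], gamma: float) -> float:
    """Compute sum_t gamma^t * R(s_t, a_t) for one trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


gamma = 0.9  # illustrative discount factor
T = 4        # index of the final step

success = [0.0] * T + [1.0]  # reward 1 only at the final step
failure = [0.0] * (T + 1)    # goal never satisfied

g_success = discounted_return(success, gamma)  # equals gamma ** T
g_failure = discounted_return(failure, gamma)  # equals 0.0
```

One consequence is that, for $\gamma < 1$, shorter successful trajectories receive a larger return than longer ones.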

AgentRL is instantiated as the core RL backend. It decouples rollout actors and training optimizers via a fully-asynchronous pipeline across GPU clusters, where rollout engines (each hosting an LLM actor) generate trajectories, which are then asynchronously consumed by training engines for policy-gradient updates. The rollout batch size is dynamic, ensuring maximal utilization and scalability (Zhang et al., 5 Oct 2025).
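The decoupling of rollout and training can be sketched with a producer-consumer queue. This is a single-process `asyncio` toy under stated assumptions, not the AgentRL implementation: the function names, the dictionary used as a trajectory, and the "drain whatever is ready" batching rule are hypothetical stand-ins for rollout engines, training engines, and dynamic batch sizing on a GPU cluster.

```python
import asyncio


async def rollout_engine(engine_id: int, n_episodes: int, queue: asyncio.Queue) -> None:
    """Hypothetical actor: produce trajectories without waiting for the trainer."""
    for episode in range(n_episodes):
        await asyncio.sleep(0)  # stand-in for environment interaction and LLM generation
        await queue.put({"engine": engine_id, "episode": episode})


async def training_engine(queue: asyncio.Queue, total: int) -> list[int]:
    """Hypothetical trainer: consume whatever is ready, so the batch size varies."""
    batch_sizes = []
    consumed = 0
    while consumed < total:
        batch = [await queue.get()]       # wait until at least one trajectory exists
        while not queue.empty():          # then drain everything already available
            batch.append(queue.get_nowait())
        consumed += len(batch)
        batch_sizes.append(len(batch))    # a policy-gradient update would run here
    return batch_sizes


async def main() -> list[int]:
    queue: asyncio.Queue = asyncio.Queue()
    n_engines, n_episodes = 3, 4
    actors = [
        asyncio.create_task(rollout_engine(i, n_episodes, queue))
        for i in range(n_engines)
    ]
    sizes = await training_engine(queue, n_engines * n_episodes)
    await asyncio.gather(*actors)
    return sizes


batch_sizes = asyncio.run(main())  # sizes sum to 12; individual sizes depend on scheduling
```

Neither side blocks on the other beyond the queue itself, which is the property the asynchronous design relies on for utilization.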

3. Intermediate Interface: Planning–Grounding Separation

The central algorithmic innovation in AutoGLM is its intermediate interface, which divides agent cognition into distinct planning and grounding stages:

  • Planning: Operates in symbolic id-space, focusing on determining "what" should be done (e.g., "click element id=17").
  • Grounding: Maps these intentions to "how" they are executed at the pixel or selector level, using visual or accessibility features for accurate interaction.

Formally, the planner emits $\hat{a}_t = (\text{action\_type}, \text{target\_id}, [\text{args}])$ based on GUI element representations and instruction encoding, which the grounder then resolves to concrete screen-space coordinates or low-level events.

This disentanglement allows for:

  • Flexible and transferable planning policies, independent of GUI pixel layouts.
  • Incremental improvement of grounding capabilities using self-supervised data from GUI observations.
  • Empirical performance gains: The planning-grounding modularity yields up to +17.6% absolute gain in success rate on WebArena compared to end-to-end approaches (Liu et al., 2024).
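The layout independence of the planner output can be made concrete with a short sketch. The `Command` tuple mirrors the $(\text{action\_type}, \text{target\_id}, [\text{args}])$ form above; the two layouts, the coordinate values, and the centre-of-box grounding rule are hypothetical and serve only to show that one symbolic command resolves to different low-level events.

```python
from typing import NamedTuple


class Command(NamedTuple):
    """Planner output a_hat_t = (action_type, target_id, args)."""
    action_type: str
    target_id: int
    args: tuple = ()


# Two hypothetical layouts of the same page: element id -> (left, top, right, bottom).
desktop_layout = {17: (400, 300, 520, 340)}
mobile_layout = {17: (40, 900, 320, 960)}


def ground(command: Command, layout: dict[int, tuple[int, int, int, int]]) -> dict:
    """Resolve a symbolic command to a low-level event (bbox centre is an assumption)."""
    left, top, right, bottom = layout[command.target_id]
    return {
        "type": command.action_type,
        "x": (left + right) // 2,
        "y": (top + bottom) // 2,
        "args": command.args,
    }


plan = Command("click", 17)                    # identical planner output for both layouts
desktop_event = ground(plan, desktop_layout)   # x=460, y=320
mobile_event = ground(plan, mobile_layout)     # x=180, y=930
```

The planner never sees pixel coordinates, so a change of resolution or device only affects the grounding step.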

4. Curriculum RL: Self-Evolving Online Training

AutoGLM implements a self-evolving curriculum RL scheme (WebRL) to address challenges of data sparsity and robustness:

  • Stage 1: Behavior cloning warm-start from ~1,000 expert trajectories, establishing an initial success rate near 22%.
  • Stage 2: Iterative RL with task mutation and augmentation:
    • Roll out $\pi^P$ over the current set of tasks $T_k$.
    • Identify failed trajectories, mutate instructions to produce new variants (e.g., varying dates/times).
    • Critic $V_\phi$ filters new tasks by estimated success probability, augmenting the curriculum for the next iteration.
    • Policy and critic updated by actor-critic loss with KL penalization.

This framework employs techniques such as KL-constrained policy steps and actor-confidence-based replay. The agent thus continually adapts to harder and more diverse tasks, avoiding curriculum staleness. This method is critical for surpassing static offline datasets and demonstrates improved data efficiency and performance in empirical studies (Liu et al., 2024).
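One curriculum iteration of Stage 2 can be sketched as follows. This is a minimal illustration under stated assumptions: `evolve_curriculum`, the toy rollout, mutation, and critic functions, and the `[low, high]` acceptance band are all hypothetical, and the actor-critic update with KL penalization is omitted. Only the order of the steps (roll out, mutate failures, filter with the critic, extend the task set) follows the text.

```python
import random
from typing import Callable


def evolve_curriculum(
    tasks: list[str],
    rollout: Callable[[str], bool],
    mutate: Callable[[str, random.Random], str],
    critic: Callable[[str], float],
    rng: random.Random,
    low: float = 0.05,
    high: float = 0.75,
) -> list[str]:
    """One hypothetical iteration: roll out, mutate failures, filter by critic score."""
    failed = [task for task in tasks if not rollout(task)]
    candidates = [mutate(task, rng) for task in failed]
    # Keep variants the critic rates as neither hopeless nor already solved.
    # The [low, high] band is an assumption, not a reported setting.
    kept = [task for task in candidates if low <= critic(task) <= high]
    return tasks + kept


def toy_rollout(task: str) -> bool:
    """Stand-in for running the planner: only 'easy' tasks succeed."""
    return task.startswith("easy")


def toy_mutate(task: str, rng: random.Random) -> str:
    """Stand-in for instruction mutation: vary a date in the instruction."""
    return f"{task} on day {rng.randint(1, 28)}"


def toy_critic(task: str) -> float:
    """Stand-in for V_phi: estimated success probability of a task."""
    return 0.5 if "table" in task else 0.0


rng = random.Random(0)
tasks_k = ["easy: open settings", "book a table", "export the report"]
tasks_next = evolve_curriculum(tasks_k, toy_rollout, toy_mutate, toy_critic, rng)
```

In this toy run both failed tasks are mutated, but only the variant of "book a table" passes the critic filter and joins the next task set.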

5. AgentRL Algorithmic Extensions for AutoGLM

AutoGLM incorporates and extends key AgentRL algorithmic innovations:

  • Cross-Policy Sampling: At each decision step $t$, actions are sampled with equal probability from the current ($\theta_{\rm new}$) or stale ($\theta_{\rm old}$) policy snapshot. This maintains exploration and mitigates premature convergence:

$$m_t \sim \mathrm{Uniform}\{\text{new}, \text{old}\}, \quad a_t \sim \pi_{\theta_{m_t}}(a_t \mid s_t)$$

Importance weighting ensures unbiased surrogate loss computation for PPO/GRPO optimization (Zhang et al., 5 Oct 2025).

  • Task-Level Advantage Normalization: For each task $i$, token-level advantages $\hat{A}_{i,s,g,t,k}$ are normalized to zero mean, unit variance before being used in policy updates:

$$\tilde{A}_{i,s,g,t,k} = \frac{\hat{A}_{i,s,g,t,k} - \mu_i}{\sigma_i}$$

  • Modifications: Multi-GPU sharding (using FSDP over 16–32 GPUs), adaptive stale-policy update frequency, and grouped-trajectory GRPO further improve stability and efficiency in large-scale deployments (Zhang et al., 5 Oct 2025).
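Cross-policy sampling and task-level advantage normalization can both be sketched in a few lines. The code is a toy under stated assumptions: policies are plain callables rather than LLMs, advantages are flattened into one list per task instead of being indexed by $(i, s, g, t, k)$, the population standard deviation is used for $\sigma_i$, and the small `eps` added to $\sigma_i$ is a numerical safeguard that does not appear in the formula. All function names are hypothetical.

```python
import random
import statistics


def cross_policy_rollout(policy_new, policy_old, state, step_fn, horizon, rng):
    """At each step draw m_t uniformly from {new, old}, then sample a_t from that policy."""
    trajectory = []
    for _ in range(horizon):
        source = rng.choice(["new", "old"])
        policy = policy_new if source == "new" else policy_old
        action = policy(state, rng)
        # The source is recorded so that importance weights can be computed later.
        trajectory.append((state, action, source))
        state = step_fn(state, action)
    return trajectory


def normalize_per_task(advantages_by_task, eps=1e-8):
    """Normalize the advantages of each task i to zero mean and unit variance."""
    normalized = {}
    for task, values in advantages_by_task.items():
        mu = statistics.fmean(values)
        sigma = statistics.pstdev(values)
        normalized[task] = [(value - mu) / (sigma + eps) for value in values]
    return normalized


rng = random.Random(0)
trajectory = cross_policy_rollout(
    policy_new=lambda state, rng: "click",    # toy stand-in for the current policy
    policy_old=lambda state, rng: "scroll",   # toy stand-in for the stale snapshot
    state=0,
    step_fn=lambda state, action: state + 1,  # toy environment transition
    horizon=6,
    rng=rng,
)
scaled = normalize_per_task({"webshop": [0.0, 1.0, 2.0], "os": [10.0, 30.0]})
```

After normalization the two tasks contribute advantages on the same scale even though their raw values differ by an order of magnitude, which is the motivation for normalizing per task in multi-task training.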

6. Infrastructure and Engineering Design

AutoGLM relies on robust underlying infrastructure for scalable, multi-task RL training:

  • Unified Function-Call API: Each task environment (e.g., ALFWorld, DB, OS, KG, WebShop) is accessed via a standardized call_tool(name, args) → (obs, reward, done) interface.
  • Containerized Environments: Each environment runs in an isolated Docker-like container, enabling massive horizontal scaling and resource isolation.
  • Centralized Controller: Orchestrates sample dispatch, trajectory collection, health/restart management, and exposes a gRPC/HTTP API for both actors and trainers. The design facilitates effortless integration of new environments and minimal engineering cost to add new multi-task scenarios (Zhang et al., 5 Oct 2025).
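The unified function-call interface can be illustrated with a small sketch. Only the `call_tool(name, args) → (obs, reward, done)` shape comes from the text; the `ToolEnvironment` class, its `register` method, the toy shopping tools, and the decision to return an error observation for unknown tools are hypothetical.

```python
from typing import Any, Callable

StepResult = tuple[Any, float, bool]  # (obs, reward, done)


class ToolEnvironment:
    """Hypothetical environment exposing every operation through call_tool."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., StepResult]] = {}

    def register(self, name: str, fn: Callable[..., StepResult]) -> None:
        self._tools[name] = fn

    def call_tool(self, name: str, args: dict[str, Any]) -> StepResult:
        if name not in self._tools:
            # Returning an error observation instead of raising is a design assumption.
            return f"unknown tool: {name}", 0.0, False
        return self._tools[name](**args)


def search(query: str) -> StepResult:
    """Toy tool: non-terminal step with no reward."""
    return f"3 results for {query!r}", 0.0, False


def buy(item_id: int) -> StepResult:
    """Toy tool: terminal step with outcome reward 1."""
    return f"purchased item {item_id}", 1.0, True


env = ToolEnvironment()
env.register("search", search)
env.register("buy", buy)

obs, reward, done = env.call_tool("buy", {"item_id": 7})
```

Since actors and trainers only depend on the `call_tool` signature, adding a new environment amounts to registering its tools behind the same interface.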

7. Empirical Results, Insights, and Limitations

AutoGLM achieves high empirical success rates on multi-turn, multi-task benchmarks. In the comparison below it has the highest average and leads on four of the five tasks; DeepSeek-R1 is ahead on OS (53.6% vs. 51.7%).

Key empirical results (Zhang et al., 5 Oct 2025, Liu et al., 2024):

| Task | GPT-5 | Claude-4 | DeepSeek-R1 | AutoGLM |
|---|---|---|---|---|
| ALFWorld | 65.4% | 69.0% | 51.4% | 94.5% |
| DB | 63.2% | 68.4% | 60.4% | 70.4% |
| KG | 64.1% | 64.4% | 50.2% | 77.0% |
| OS | 34.5% | 51.0% | 53.6% | 51.7% |
| WebShop | 33.7% | 38.3% | 31.0% | 58.6% |
| Average | 52.2% | 58.2% | 49.3% | 70.4% |

  • VAB-WebArena-Lite: AutoGLM 55.2% (SR), rising to 59.1% with a second attempt (baselines: GPT-4o 18.2%, WebPilot ~25%)
  • OpenTable: 96.2% (SR), outperforming GPT-4o (62.6%) and Agent Q (81.7%)
  • AndroidLab: 36.2% (SR), vs. GPT-4o 31.2%, Claude-3.5 29.0%
  • On common Chinese apps: 89.7% (SR) on physical Android devices
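The Average row of the comparison table is the unweighted mean of the five per-task scores, which can be checked directly. The snippet only re-uses the numbers reported in the table; the variable names are illustrative.

```python
# Per-task scores in the order ALFWorld, DB, KG, OS, WebShop (values from the table).
scores = {
    "GPT-5": [65.4, 63.2, 64.1, 34.5, 33.7],
    "Claude-4": [69.0, 68.4, 64.4, 51.0, 38.3],
    "DeepSeek-R1": [51.4, 60.4, 50.2, 53.6, 31.0],
    "AutoGLM": [94.5, 70.4, 77.0, 51.7, 58.6],
}

# Unweighted mean per model, rounded to one decimal as in the table.
averages = {
    model: round(sum(values) / len(values), 1)
    for model, values in scores.items()
}
# averages == {"GPT-5": 52.2, "Claude-4": 58.2, "DeepSeek-R1": 49.3, "AutoGLM": 70.4}
```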

Ablation and generalization:

  • Removing cross-policy sampling or task-advantage normalization leads to significant performance drops (−9.7% and −11%, respectively).
  • Multi-task AutoGLM matches best-in-class single-task specialist models (67.7% vs. 67.8% on aggregate benchmarks).
  • Demonstrates out-of-distribution generalization (+1.5 and +3.0 points on BFCL-v3 benchmark and its multi-turn subset).

Key insights:

  • Intermediate interface is critical for modularity, allowing independent upgrades of planning and grounding.
  • Self-evolving curriculum RL drives continual data diversification and agent robustness.
  • Scalability is enabled by infrastructure abstraction and containerization.

Limitations:

  • Absolute success remains below 100% for complex/ambiguous GUIs and cross-app workflows.
  • Grounding errors persist in dynamic UIs.
  • RL sample efficiency is bottlenecked by simulator throughput and mutation-based task quality.
  • Reward shaping is primarily outcome-based, limiting granularity of feedback (Liu et al., 2024).

Conclusion

AutoGLM combines modular symbolic planning-grounding, large-scale asynchronous RL, cross-policy and task-normalized optimization, and scalable systems engineering to achieve leading performance in autonomous GUI control. It provides a blueprint for deployable foundation agents and research prototypes, with extensibility to more expressive reward models, improved vision-language grounding, and broader domain coverage (Zhang et al., 5 Oct 2025, Liu et al., 2024).
