MAI-UI: Intelligent GUI Agent Framework
- MAI-UI is a real-world centric GUI agent framework characterized by adaptive interaction, collaborative learning, and integrated device–cloud processes.
- The framework leverages multimodal perception and unified agent policies to support on-device and cloud operations with lifelong online reinforcement learning.
- Benchmark results demonstrate MAI-UI’s competitive performance in GUI grounding, mobile navigation, and adaptive UI synthesis in dynamic, production-scale environments.
MAI-UI designates a family of real-world–centric, foundation-level GUI agents and methodologies for intelligent user interface control, synthesis, and grounding. These agents integrate multimodal perception, collaborative learning, online adaptation, and practical deployment architecture. Distinct MAI-UI components span from massive generalist models for device-cloud agentic collaboration to modular pipelines for interface synthesis, adaptive user modeling, and robust mobile UI navigation. The unification of these approaches addresses four central deployment challenges: enabling native agent-user dialogue, surmounting UI-only operational inefficiencies, realizing cross-device architectures with strong privacy/performance tradeoffs, and achieving resilience in dynamic environments through lifelong online learning (Zhou et al., 26 Dec 2025).
1. Core Principles and Motivations
The central goal of MAI-UI systems is to realize production-scale, foundation GUI agents capable of robust mobile and desktop interaction in highly dynamic, real-world situations. Unlike prior laboratory-bound GUI models, MAI-UI frameworks prioritize:
- Production readiness: Direct agent-user questions, consent, and refusals via explicit actions (e.g., ask_user(text), answer(text)).
- Tool-augmented execution: Integration of direct Model Context Protocol (MCP)/API calls to leapfrog long GUI flows.
- Device–cloud collaboration: Allocation of tasks across lightweight local agents and powerful cloud agents, optimizing for privacy, latency, and compute budget.
- Lifelong online reinforcement learning: Continuous self-improvement and adaptation in parallel, high-variance environments (Zhou et al., 26 Dec 2025).
2. Model Architecture and Deployment System
Model Family and Action Space
MAI-UI encompasses a spectrum of multimodal model variants (2B, 8B, 32B, 235B-A22B), supporting deployment scenarios from fully on-device operation to hybrid device–cloud workflows. Each model is trained to process visual state, interaction context, and high-level task objectives, issuing either granular GUI actions (clicks, swipes) or abstracted tool calls (mcp_call(tool_name, args)). Explicit user-interaction primitives (ask_user, answer) are positioned as first-class actions in the agent policy.
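A minimal sketch of how such a unified action space could be represented is given below; the type names are illustrative assumptions rather than the report's actual schema.

```python
from dataclasses import dataclass
from typing import Any

# Illustrative encoding of the unified action space described above.
# Type names are assumptions for exposition, not the report's schema:
# granular GUI actions, abstracted tool calls, and user-interaction
# primitives are all first-class actions the policy may emit.

@dataclass
class Click:        # granular GUI action
    x: int
    y: int

@dataclass
class Swipe:        # granular GUI action
    x0: int
    y0: int
    x1: int
    y1: int

@dataclass
class McpCall:      # abstracted tool call, e.g. mcp_call(tool_name, args)
    tool_name: str
    args: dict[str, Any]

@dataclass
class AskUser:      # native agent-user dialogue: question, consent, refusal
    text: str

@dataclass
class Answer:       # final answer surfaced to the user
    text: str

Action = Click | Swipe | McpCall | AskUser | Answer
```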
Architecture Table:
| Component | Description | Role in System |
|---|---|---|
| Local Agent (2B) | On-device, low-latency | Executes, monitors, privacy-guarded |
| Cloud Agent (32B/235B) | Device–cloud, high-capacity | Fault-recovery, context expansion |
| Unified Trajectory Memory | Shared across agents | Synchronizes state and context |
When a local agent’s policy detects drift or repetitive mistakes, and if privacy allows, control transitions seamlessly to the cloud agent. A unified memory abstraction preserves instruction, screenshots, and action traces, allowing context exchange and policy fallback (Zhou et al., 26 Dec 2025).
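A hedged sketch of this escalation logic follows; the agent and memory interfaces are assumed names used only to illustrate the handoff described above.

```python
# Illustrative device-cloud escalation loop. All interfaces here
# (detects_drift, privacy_ok_for_cloud, record_*, act, ...) are
# assumptions for exposition, not MAI-UI's actual API.

def run_task(instruction, env, local_agent, cloud_agent, memory, max_steps=50):
    """Run a task on the on-device agent, escalating to the cloud agent
    when drift or repeated mistakes are detected and privacy permits."""
    memory.record_instruction(instruction)
    agent = local_agent                       # start with the lightweight local agent
    for _ in range(max_steps):
        memory.record_screenshot(env.observe())

        # Handoff: the cloud agent resumes from the shared trajectory memory
        # (instruction, screenshots, action traces), so no context is lost.
        if agent is local_agent and agent.detects_drift(memory) \
                and memory.privacy_ok_for_cloud():
            agent = cloud_agent

        action = agent.act(memory)
        memory.record_action(action)
        if action.is_terminal():
            break
        env.step(action)
```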
3. Self-Evolving Data Pipeline and Learning Paradigm
MAI-UI’s data pipeline is structured as a closed loop interleaving task generation, multimodal trajectory synthesis, and rejection-sampled fine-tuning. Task seeds are sourced from app manuals, expert task lists, and filtered public datasets. At the trajectory-synthesis stage, episodes are collected from both manual annotation and agent rollouts under the current policy.
- Task expansion encompasses parameter variation (L1) and object swaps (L2) using multimodal LLM prompting.
- Trajectory review employs both human verification and "MLLM-as-judge" assessment for correctness, with the ability to splice correct sub-trajectories out of longer attempted demonstrations.
- Iterative rejection sampling formalizes the cycle with replay and incremental correction (Zhou et al., 26 Dec 2025).
SFT and Online RL Loop (simplified):
```python
# Iterative rejection sampling interleaved with SFT (simplified)
D = D_0                                        # initial fine-tuning dataset
for t in range(T):
    M = fine_tune(M, D)                        # supervised fine-tuning on current data
    D_RS = []                                  # rejection-sampled trajectories
    for i in I_expansion:                      # expanded task instructions
        traj = rollout(M, i)                   # roll out under the current policy
        good_prefix = judge.keep_prefix(traj)  # keep only the verified-correct prefix
        if good_prefix:
            D_RS.extend(good_prefix)
    D = D_RS + D_synthesis                     # next-round data: kept prefixes plus synthesized tasks
```
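The judge.keep_prefix step above can be read as prefix splicing: the judge walks the trajectory and keeps only the longest correct prefix. A minimal sketch, with step_is_correct standing in for the human/MLLM-as-judge check, is:

```python
# Sketch of prefix splicing for rejection sampling. `step_is_correct`
# stands in for the hybrid human / MLLM-as-judge verdict; its interface
# is an assumption, not the report's API.

def keep_prefix(trajectory, step_is_correct):
    """Return the longest prefix of steps judged correct, so that
    partially failed demonstrations still yield usable training data."""
    good_prefix = []
    for step in trajectory:
        if not step_is_correct(step):
            break                  # discard everything from the first error onward
        good_prefix.append(step)
    return good_prefix
```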
4. Online RL Framework and Robustness Strategies
MAI-UI’s RL framework is distinctive for its extreme parallelism and diversity in environment sampling:
- Parallelized training: From 32 to 512 concurrently virtualized Android containers, supporting reset and fault tolerance.
- Horizon scaling: Trajectory budgets up to 50 interactive steps (compared to earlier 15–30), facilitating handling of longer tasks and recovery from partial failure.
- Token-level surrogate objectives: Application of GRPO (Group Relative Policy Optimization) with clipped advantage ratios and tailored entropy bonuses for stable on-policy updates (a sketch follows this list).
- Curriculum learning: Automatic detection of task mastery and exploration frontiers, leveraging hybrid rule-based and LLM-based judges for pass/fail assessment.
- Empirical gains: Scaling parallel environments from 32 to 512 yields a +5.2 point improvement in AndroidWorld success rate; increasing step budget from 15 to 50 adds +4.3 points (Zhou et al., 26 Dec 2025).
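As referenced in the list above, a minimal numpy sketch of a GRPO-style token-level surrogate with group-relative advantages, clipping, and an entropy bonus is shown below; the exact objective, coefficients, and normalization used in MAI-UI's training stack are not detailed here, so treat this purely as an illustration.

```python
import numpy as np

# Illustrative GRPO-style surrogate: rewards across a group of rollouts for
# the same task are standardized into advantages, then a clipped per-token
# importance-ratio objective is applied, plus an entropy bonus.

def grpo_surrogate(logp_new, logp_old, group_rewards, token_entropy,
                   clip_eps=0.2, ent_coef=0.01):
    """
    logp_new, logp_old : (G, T) per-token log-probs under the new / behavior policy
    group_rewards      : (G,)   scalar reward for each rollout in the group
    token_entropy      : (G, T) per-token entropy of the new policy
    Returns a scalar loss to minimize.
    """
    # Group-relative advantage: standardize rewards within the rollout group.
    adv = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)
    adv = adv[:, None]                                    # broadcast over tokens

    ratio = np.exp(logp_new - logp_old)                   # per-token importance ratio
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    policy_loss = -np.minimum(unclipped, clipped).mean()  # clipped surrogate

    return policy_loss - ent_coef * token_entropy.mean()  # entropy bonus aids exploration
```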
5. Mobile GUI Grounding, Navigation, and Detection
MAI-UI agents are benchmarked across GUI grounding (e.g., ScreenSpot-Pro, OSWorld-G, UI-Vision) and mobile navigation suites (AndroidWorld, MobileWorld, GUI Odyssey). The 235B-A22B variant achieves the following:
- 73.5% on ScreenSpot-Pro (surpassing Gemini-3-Pro)
- 70.9% on OSWorld-G (surpassing Seed1.8)
- 49.2% on UI-Vision (exceeding Gemini-3-Pro, Seed1.8, UI-Venus-72B)
- 76.7% on AndroidWorld navigation (leading UI-Tars-2 and Gemini-2.5-Pro)
- 41.7% on MobileWorld real+ (far above Doubao, competitive with Gemini-3-Pro agentic frameworks) (Zhou et al., 26 Dec 2025)
This demonstrates that joint RL-powered, interactive, and tool-augmented agents can match or exceed modular planner–executor pipelines.
Benchmark Table:
| Benchmark | MAI-UI-2B | MAI-UI-32B | MAI-UI-235B | Best Proprietary | Best OSS |
|---|---|---|---|---|---|
| ScreenSpot-Pro | 57.4% | 67.9% | 73.5% | Gemini-3-Pro 72.7% | GUI-Owl-32B 58.0% |
| OSWorld-G | 52.0% | 67.6% | 70.9% | Seed1.5-VL 62.9% | OpenCUA-32B 59.6% |
| UI-Vision | 30.3% | 47.1% | 49.2% | Claude-3.7 8.3% | UI-Venus-72B 36.8% |
| AndroidWorld | 49.1% | 73.3% | 76.7% | UI-Tars-2 73.3% | – |
| MobileWorld | 24.9% | 41.7% | 41.7% | Doubao 20.9% | – |
6. Cross-Disciplinary Connections: Adaptive UIs, Detection, Synthesis
MAI-UI unifies and extends lines of work in multi-agent adaptive UIs, UI element detection, high-fidelity UI synthesis, and model-based adaptive UI assessment:
- MARL-based UI adaptivity: MARLUI frames UI adaptation as a fully cooperative Markov game, with hierarchical user and flat interface agents trained in simulation, enabling data-free, general-purpose adaptation policies that transfer to real users (Langerak et al., 2022).
- Element detection: Region-conditioned adaptive prompt tuning with joint vision–OCR features addresses mobile UI element categorization, achieving significant mAP gains, especially in open-vocabulary generalization (Gu et al., 2023).
- UI/UX synthesis and coherence: AutoGameUI introduces a two-stage multimodal pipeline for aligning independently designed UI and UX trees, using group cross-attention and integer programming, and formalizes a universal cross-platform protocol for portable design assets (Tang et al., 6 Nov 2024).
- Model-based adaptivity assessment: HMM-based frameworks rigorously enforce regularity, constancy, and progressivity in UI adaptation, employing explicit state/observation models and user-in-the-loop adaptation events, validated via practitioner studies (Sahraoui, 16 Dec 2024).
7. Limitations and Future Directions
MAI-UI reveals multiple open challenges:
- Physical-device fidelity and OS-level limitations: Current agents are predominantly trained and evaluated in virtualized Android containers; real-device constraints (sensors, connectivity) are less explored.
- Multi-modal extension: Fusing additional modalities (audio, haptics, external sensors) is an open research trajectory.
- Continual adaptation: Incremental support for updating to new UI versions or app changes without retraining from scratch remains limited.
- Broader domain integration: Extending tool-augmentation beyond MCP (e.g., to domain-specific SDKs) is a plausible avenue.
- Adaptive user modeling: Approaches to real-time, personalized adaptation—including multi-skill curricula and inverse RL for unannotated tasks—are active directions (Langerak et al., 2022, Zhou et al., 26 Dec 2025).
References
- MAI-UI Technical Report: Real-World Centric Foundation GUI Agents (Zhou et al., 26 Dec 2025)
- MARLUI: Multi-Agent Reinforcement Learning for Adaptive UIs (Langerak et al., 2022)
- AutoGameUI: Constructing High-Fidelity Game UIs via Multimodal Learning and Interactive Web-Based Tool (Tang et al., 6 Nov 2024)
- Mobile User Interface Element Detection Via Adaptively Prompt Tuning (Gu et al., 2023)
- A Model-based Approach to Assess Regular, Constant, and Progressive User Interface Adaptivity (Sahraoui, 16 Dec 2024)