MagicGUI-RMS: Self-Evolving GUI Agent Framework

Updated 26 January 2026
  • MagicGUI-RMS is a multi-agent reward model system that autonomously evolves GUI agents through adaptive trajectory evaluation and corrective feedback.
  • It employs a hierarchical framework combining domain-specific and general-purpose reward models to optimize action selection and drive continual policy refinement.
  • Empirical results show that iterative data reflux rounds yield 2–2.5% absolute improvements in step success, ensuring scalable and robust GUI operations.

MagicGUI-RMS is a multi-agent reward model system designed to enable self-evolving graphical user interface (GUI) agents through adaptive trajectory evaluation, corrective feedback, and automated data reflux processes. Addressing the key challenges of scalable agent evaluation and large-scale, high-fidelity data generation, MagicGUI-RMS achieves robust generalization and continual improvement in dynamic GUI environments via a hierarchical reward modeling framework, structured data pipelines, and iterative feedback loops (Li et al., 19 Jan 2026).

1. System Architecture and Functional Components

MagicGUI-RMS operates as a three-stage, multi-agent reward evaluation and feedback system:

Stage 1: UI Agent Action Proposal

At each decision step, the UI agent (policy $\pi_{\rm Agent}$) samples an action $a_{\rm pred}$ given the task instruction $x$, screen state $s$, and history $h_{1:t-1}$:

$$a_{\rm pred} = \pi_{\rm Agent}(x, s, h_{1:t-1})$$

Stage 2: Hierarchical Reward Evaluation

This stage sequentially applies two reward models:

  • Domain-Specific Reward Model (DS-RM): inputs $(x, s, a_{\rm pred}, h_{1:t-1})$; outputs a binary correctness judgment $y_{\rm DS}$, a rationale $r_{\rm DS}$, a corrected action $a_{\rm corr}$ (if necessary), and a correction rationale $r_{\rm corr}$.

  • General-Purpose Reward Model (GP-RM): inputs $(x, s, a_{\rm pred}, h_{1:t-1}, y_{\rm DS}, r_{\rm DS}, a_{\rm corr}, r_{\rm corr})$; produces a semantic validation $y_{\rm GP}$, a task-completion flag $e_{\rm GP}$, and an action-level preference score $s_{\rm GP}$.

Final action selection is performed by maximizing the GP-RM scoring function over both candidate actions:

$$a^* = \arg\max_{a\in\{a_{\rm pred},\,a_{\rm corr}\}} R_{\rm GP}(a \mid z_{\rm GP})$$

Stage 3: Dual Data-Reflux Loops

  • UI-Agent Data Reflux: The selected action $a^*$ is injected into the agent's training set as a high-quality label.
  • RMS Data Reflux: Steps where DS-RM and GP-RM disagree ($y_{\rm DS} \neq y_{\rm GP}$) are collected in a "hard" buffer for DS-RM fine-tuning.

This architecture allows for fine-grained action assessment while supporting continual, self-evolving learning via automated correction and feedback.
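The following Python sketch ties the three stages together for a single decision step. The component interfaces (`agent.propose`, `ds_rm.evaluate`, `gp_rm.assess`), the `DSJudgment` container, and the buffer handling are hypothetical stand-ins for the pipeline described above, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class DSJudgment:
    correct: bool                             # y_DS
    rationale: str                            # r_DS
    corrected_action: dict | None = None      # a_corr (only when incorrect)
    correction_rationale: str | None = None   # r_corr

def evaluate_step(agent, ds_rm, gp_rm, x, s, history, agent_buffer, hard_buffer):
    """One MagicGUI-RMS decision step (hypothetical component interfaces)."""
    # Stage 1: the UI agent proposes an action a_pred.
    a_pred = agent.propose(x, s, history)

    # Stage 2a: DS-RM judges correctness and suggests a correction if needed.
    ds = ds_rm.evaluate(x, s, a_pred, history)        # -> DSJudgment

    # Stage 2b: GP-RM semantically validates and scores each candidate action;
    # every assessment is (y_GP, e_GP, s_GP): validation, completion flag, score.
    candidates = [a_pred] if ds.correct else [a_pred, ds.corrected_action]
    assessments = [gp_rm.assess(x, s, a, history, ds) for a in candidates]

    # Final selection: a* = argmax over {a_pred, a_corr} of the GP-RM score s_GP.
    a_star = max(zip(candidates, assessments), key=lambda ca: ca[1][2])[0]

    # Stage 3: dual data reflux.
    agent_buffer.append((x, s, history, a_star))      # UI-agent reflux (training label)
    y_gp = assessments[0][0]                          # GP-RM's verdict on a_pred
    if y_gp != ds.correct:                            # DS-RM / GP-RM disagreement
        hard_buffer.append((x, s, a_pred, history, ds, y_gp))
    return a_star
```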

2. Structured Data Construction Pipeline

MagicGUI-RMS incorporates an automated pipeline for constructing a large, balanced reward dataset (MagicGUI-RMS-72K):

  • Rule-Based Verification:

Candidate actions are validated for type alignment, spatial validity, and semantic equivalence with the ground truth, contributing to the positive ($D^+$) or hard ($D^{hard}$) sets.

  • Structured Perturbations:

Easy negatives ($D^{easy}$) are synthesized through instruction substitution and trajectory stitching; moderate negatives ($D^{mid}$) via intention-centric grounding corrections from alternate OS-agents.

  • Intention-Centric Grounding Correction:

OS-agents generate alternative actions; those that match the ground-truth intent and can be repaired are added to $D^+$, while the remainder are included in $D^{mid}$.

The resulting dataset contains 38.9K positives, 6.8K easy, 11.5K mid, and 15.8K hard samples. This approach ensures balanced coverage and perturbation diversity, reducing annotation costs and scaling data for robust reward model training.
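As a concrete illustration of the rule-based verification step, the sketch below routes a candidate action into the positive or hard set. The action schema (dicts with `type`, `point`, `bbox`, `text` fields) and the specific checks are assumptions chosen for illustration, not the pipeline's exact rules.

```python
def rule_based_verify(candidate: dict, ground_truth: dict) -> str:
    """Route a candidate action into D+ (verified positive) or D_hard (hard negative)."""
    # 1. Type alignment: the action type must match the ground truth (e.g. CLICK vs TYPE).
    type_ok = candidate.get("type") == ground_truth.get("type")

    # 2. Spatial validity: a predicted point must fall inside the target element's bbox.
    spatial_ok = True
    if "point" in candidate and "bbox" in ground_truth:
        (x, y), (x0, y0, x1, y1) = candidate["point"], ground_truth["bbox"]
        spatial_ok = x0 <= x <= x1 and y0 <= y <= y1

    # 3. Semantic equivalence: e.g. identical text payloads for TYPE actions.
    semantic_ok = candidate.get("text") == ground_truth.get("text")

    return "D_plus" if (type_ok and spatial_ok and semantic_ok) else "D_hard"

# Example: a click inside the ground-truth element's bounding box is a positive.
gt = {"type": "CLICK", "bbox": (100, 200, 300, 260)}
print(rule_based_verify({"type": "CLICK", "point": (150, 230)}, gt))  # D_plus
print(rule_based_verify({"type": "SCROLL"}, gt))                      # D_hard
```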

3. Automated Feedback Reflux Mechanism

At every episode step, MagicGUI-RMS implements a dual-loop reflux process:

  1. The UI agent proposes $a_{\rm pred}$.
  2. DS-RM evaluates correctness, provides a rationale, and, if the action is incorrect, suggests $a_{\rm corr}$.
  3. GP-RM semantically validates and scores both actions, producing $y_{\rm GP}$ and $s_{\rm GP}$.
  4. The final action $a^*$ is returned as expert feedback for agent retraining.
  5. Disagreements trigger storage of hard cases for DS-RM optimization.

Through iterative reflux rounds, the agent policy and DS-RM co-evolve: agent performance is guided by GP-endorsed corrective supervision, while DS-RM is refined on disagreement cases. Empirical results show each feedback reflux round yields 2–2.5% absolute improvements in step success rates for both agent and DS-RM.
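Building on the `evaluate_step` sketch from Section 1, a round-level view of this co-evolution might look like the following; the `finetune` methods, the episode format, and the fixed-round schedule are assumptions rather than the paper's training recipe.

```python
def reflux_round(agent, ds_rm, gp_rm, episodes):
    """One data-reflux round: collect feedback online, then retrain agent and DS-RM."""
    agent_buffer, hard_buffer = [], []

    for x, screens in episodes:                 # each episode: instruction + screen states
        history = []
        for s in screens:
            a_star = evaluate_step(agent, ds_rm, gp_rm, x, s, history,
                                   agent_buffer, hard_buffer)
            history.append(a_star)

    # UI-agent reflux: GP-endorsed actions become supervised training labels.
    agent.finetune(agent_buffer)
    # RMS reflux: DS-RM is refined on DS-RM/GP-RM disagreement ("hard") cases.
    ds_rm.finetune(hard_buffer)

def self_evolve(agent, ds_rm, gp_rm, episodes, rounds=2):
    # Section 5 reports roughly 2-2.5 points of step-success gain per round.
    for _ in range(rounds):
        reflux_round(agent, ds_rm, gp_rm, episodes)
```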

4. Mathematical Objectives and Optimization

DS-RM Supervised Pre-training:

  • Binary classification (cross-entropy) and sequence generation losses for rationale and correction:

$$L_{\rm cls}(\theta) = -\left[\,y^*_{\rm DS}\log p_{\theta}(y_{\rm DS}=1 \mid z_{\rm DS}) + (1-y^*_{\rm DS})\log p_{\theta}(y_{\rm DS}=0 \mid z_{\rm DS})\,\right]$$

$$L_{\rm seq}(\theta) = -\sum_t \log p_{\theta}\big(r^*_{\rm DS}[t] \mid z_{\rm DS},\, r^*_{\rm DS}[{<}t]\big) - \sum_u \log p_{\theta}\big(a^*_{\rm corr}[u] \mid z_{\rm DS}\big) + \ldots$$

$$L_{\rm sup}(\theta) = L_{\rm cls}(\theta) + \lambda_{\rm seq}\, L_{\rm seq}(\theta) + \tfrac{\lambda}{2}\,\|\theta\|^2$$
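As a minimal PyTorch sketch, the combined supervised objective could be assembled as below; the tensor shapes, the token-level loss layout, and the explicit L2 term (rather than the optimizer's `weight_decay`) are assumptions.

```python
import torch
import torch.nn.functional as F

def ds_rm_supervised_loss(cls_logits, y_star, seq_logits, seq_targets,
                          params, lambda_seq=1.0, lam=1e-4):
    """L_sup = L_cls + lambda_seq * L_seq + (lam / 2) * ||theta||^2 (illustrative)."""
    # L_cls: binary cross-entropy on the correctness judgment y_DS.
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, y_star.float())

    # L_seq: token-level negative log-likelihood for the rationale and
    # corrected-action sequences (padding positions masked with -100).
    l_seq = F.cross_entropy(seq_logits.view(-1, seq_logits.size(-1)),
                            seq_targets.view(-1), ignore_index=-100)

    # Explicit L2 regularization over the DS-RM parameters theta.
    l2 = sum((p ** 2).sum() for p in params)

    return l_cls + lambda_seq * l_seq + 0.5 * lam * l2
```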

Reward-Guided Reinforcement Fine-tuning:

Step rewards are assigned based on agreement between predictions and ground-truth:

$$R_{\rm DS}(y_{\rm DS}, y_{\rm GT}) = \begin{cases} +1.0 & y_{\rm DS}=y_{\rm GT}=1 \\ -0.5 & y_{\rm DS}=1,\ y_{\rm GT}=0 \\ -0.2 & y_{\rm DS}=0,\ y_{\rm GT}=1 \\ +1.0 & y_{\rm DS}=y_{\rm GT}=0 \end{cases}$$

Policy gradient update:

$$\theta \leftarrow \theta + \alpha\,\mathbb{E}\!\left[\nabla_\theta \log p_\theta(y_{\rm DS} \mid z_{\rm DS})\,(R_{\rm DS} - b)\right]$$
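The asymmetric step reward and the baseline-subtracted update follow directly from these definitions; the sketch below is a REINFORCE-style illustration, and the batch-mean baseline and log-probability interface are assumptions.

```python
import torch

def step_reward(y_ds: int, y_gt: int) -> float:
    """R_DS: agreement earns +1.0; false positives (-0.5) cost more than false negatives (-0.2)."""
    if y_ds == y_gt:
        return 1.0
    return -0.5 if y_ds == 1 else -0.2

def reinforce_update(optimizer, log_probs, y_ds, y_gt):
    """theta <- theta + alpha * E[grad log p(y_DS | z_DS) * (R_DS - b)]."""
    rewards = torch.tensor([step_reward(p, g) for p, g in zip(y_ds, y_gt)])
    baseline = rewards.mean()                               # simple batch baseline b
    loss = -(log_probs * (rewards - baseline)).mean()       # gradient ascent via negated loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```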

GP-RM Training:

GP-RM is fixed (GPT-4o evaluator) and not gradient-trained; supervised losses are given for reference but not used.

Overall Optimization:

$$\min_{\theta}\; L_{\rm sup}(\theta) - \beta\, J_{\rm RL}(\theta) \quad \text{(DS-RM)}, \qquad \min_{\psi}\; L_{\rm Agent}(\psi) \quad \text{(UI agent: supervised learning + data reflux)}$$

5. Empirical Evaluation and Ablation

Step-Level Agent Accuracy

| Model | AC-Low TM | AC-Low EM | AC-High TM | AC-High EM | MG-39k TM | MG-39k EM |
|---|---|---|---|---|---|---|
| UI-TARS-7B | 95.2 | 91.8 | 81.6 | 74.4 | 63.1 | 40.9 |
| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 70.2 | 32.0 |
| MagicGUI-Agent | 97.2 | 93.5 | 84.7 | 76.3 | 88.7 | 74.1 |

MagicGUI-Agent exceeds both baselines on every metric, with the largest margin on MG-39k exact match (+33.2 points over UI-TARS-7B).

Reward Model Discrimination Accuracy (MagicGUI-RMS-72K)

| Model | ALL Easy | ALL Mod. | ALL Hard |
|---|---|---|---|
| GPT-4o | 87.6 | 54.6 | 33.5 |
| Qwen2.5-VL-7B | 48.8 | 46.5 | 7.6 |
| MagicGUI-RMS | 93.6 | 96.1 | 68.0 |

MagicGUI-RMS provides substantial improvements, particularly on moderate (+41.5 points over GPT-4o) and hard (+34.5 points) discrimination.

Self-Evolution by Data-Reflux Rounds

| Round | MagicGUI-Agent ALL | DS-RM ALL |
|---|---|---|
| 0 | 74.1% | 73.6% |
| 1 | 76.6% (+2.5) | 76.5% |
| 2 | 78.6% (+2.0) | 78.3% |

Each round of reflux yields a 2–2.5% absolute improvement in success rate.

Ablation Studies

  • DS-RM delivers the largest single gain on moderate/hard cases.
  • GP-RM filters semantic errors on easy cases.
  • The combined DS+GP configuration enhances OOD robustness.

Explicit Operational Knowledge (EOK) injection into DS-RM yields near-elimination of hard-case failures, with accuracies up to 96.1% on ALL Hard.

6. Practical Implications and Limitations

Implications:

MagicGUI-RMS demonstrates that scalable, fine-grained reward supervision enables GUI agents to evolve autonomously, reducing the need for human annotation. The DS-RM/GP-RM decomposition balances rigid domain-specific validation and flexible semantic scoring, improving both reliability and interpretability of agent actions. The automated data pipeline and feedback reflux facilitate continual learning in practical, rapidly changing GUI settings.

Limitations:

  • DS-RM relies on manually engineered rule sets and EOK priors, presenting challenges when adapting to new platforms.
  • GP-RM is instantiated as a frozen, closed-source GPT-4o model; open-source alternatives may not deliver equivalent semantic discrimination.
  • The system currently employs supervised learning for UI agent updates instead of direct policy gradient reinforcement, potentially limiting exploration diversity.

7. Extensions and Prospective Directions

Proposed avenues for advancing MagicGUI-RMS include:

  1. Training open-source GP-RMs via meta-reinforcement learning to mitigate dependence on proprietary evaluators.
  2. Applying contrastive or margin-ranking losses to refine discrimination between subtle positive/hard negative pairs.
  3. Expanding EOK with differentiable neural "prerequisite-checkers" for dynamic multi-step reasoning.
  4. End-to-end policy-gradient RL integration for agents using learned reward signals to enhance exploration.
  5. Extending to web and desktop GUIs by extracting interaction rules from accessibility APIs.

Taken together, these components position MagicGUI-RMS as a comprehensive, scalable infrastructure for self-improving GUI agents, facilitating iterative policy refinement, granular trajectory evaluation, and actionable feedback across heterogeneous interaction domains (Li et al., 19 Jan 2026).
