MagicGUI-RMS: Self-Evolving GUI Agent Framework

Updated 26 January 2026
  • MagicGUI-RMS is a multi-agent reward model system that autonomously evolves GUI agents through adaptive trajectory evaluation and corrective feedback.
  • It employs a hierarchical framework combining domain-specific and general-purpose reward models to optimize action selection and drive continual policy refinement.
  • Empirical results show that iterative data reflux rounds yield 2–2.5% absolute improvements in step success, ensuring scalable and robust GUI operations.

MagicGUI-RMS is a multi-agent reward model system designed to enable self-evolving graphical user interface (GUI) agents through adaptive trajectory evaluation, corrective feedback, and automated data reflux processes. Addressing the key challenges of scalable agent evaluation and large-scale, high-fidelity data generation, MagicGUI-RMS achieves robust generalization and continual improvement in dynamic GUI environments via a hierarchical reward modeling framework, structured data pipelines, and iterative feedback loops (Li et al., 19 Jan 2026).

1. System Architecture and Functional Components

MagicGUI-RMS operates as a three-stage, multi-agent reward evaluation and feedback system:

Stage 1: UI Agent Action Proposal

At each decision step, the UI agent (policy $\pi_{\rm Agent}$) samples an action $a_{\rm pred}$ given the task instruction $x$, screen state $s$, and history $h_{1:t-1}$:

$$a_{\rm pred} = \pi_{\rm Agent}(x, s, h_{1:t-1})$$

Stage 2: Hierarchical Reward Evaluation

This stage sequentially applies two reward models:

  • Domain-Specific Reward Model (DS-RM): inputs $(x, s, a_{\rm pred}, h_{1:t-1})$; outputs a binary correctness judgment $y_{\rm DS}$, a rationale $r_{\rm DS}$, a corrected action $a_{\rm corr}$ (if necessary), and a correction rationale $r_{\rm corr}$.

  • General-Purpose Reward Model (GP-RM): inputs $(x, s, a_{\rm pred}, h_{1:t-1}, y_{\rm DS}, r_{\rm DS}, a_{\rm corr}, r_{\rm corr})$; produces a semantic validation $y_{\rm GP}$, a task-completion flag $e_{\rm GP}$, and an action-level preference score $s_{\rm GP}$.

Final action selection is performed by maximizing the GP-RM scoring function over both candidate actions:

$$a^* = \arg\max_{a\in\{a_{\rm pred},\,a_{\rm corr}\}} R_{\rm GP}(a \mid z_{\rm GP})$$

Stage 3: Dual Data-Reflux Loops

  • UI-Agent Data Reflux: The selected action $a^*$ is injected into the agent's training set as a high-quality label.
  • RMS Data Reflux: Steps where DS-RM and GP-RM disagree ($y_{\rm DS} \neq y_{\rm GP}$) are collected in a "hard" buffer for DS-RM fine-tuning.

This architecture allows for fine-grained action assessment while supporting continual, self-evolving learning via automated correction and feedback.
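The following Python sketch ties the three stages together for a single decision step. The component interfaces (`agent.propose`, `ds_rm.evaluate`, `gp_rm.assess`), the `DSJudgment` container, and the buffer handling are hypothetical stand-ins for the pipeline described above, not the released implementation.

```python
from dataclasses import dataclass

@dataclass
class DSJudgment:
    correct: bool                             # y_DS
    rationale: str                            # r_DS
    corrected_action: dict | None = None      # a_corr (only when incorrect)
    correction_rationale: str | None = None   # r_corr

def evaluate_step(agent, ds_rm, gp_rm, x, s, history, agent_buffer, hard_buffer):
    """One MagicGUI-RMS decision step (hypothetical component interfaces)."""
    # Stage 1: the UI agent proposes an action a_pred.
    a_pred = agent.propose(x, s, history)

    # Stage 2a: DS-RM judges correctness and suggests a correction if needed.
    ds = ds_rm.evaluate(x, s, a_pred, history)        # -> DSJudgment

    # Stage 2b: GP-RM semantically validates and scores each candidate action;
    # every assessment is (y_GP, e_GP, s_GP): validation, completion flag, score.
    candidates = [a_pred] if ds.correct else [a_pred, ds.corrected_action]
    assessments = [gp_rm.assess(x, s, a, history, ds) for a in candidates]

    # Final selection: a* = argmax over {a_pred, a_corr} of the GP-RM score s_GP.
    a_star = max(zip(candidates, assessments), key=lambda ca: ca[1][2])[0]

    # Stage 3: dual data reflux.
    agent_buffer.append((x, s, history, a_star))      # UI-agent reflux (training label)
    y_gp = assessments[0][0]                          # GP-RM's verdict on a_pred
    if y_gp != ds.correct:                            # DS-RM / GP-RM disagreement
        hard_buffer.append((x, s, a_pred, history, ds, y_gp))
    return a_star
```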

2. Structured Data Construction Pipeline

MagicGUI-RMS incorporates an automated pipeline for constructing a large, balanced reward dataset (MagicGUI-RMS-72K):

  • Rule-Based Verification:

Candidate actions are validated for type alignment, spatial validity, and semantic equivalence with the ground truth, contributing to the positive ($D^+$) or hard ($D^{hard}$) sets.

  • Structured Perturbations:

Easy negatives ($D^{easy}$) are synthesized through instruction substitution and trajectory stitching; moderate negatives ($D^{mid}$) via intention-centric grounding corrections from alternate OS-agents.

  • Intention-Centric Grounding Correction:

OS-agents generate alternative actions; those that match the ground-truth intent and can be repaired are added to $D^+$, while the remainder are included in $D^{mid}$.

The resulting dataset contains 38.9K positives, 6.8K easy, 11.5K mid, and 15.8K hard samples. This approach ensures balanced coverage and perturbation diversity, reducing annotation costs and scaling data for robust reward model training.
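As a concrete illustration of the rule-based verification step, the sketch below routes a candidate action into the positive or hard set. The action schema (dicts with `type`, `point`, `bbox`, `text` fields) and the specific checks are assumptions chosen for illustration, not the pipeline's exact rules.

```python
def rule_based_verify(candidate: dict, ground_truth: dict) -> str:
    """Route a candidate action into D+ (verified positive) or D_hard (hard negative)."""
    # 1. Type alignment: the action type must match the ground truth (e.g. CLICK vs TYPE).
    type_ok = candidate.get("type") == ground_truth.get("type")

    # 2. Spatial validity: a predicted point must fall inside the target element's bbox.
    spatial_ok = True
    if "point" in candidate and "bbox" in ground_truth:
        (x, y), (x0, y0, x1, y1) = candidate["point"], ground_truth["bbox"]
        spatial_ok = x0 <= x <= x1 and y0 <= y <= y1

    # 3. Semantic equivalence: e.g. identical text payloads for TYPE actions.
    semantic_ok = candidate.get("text") == ground_truth.get("text")

    return "D_plus" if (type_ok and spatial_ok and semantic_ok) else "D_hard"

# Example: a click inside the ground-truth element's bounding box is a positive.
gt = {"type": "CLICK", "bbox": (100, 200, 300, 260)}
print(rule_based_verify({"type": "CLICK", "point": (150, 230)}, gt))  # D_plus
print(rule_based_verify({"type": "SCROLL"}, gt))                      # D_hard
```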

3. Automated Feedback Reflux Mechanism

At every episode step, MagicGUI-RMS implements a dual-loop reflux process:

  1. The UI agent proposes $a_{\rm pred}$.
  2. DS-RM evaluates correctness, provides a rationale, and, if the action is incorrect, suggests $a_{\rm corr}$.
  3. GP-RM semantically validates and scores both actions, producing $y_{\rm GP}$ and $s_{\rm GP}$.
  4. The final action $a^*$ is returned as expert feedback for agent retraining.
  5. Disagreements trigger storage of hard cases for DS-RM optimization.

Through iterative reflux rounds, the agent policy and DS-RM co-evolve: agent performance is guided by GP-endorsed corrective supervision, while DS-RM is refined on disagreement cases. Empirical results show each feedback reflux round yields 2–2.5% absolute improvements in step success rates for both agent and DS-RM.
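Building on the `evaluate_step` sketch from Section 1, a round-level view of this co-evolution might look like the following; the `finetune` methods, the episode format, and the fixed-round schedule are assumptions rather than the paper's training recipe.

```python
def reflux_round(agent, ds_rm, gp_rm, episodes):
    """One data-reflux round: collect feedback online, then retrain agent and DS-RM."""
    agent_buffer, hard_buffer = [], []

    for x, screens in episodes:                 # each episode: instruction + screen states
        history = []
        for s in screens:
            a_star = evaluate_step(agent, ds_rm, gp_rm, x, s, history,
                                   agent_buffer, hard_buffer)
            history.append(a_star)

    # UI-agent reflux: GP-endorsed actions become supervised training labels.
    agent.finetune(agent_buffer)
    # RMS reflux: DS-RM is refined on DS-RM/GP-RM disagreement ("hard") cases.
    ds_rm.finetune(hard_buffer)

def self_evolve(agent, ds_rm, gp_rm, episodes, rounds=2):
    # Section 5 reports roughly 2-2.5 points of step-success gain per round.
    for _ in range(rounds):
        reflux_round(agent, ds_rm, gp_rm, episodes)
```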

4. Mathematical Objectives and Optimization

DS-RM Supervised Pre-training:

  • Binary classification (cross-entropy) and sequence generation losses for rationale and correction:

$$L_{\rm cls}(\theta) = -\left[\,y^*_{\rm DS}\log p_{\theta}(y_{\rm DS}=1 \mid z_{\rm DS}) + (1-y^*_{\rm DS})\log p_{\theta}(y_{\rm DS}=0 \mid z_{\rm DS})\,\right]$$

$$L_{\rm seq}(\theta) = -\sum_t \log p_{\theta}\big(r^*_{\rm DS}[t] \mid z_{\rm DS},\, r^*_{\rm DS}[{<}t]\big) - \sum_u \log p_{\theta}\big(a^*_{\rm corr}[u] \mid z_{\rm DS}\big) + \ldots$$

$$L_{\rm sup}(\theta) = L_{\rm cls}(\theta) + \lambda_{\rm seq}\, L_{\rm seq}(\theta) + \tfrac{\lambda}{2}\,\|\theta\|^2$$
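As a minimal PyTorch sketch, the combined supervised objective could be assembled as below; the tensor shapes, the token-level loss layout, and the explicit L2 term (rather than the optimizer's `weight_decay`) are assumptions.

```python
import torch
import torch.nn.functional as F

def ds_rm_supervised_loss(cls_logits, y_star, seq_logits, seq_targets,
                          params, lambda_seq=1.0, lam=1e-4):
    """L_sup = L_cls + lambda_seq * L_seq + (lam / 2) * ||theta||^2 (illustrative)."""
    # L_cls: binary cross-entropy on the correctness judgment y_DS.
    l_cls = F.binary_cross_entropy_with_logits(cls_logits, y_star.float())

    # L_seq: token-level negative log-likelihood for the rationale and
    # corrected-action sequences (padding positions masked with -100).
    l_seq = F.cross_entropy(seq_logits.view(-1, seq_logits.size(-1)),
                            seq_targets.view(-1), ignore_index=-100)

    # Explicit L2 regularization over the DS-RM parameters theta.
    l2 = sum((p ** 2).sum() for p in params)

    return l_cls + lambda_seq * l_seq + 0.5 * lam * l2
```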

Reward-Guided Reinforcement Fine-tuning:

Step rewards are assigned based on agreement between predictions and ground-truth:

$$R_{\rm DS}(y_{\rm DS}, y_{\rm GT}) = \begin{cases} +1.0 & y_{\rm DS}=y_{\rm GT}=1 \\ -0.5 & y_{\rm DS}=1,\ y_{\rm GT}=0 \\ -0.2 & y_{\rm DS}=0,\ y_{\rm GT}=1 \\ +1.0 & y_{\rm DS}=y_{\rm GT}=0 \end{cases}$$

Policy gradient update:

$$\theta \leftarrow \theta + \alpha\,\mathbb{E}\!\left[\nabla_\theta \log p_\theta(y_{\rm DS} \mid z_{\rm DS})\,(R_{\rm DS} - b)\right]$$
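The asymmetric step reward and the baseline-subtracted update follow directly from these definitions; the sketch below is a REINFORCE-style illustration, and the batch-mean baseline and log-probability interface are assumptions.

```python
import torch

def step_reward(y_ds: int, y_gt: int) -> float:
    """R_DS: agreement earns +1.0; false positives (-0.5) cost more than false negatives (-0.2)."""
    if y_ds == y_gt:
        return 1.0
    return -0.5 if y_ds == 1 else -0.2

def reinforce_update(optimizer, log_probs, y_ds, y_gt):
    """theta <- theta + alpha * E[grad log p(y_DS | z_DS) * (R_DS - b)]."""
    rewards = torch.tensor([step_reward(p, g) for p, g in zip(y_ds, y_gt)])
    baseline = rewards.mean()                               # simple batch baseline b
    loss = -(log_probs * (rewards - baseline)).mean()       # gradient ascent via negated loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```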

GP-RM Training:

GP-RM is fixed (GPT-4o evaluator) and not gradient-trained; supervised losses are given for reference but not used.

Overall Optimization:

$$\min_{\theta}\; L_{\rm sup}(\theta) - \beta\, J_{\rm RL}(\theta) \quad \text{(DS-RM)}, \qquad \min_{\psi}\; L_{\rm Agent}(\psi) \quad \text{(UI agent: supervised learning + data reflux)}$$

5. Empirical Evaluation and Ablation

Step-Level Agent Accuracy

| Model | AC-Low TM | AC-Low EM | AC-High TM | AC-High EM | MG-39k TM | MG-39k EM |
|---|---|---|---|---|---|---|
| UI-TARS-7B | 95.2 | 91.8 | 81.6 | 74.4 | 63.1 | 40.9 |
| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 70.2 | 32.0 |
| MagicGUI-Agent | 97.2 | 93.5 | 84.7 | 76.3 | 88.7 | 74.1 |

MagicGUI-Agent exceeds both baselines on every metric, with the largest margin on MG-39k exact match (+33.2 points over UI-TARS-7B).

Reward Model Discrimination Accuracy (MagicGUI-RMS-72K)

| Model | ALL Easy | ALL Mod. | ALL Hard |
|---|---|---|---|
| GPT-4o | 87.6 | 54.6 | 33.5 |
| Qwen2.5-VL-7B | 48.8 | 46.5 | 7.6 |
| MagicGUI-RMS | 93.6 | 96.1 | 68.0 |

MagicGUI-RMS provides substantial improvements, particularly on moderate (+41.5 points over GPT-4o) and hard (+34.5 points) discrimination.

Self-Evolution by Data-Reflux Rounds

| Round | MagicGUI-Agent ALL | DS-RM ALL |
|---|---|---|
| 0 | 74.1% | 73.6% |
| 1 | 76.6% (+2.5) | 76.5% |
| 2 | 78.6% (+2.0) | 78.3% |

Each round of reflux yields a 2–2.5% absolute improvement in success rate.

Ablation Studies

  • DS-RM delivers the largest single gain on moderate/hard cases.
  • GP-RM filters semantic errors on easy cases.
  • The combined DS+GP configuration enhances OOD robustness.

Explicit Operational Knowledge (EOK) injection into DS-RM yields near-elimination of hard-case failures, with accuracies up to 96.1% on ALL Hard.

6. Practical Implications and Limitations

Implications:

MagicGUI-RMS demonstrates that scalable, fine-grained reward supervision enables GUI agents to evolve autonomously, reducing the need for human annotation. The DS-RM/GP-RM decomposition balances rigid domain-specific validation and flexible semantic scoring, improving both reliability and interpretability of agent actions. The automated data pipeline and feedback reflux facilitate continual learning in practical, rapidly changing GUI settings.

Limitations:

  • DS-RM relies on manually engineered rule sets and EOK priors, presenting challenges when adapting to new platforms.
  • GP-RM is instantiated as a frozen, closed-source GPT-4o model; open-source alternatives may not deliver equivalent semantic discrimination.
  • The system currently employs supervised learning for UI agent updates instead of direct policy gradient reinforcement, potentially limiting exploration diversity.

7. Extensions and Prospective Directions

Proposed avenues for advancing MagicGUI-RMS include:

  1. Training open-source GP-RMs via meta-reinforcement learning to mitigate dependence on proprietary evaluators.
  2. Applying contrastive or margin-ranking losses to refine discrimination between subtle positive/hard negative pairs.
  3. Expanding EOK with differentiable neural "prerequisite-checkers" for dynamic multi-step reasoning.
  4. End-to-end policy-gradient RL integration for agents using learned reward signals to enhance exploration.
  5. Extending to web and desktop GUIs by extracting interaction rules from accessibility APIs.

Taken together, these components position MagicGUI-RMS as a comprehensive, scalable infrastructure for self-improving GUI agents, facilitating iterative policy refinement, granular trajectory evaluation, and actionable feedback across heterogeneous interaction domains (Li et al., 19 Jan 2026).
