MagicGUI-RMS: Self-Evolving GUI Agent Framework
- MagicGUI-RMS is a multi-agent reward model system that autonomously evolves GUI agents through adaptive trajectory evaluation and corrective feedback.
- It employs a hierarchical framework combining domain-specific and general-purpose reward models to optimize action selection and drive continual policy refinement.
- Empirical results show that iterative data-reflux rounds yield 2–2.5% absolute improvements in step success rate, supporting scalable and robust GUI operation.
MagicGUI-RMS is a multi-agent reward model system designed to enable self-evolving graphical user interface (GUI) agents through adaptive trajectory evaluation, corrective feedback, and automated data reflux processes. Addressing the key challenges of scalable agent evaluation and large-scale, high-fidelity data generation, MagicGUI-RMS achieves robust generalization and continual improvement in dynamic GUI environments via a hierarchical reward modeling framework, structured data pipelines, and iterative feedback loops (Li et al., 19 Jan 2026).
1. System Architecture and Functional Components
MagicGUI-RMS operates as a three-stage, multi-agent reward evaluation and feedback system:
Stage 1: UI Agent Action Proposal
At each decision step, the UI agent (policy $\pi_\theta$) samples an action $a_t$ given the task instruction $I$, screen state $s_t$, and interaction history $h_t$:

$a_t \sim \pi_\theta(\cdot \mid I, s_t, h_t)$
Stage 2: Hierarchical Reward Evaluation
This stage sequentially applies two reward models:
- Domain-Specific Reward Model (DS-RM):
Inputs $z_{\rm DS} = (I, s_t, h_t, a_t)$; outputs a binary correctness verdict $y_{\rm DS}$, a rationale $r_{\rm DS}$, a corrected action $a_{\rm corr}$ (if necessary), and a correction rationale $r_{\rm corr}$.
- General-Purpose Reward Model (GP-RM):
Inputs $(I, s_t, h_t, a_t, a_{\rm corr})$; produces a semantic validation verdict, a task-completion flag, and an action-level preference score $S_{\rm GP}(\cdot)$.
Final action selection is performed by maximizing the GP-RM scoring function over both candidate actions:

$a_t^{\rm final} = \arg\max_{a \in \{a_t,\, a_{\rm corr}\}} S_{\rm GP}(a \mid I, s_t, h_t)$
Stage 3: Dual Data-Reflux Loops
- UI-Agent Data Reflux: The selected action $a_t^{\rm final}$ is injected into the agent’s training set as a high-quality label.
- RMS Data Reflux: Cases where the DS-RM and GP-RM disagree are collected in a "hard" buffer for DS-RM fine-tuning.
This architecture allows for fine-grained action assessment while supporting continual, self-evolving learning via automated correction and feedback.
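A minimal Python sketch of one decision step under this architecture is given below. The object and method names (`propose`, `evaluate`, `score`, the buffer lists) are illustrative placeholders rather than the paper's implementation; only the control flow follows Stages 1–3 as described above.

```python
# Illustrative sketch of one MagicGUI-RMS decision step (Stages 1-3).
# All object and method names are hypothetical placeholders; only the
# control flow mirrors the three-stage description above.

def decision_step(ui_agent, ds_rm, gp_rm, instruction, state, history,
                  agent_buffer, hard_buffer):
    # Stage 1: the UI agent proposes an action a_t ~ pi_theta(. | I, s_t, h_t).
    a_t = ui_agent.propose(instruction, state, history)

    # Stage 2a: the DS-RM checks correctness and, if necessary, proposes a
    # corrected action together with a rationale.
    ds_out = ds_rm.evaluate(instruction, state, history, a_t)
    candidates = [a_t]
    if not ds_out["correct"] and ds_out.get("corrected_action") is not None:
        candidates.append(ds_out["corrected_action"])

    # Stage 2b: the GP-RM scores every candidate; the final action maximizes
    # the GP-RM preference score S_GP.
    scored = [(a, gp_rm.score(instruction, state, history, a)) for a in candidates]
    a_final, _ = max(scored, key=lambda pair: pair[1])

    # Stage 3a: UI-agent data reflux - the selected action becomes a training label.
    agent_buffer.append((instruction, state, history, a_final))

    # Stage 3b: RMS data reflux - DS-RM / GP-RM disagreements become hard cases.
    gp_prefers_original = a_final is a_t
    if ds_out["correct"] != gp_prefers_original:
        hard_buffer.append((instruction, state, history, a_t, ds_out, scored))

    return a_final
```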
2. Structured Data Construction Pipeline
MagicGUI-RMS incorporates an automated pipeline for constructing a large, balanced reward dataset (MagicGUI-RMS-72K):
- Rule-Based Verification:
Candidate actions are validated for type alignment, spatial validity, and semantic equivalence with the ground truth, contributing samples to the positive or hard sets.
- Structured Perturbations:
Easy negatives are synthesized through instruction substitution and trajectory stitching; moderate negatives are produced via intention-centric grounding corrections from alternate OS-agents.
- Intention-Centric Grounding Correction:
OS-agents generate alternative actions, which are added to the moderate-negative set unless they match the ground-truth intent and can be repaired, in which case they are added to the positive set.
The resulting dataset contains 38.9K positive, 6.8K easy-negative, 11.5K moderate-negative, and 15.8K hard samples. This approach ensures balanced coverage and perturbation diversity, reducing annotation cost while scaling the data for robust reward-model training.
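The bucketing logic of this pipeline can be summarized with a short sketch. The action encoding, predicates, and thresholds below are simplified, hypothetical stand-ins for the paper's rule set and grounding corrections; only the routing into the four splits follows the description above.

```python
# Illustrative bucketing of candidate actions into the four reward-dataset splits.
# The action format, predicates, and thresholds are simplified assumptions.

def type_matches(cand, gt):
    return cand["type"] == gt["type"]

def spatially_valid(cand, gt, tol=0.05):
    # A click is treated as spatially valid if it lands within a small
    # normalized-coordinate tolerance of the ground-truth target.
    if cand["type"] != "click":
        return True
    return abs(cand["x"] - gt["x"]) <= tol and abs(cand["y"] - gt["y"]) <= tol

def semantically_equivalent(cand, gt):
    return cand.get("text", "") == gt.get("text", "")

def bucket_sample(candidate, ground_truth, source, matches_intent=False, repairable=False):
    """Assign a candidate action to 'positive', 'easy', 'moderate', or 'hard'."""
    if source == "rule_verified":
        ok = (type_matches(candidate, ground_truth)
              and spatially_valid(candidate, ground_truth)
              and semantically_equivalent(candidate, ground_truth))
        return "positive" if ok else "hard"
    if source == "perturbation":
        # Instruction substitution / trajectory stitching yield easy negatives.
        return "easy"
    if source == "os_agent":
        # Alternate OS-agent actions: intent-matching, repairable ones become
        # positives; the remainder are moderate negatives.
        return "positive" if (matches_intent and repairable) else "moderate"
    raise ValueError(f"unknown source: {source}")

# Example: a click whose type matches but whose coordinates are far off is 'hard'.
gt = {"type": "click", "x": 0.40, "y": 0.62}
cand = {"type": "click", "x": 0.90, "y": 0.10}
print(bucket_sample(cand, gt, source="rule_verified"))  # -> 'hard'
```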
3. Automated Feedback Reflux Mechanism
At every episode step, MagicGUI-RMS implements a dual-loop reflux process:
- The UI agent proposes $a_t$.
- The DS-RM evaluates correctness, provides a rationale, and, if the action is judged incorrect, suggests a corrected action $a_{\rm corr}$.
- The GP-RM semantically validates and scores both candidates, producing $S_{\rm GP}(a_t)$ and $S_{\rm GP}(a_{\rm corr})$.
- The final action $a_t^{\rm final}$ is returned as expert feedback for agent retraining.
- Disagreements trigger storage of hard cases for DS-RM optimization.
Through iterative reflux rounds, the agent policy and DS-RM co-evolve: agent performance is guided by GP-endorsed corrective supervision, while DS-RM is refined on disagreement cases. Empirical results show each feedback reflux round yields 2–2.5% absolute improvements in step success rates for both agent and DS-RM.
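The round-level co-evolution can be sketched as an outer loop over reflux rounds; the method names (`finetune_on`) and the `run_episode` callback below are hypothetical placeholders for the training and rollout routines.

```python
# Illustrative outer loop over data-reflux rounds: the UI agent is retrained on
# GP-endorsed labels while the DS-RM is fine-tuned on accumulated disagreements.
# Method and callback names are hypothetical placeholders.

def self_evolve(ui_agent, ds_rm, gp_rm, tasks, run_episode, num_rounds=2):
    """run_episode(ui_agent, ds_rm, gp_rm, task, agent_buf, hard_buf) should
    execute one episode using the per-step dual-reflux procedure above."""
    for _ in range(num_rounds):
        agent_buffer, hard_buffer = [], []

        # Collect refluxed labels and disagreement cases over all tasks.
        for task in tasks:
            run_episode(ui_agent, ds_rm, gp_rm, task, agent_buffer, hard_buffer)

        # UI-agent reflux: supervised update on GP-endorsed corrective labels.
        ui_agent.finetune_on(agent_buffer)

        # RMS reflux: the DS-RM is refined on the collected "hard" disagreement cases.
        ds_rm.finetune_on(hard_buffer)

    return ui_agent, ds_rm
```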
4. Mathematical Objectives and Optimization
DS-RM Supervised Pre-training:
- Binary classification (cross-entropy) and sequence generation losses for rationale and correction:
$L_{\rm seq}(\theta) = -\sum_{t} \log p_{\theta}\!\big(r^{*}_{\rm DS}[t] \,\big|\, z_{\rm DS},\, r^{*}_{\rm DS}[{<}t]\big) \;-\; \sum_{u} \log p_{\theta}\!\big(a^{*}_{\rm corr}[u] \,\big|\, z_{\rm DS},\, a^{*}_{\rm corr}[{<}u]\big) \;+\; \ldots$
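A minimal PyTorch-style sketch of the combined supervised objective is given below, assuming a 2-way correctness head, a token-level decoder head with ignored positions marked by `-100`, and an illustrative weighting `lam`; none of these choices are taken from the paper.

```python
# Sketch of the DS-RM supervised loss: binary correctness cross-entropy plus
# token-level NLL for the generated rationale / corrected action.
# Tensor shapes and the weighting `lam` are illustrative assumptions.
import torch
import torch.nn.functional as F

def dsrm_supervised_loss(cls_logits, y_star, seq_logits, seq_targets, lam=1.0):
    # cls_logits: (B, 2) correctness logits; y_star: (B,) labels in {0, 1}.
    l_cls = F.cross_entropy(cls_logits, y_star)
    # seq_logits: (B, T, V) decoder logits; seq_targets: (B, T) token ids,
    # with -100 marking positions (e.g. padding) excluded from the loss.
    l_seq = F.cross_entropy(
        seq_logits.reshape(-1, seq_logits.size(-1)),
        seq_targets.reshape(-1),
        ignore_index=-100,
    )
    return l_cls + lam * l_seq
```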
Reward-Guided Reinforcement Fine-tuning:
Step rewards are assigned based on agreement between the DS-RM's predictions and the ground truth, and the DS-RM is then updated with a policy-gradient objective (a plausible form is sketched below).
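A plausible reconstruction of these quantities, assuming a binary agreement reward and a REINFORCE-style update (illustrative notation, not the paper's exact formulas):

$R_t = \mathbf{1}\!\left[\hat{y}_{\rm DS} = y^{*}\right], \qquad \nabla_{\theta} J(\theta) = \mathbb{E}\!\left[\, R_t \, \nabla_{\theta} \log p_{\theta}\!\left(\hat{y}_{\rm DS}, \hat{r}_{\rm DS}, \hat{a}_{\rm corr} \mid z_{\rm DS}\right) \right]$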
GP-RM Training:
GP-RM is fixed (GPT-4o evaluator) and not gradient-trained; supervised losses are given for reference but not used.
Overall Optimization: the supervised pre-training and reward-guided fine-tuning objectives jointly optimize the DS-RM, while the UI agent is updated on refluxed expert labels (a plausible combined objective is sketched below).
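One plausible form of the combined objective, with weights $\lambda$ and $\beta$ assumed purely for illustration:

$\min_{\theta}\; L_{\rm cls}(\theta) + \lambda\, L_{\rm seq}(\theta) - \beta\, J(\theta)$

where $L_{\rm cls}$ is the binary correctness cross-entropy, $L_{\rm seq}$ the sequence loss above, and $J$ the reward-guided objective.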
5. Empirical Evaluation and Ablation
Step-Level Agent Accuracy
| Model | AC-Low TM | AC-Low EM | AC-High TM | AC-High EM | MG-39k TM | MG-39k EM |
|---|---|---|---|---|---|---|
| UI-TARS-7B | 95.2 | 91.8 | 81.6 | 74.4 | 63.1 | 40.9 |
| Qwen2.5-VL-7B | 94.1 | 85.0 | 75.1 | 62.9 | 70.2 | 32.0 |
| MagicGUI-Agent | **97.2** | **93.5** | **84.7** | **76.3** | **88.7** | **74.1** |
MagicGUI-Agent exceeds both baselines on every measure, with the largest margin on MG-39k exact match (+33.2 points over the strongest baseline).
Reward Model Discrimination Accuracy (MagicGUI-RMS-72K)
| Model | ALL Easy | ALL Mod. | ALL Hard |
|---|---|---|---|
| GPT-4o | 87.6 | 54.6 | 33.5 |
| Qwen2.5-VL-7B | 48.8 | 46.5 | 7.6 |
| MagicGUI-RMS | **93.6** | **96.1** | **68.0** |
MagicGUI-RMS provides substantial improvements, particularly on moderate and hard discrimination.
Self-Evolution by Data-Reflux Rounds
| Round | MagicGUI-Agent ALL | DS-RM ALL |
|---|---|---|
| 0 | 74.1% | 73.6% |
| 1 | 76.6% (+2.5) | 76.5% |
| 2 | 78.6% (+2.0) | 78.3% |
Each round of reflux yields a 2–2.5% absolute improvement in success rate.
Ablation Studies
- DS-RM delivers the largest single gain on moderate/hard cases.
- GP-RM filters semantic errors on easy cases.
- The combined DS+GP configuration enhances OOD robustness.
Explicit Operational Knowledge (EOK) injection into DS-RM yields near-elimination of hard-case failures, with accuracies up to 96.1% on ALL Hard.
6. Practical Implications and Limitations
Implications:
MagicGUI-RMS demonstrates that scalable, fine-grained reward supervision enables GUI agents to evolve autonomously, reducing the need for human annotation. The DS-RM/GP-RM decomposition balances rigid domain-specific validation and flexible semantic scoring, improving both reliability and interpretability of agent actions. The automated data pipeline and feedback reflux facilitate continual learning in practical, rapidly changing GUI settings.
Limitations:
- DS-RM relies on manually engineered rule sets and EOK priors, presenting challenges when adapting to new platforms.
- GP-RM is instantiated as a frozen, closed-source GPT-4o model; open-source alternatives may not deliver equivalent semantic discrimination.
- The system currently employs supervised learning for UI agent updates instead of direct policy gradient reinforcement, potentially limiting exploration diversity.
7. Extensions and Prospective Directions
Proposed avenues for advancing MagicGUI-RMS include:
- Training open-source GP-RMs via meta-reinforcement learning to mitigate dependence on proprietary evaluators.
- Applying contrastive or margin-ranking losses to refine discrimination between subtle positive/hard negative pairs.
- Expanding EOK with differentiable neural "prerequisite-checkers" for dynamic multi-step reasoning.
- End-to-end policy-gradient RL integration for agents using learned reward signals to enhance exploration.
- Extending to web and desktop GUIs by extracting interaction rules from accessibility APIs.
Taken together, these components position MagicGUI-RMS as a comprehensive, scalable infrastructure for self-improving GUI agents, facilitating iterative policy refinement, granular trajectory evaluation, and actionable feedback across heterogeneous interaction domains (Li et al., 19 Jan 2026).