MobileRL: RL for Mobile Agents & Resource Optimization

Updated 2 July 2026

MobileRL covers reinforcement learning methods designed for mobile environments, where rewards are sparse and state spaces are dynamic and high-dimensional.
It uses specialized MDP formulations and algorithms such as GRPO and multi-objective PPO for GUI automation and resource scheduling.
Reported benchmark results show higher success rates, better transfer performance, and better resource optimization than the baselines each work compares against.

MobileRL refers to a class of reinforcement learning (RL) techniques designed for agentic decision-making, automation, and resource scheduling in mobile environments. The scope of MobileRL encompasses vision-language-model (VLM) agents for mobile GUI automation, multi-objective RL for resource management in edge/mobile computing, and the infrastructure, algorithms, and benchmarks that underpin generalization and sim-to-real transfer in these settings. The field addresses challenges of sparse and delayed rewards, heavy-tailed task difficulty, dynamic observation modalities, and hardware constraints intrinsic to mobile platforms.

1. Formal Problem Classes in MobileRL

MobileRL instances are formalized as Markov decision processes (MDPs), contextual MDPs (CMDPs), and multi-objective MDPs (MOMDPs), depending on the scenario.

Mobile GUI Agents: The environment is a finite-horizon MDP with structured, often high-dimensional state spaces:
- States ( $S$ ): Full environment state—typically a screenshot/image, structured metadata (JSON), and recurrent histories of prior actions and observations (Wu et al., 25 May 2026, Xu et al., 10 Sep 2025, Gu et al., 8 Mar 2026).
- Actions ( $A$ ): Parameterized or discrete sets of GUI operations (e.g., touch, type, swipe, click) (Xu et al., 10 Sep 2025, Wu et al., 25 May 2026, Gu et al., 25 Jun 2025).
- Transitions ( $P$ ): Deterministic or stochastic, following emulator or browser-hosted simulator logic.
- Rewards ( $R$ ): Sparse, terminal, or progress-shaped, often derived from programmatic judges in the JSON state or vision-based reward models (Wu et al., 25 May 2026, Xu et al., 10 Sep 2025, Gu et al., 25 Jun 2025).
- Objective: Maximize the expected sum of (possibly discounted) rewards, $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{T} \gamma^t r_t\right]$.
Contextual MDPs (CMDPs): The context $c$ encodes task instances or app templates, leading to a CMDP that measures agent generalization over a distribution of unseen contexts (Gu et al., 8 Mar 2026).
Multi-Objective RL (MOMDP): State and action spaces describe task offloading/scheduling in mobile edge computing; the vector reward consists of objectives such as latency and energy, and the goal is to learn a Pareto front of policies under unknown or user-specified preference weights (Yang et al., 2023, Xu et al., 16 Oct 2025).
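
As a small illustration of the objective $J(\pi)$ above, the following sketch estimates it by Monte Carlo from sampled episodes. The function names and the example reward sequences are assumptions made for illustration and are not taken from any of the cited systems.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Return sum_t gamma^t * r_t for one finite-horizon episode."""
    total = 0.0
    # Accumulate backwards so each step applies one multiplication by gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(episodes: Sequence[Sequence[float]], gamma: float = 1.0) -> float:
    """Monte Carlo estimate of J(pi): the mean discounted return over episodes."""
    returns = [discounted_return(ep, gamma) for ep in episodes]
    return sum(returns) / len(returns)


# Hypothetical sparse-reward episodes: reward 1 only on a successful final step.
episodes = [
    [0.0, 0.0, 0.0, 1.0],  # success after four steps
    [0.0, 0.0, 0.0, 0.0],  # failure
]
j_hat = estimate_objective(episodes, gamma=0.9)
```

With a terminal-only reward and $\gamma < 1$, shorter successful episodes receive a larger return, which is one reason discounting interacts with the trajectory-length considerations discussed for GUI agents.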

2. Core Algorithms and Training Frameworks

2.1. Group Relative Policy Optimization and Extensions

Most MobileRL systems rely on trajectory-level policy-gradient algorithms:

GRPO: Computes normalized trajectory-level advantages within policy update groups. The GRPO loss penalizes off-policy divergence with KL or ratio-based clipping and supports parallel sampling via group-based rollouts (Xu et al., 10 Sep 2025, Gu et al., 8 Mar 2026, Gu et al., 25 Jun 2025).
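
The group-normalized advantage and ratio clipping described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the population standard deviation for normalization, a clipping range of 0.2, and scalar per-trajectory ratios; individual MobileRL systems may differ on each of these choices, and the KL penalty is omitted.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(returns: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize trajectory returns within one rollout group (same task)."""
    mu = mean(returns)
    sigma = pstdev(returns)  # assumption: population std; some code uses sample std
    return [(r - mu) / (sigma + eps) for r in returns]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective term for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts of the same task with binary terminal rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same return carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

Because advantages are computed relative to the group mean, no learned value function is needed, and groups in which all rollouts succeed or all fail yield zero advantages.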
Difficulty-ADAptive GRPO (ADAGRPO): Extends GRPO for heavy-tailed task distributions via:
- Shortest-Path Reward Adjustment (SPA): Penalizes trajectory length among successful rollouts, favoring efficient solutions.
- Difficulty-Adaptive Positive Replay (AdaPR): Buffers high-advantage rollouts for positive replay.
- Failure Curriculum Filtering (FCF): Reduces computation on persistently unsolvable tasks (Xu et al., 10 Sep 2025).
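
To make two of these components concrete, the sketch below shows one possible form of a length-penalized success reward (in the spirit of SPA) and a failure-based task filter (in the spirit of FCF). The exact formulas, the `alpha` coefficient, the `max_failed_rounds` threshold, and all class and function names are assumptions for illustration; they are not the definitions used by Xu et al. AdaPR is not covered here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Rollout:
    task_id: str
    num_steps: int
    success: bool


def length_adjusted_rewards(group: List[Rollout], alpha: float = 0.5) -> List[float]:
    """Assumed SPA-like rule: successful rollouts start at reward 1 and lose
    alpha times their relative excess length over the shortest success."""
    success_lengths = [r.num_steps for r in group if r.success]
    if not success_lengths:
        return [0.0 for _ in group]
    shortest = min(success_lengths)
    rewards = []
    for r in group:
        if r.success:
            excess = (r.num_steps - shortest) / r.num_steps
            rewards.append(1.0 - alpha * excess)
        else:
            rewards.append(0.0)
    return rewards


@dataclass
class FailureFilter:
    """Assumed FCF-like rule: drop a task after too many all-failure rounds."""
    max_failed_rounds: int = 3
    failed_rounds: Dict[str, int] = field(default_factory=dict)

    def update(self, task_id: str, group: List[Rollout]) -> None:
        if any(r.success for r in group):
            self.failed_rounds[task_id] = 0
        else:
            self.failed_rounds[task_id] = self.failed_rounds.get(task_id, 0) + 1

    def is_active(self, task_id: str) -> bool:
        return self.failed_rounds.get(task_id, 0) < self.max_failed_rounds


group = [Rollout("t1", 5, True), Rollout("t1", 10, True), Rollout("t1", 10, False)]
rewards = length_adjusted_rewards(group)
```

Under this assumed rule the shortest successful rollout keeps the full reward, so group-relative advantages favor efficient solutions without penalizing failures further.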

2.2. Multi-Objective RL and Actor-Critic Variants

For MEC and distributed VR settings:

Multi-Objective PPO: Maintains actor-critic pairs for each objective, optimizes convex combinations of objective-specific losses, and learns a preference-parameterized family of policies (Yang et al., 2023, Xu et al., 16 Oct 2025).
Preference Conditioning: Networks receive preference weights $\mathbf w$ as input, enabling on-the-fly scalarization between objectives (e.g., latency vs. energy).
Graph-Structured Policy Networks: Sparse graph neural networks (GNNs) encode heterogeneous entities (users, servers, spaces), which reduces the effective size of the combinatorial state-action space (Xu et al., 16 Oct 2025).
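
Preference conditioning with linear scalarization can be sketched as below. The sampling scheme (uniform over the simplex), the sign convention (rewards are negated costs), and the function names are assumptions for illustration rather than details confirmed by the cited works.

```python
import random
from typing import List, Sequence


def sample_preference(num_objectives: int, rng: random.Random) -> List[float]:
    """Sample weights uniformly from the probability simplex by normalizing
    i.i.d. exponential draws (equivalent to a flat Dirichlet)."""
    draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
    total = sum(draws)
    return [d / total for d in draws]


def scalarize(reward_vec: Sequence[float], weights: Sequence[float]) -> float:
    """Linear scalarization: weighted sum of per-objective rewards."""
    return sum(w * r for w, r in zip(weights, reward_vec))


def conditioned_input(state: Sequence[float], weights: Sequence[float]) -> List[float]:
    """Concatenate the preference vector to the state so one network can
    represent a family of policies indexed by the weights."""
    return list(state) + list(weights)


rng = random.Random(0)
w = sample_preference(2, rng)

# Hypothetical step outcome: latency 2.0 and energy 4.0, treated as costs.
reward_vec = [-2.0, -4.0]
r_scalar = scalarize(reward_vec, [0.25, 0.75])
x = conditioned_input([0.1, 0.2, 0.3], [0.25, 0.75])
```

Sampling a fresh weight vector per episode during training is one common way to let a single policy cover many trade-offs; at deployment, the user-specified weights are simply fed in as input.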

2.3. Curriculum and Reward Shaping

Stage-wise SFT+RL: A multi-stage pipeline (supervised fine-tuning (SFT) on demonstration trajectories, then reasoning/explanation SFT, then online RL) reduces cold-start inefficiency and stabilizes early training (Xu et al., 10 Sep 2025, Gu et al., 25 Jun 2025).
Reward Rescaling: Dense, progress-shaped, and adversarial failure-penalty rewards mitigate extreme sparsity and inform exploration (Wu et al., 25 May 2026, Gu et al., 25 Jun 2025).
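
One simple way to densify a sparse terminal reward is to pay out the change in a progress score at each step and subtract a penalty when the agent declares completion without succeeding. The sketch below assumes a progress score in [0, 1] supplied by a judge; the penalty value and function names are hypothetical and do not reproduce any cited system's reward.

```python
from typing import List, Sequence


def shaped_step_reward(prev_progress: float, progress: float,
                       declared_done: bool, success: bool,
                       failure_penalty: float = 0.5) -> float:
    """Progress delta, minus a penalty for declaring completion on a failed task."""
    reward = progress - prev_progress
    if declared_done and not success:
        reward -= failure_penalty
    return reward


def episode_rewards(progress_trace: Sequence[float], declared_done: bool,
                    success: bool, failure_penalty: float = 0.5) -> List[float]:
    """Per-step shaped rewards for one episode, starting from zero progress."""
    rewards = []
    prev = 0.0
    last = len(progress_trace) - 1
    for t, p in enumerate(progress_trace):
        done_now = declared_done and t == last
        rewards.append(shaped_step_reward(prev, p, done_now, success, failure_penalty))
        prev = p
    return rewards


ok = episode_rewards([0.25, 0.5, 1.0], declared_done=True, success=True)
bad = episode_rewards([0.25, 0.5, 0.5], declared_done=True, success=False)
```

Because the progress deltas telescope, the undiscounted return of a successful episode equals its final progress, so the shaping adds intermediate signal without changing which episodes score highest.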

3. Benchmarks, Datasets, and Evaluation Metrics

MobileGym-Bench: 416 task templates (160 train, 256 test) spanning 28 apps, with deterministic judges, dense rewards, and full side-effect analysis. Metrics: Success Rate (SR), Progress Rate (PR), False Complete (FC), Overdue Termination (OT), Unexpected Side Effects (USE) (Wu et al., 25 May 2026).
AndroidWorld-Generalization: CMDP regime partitioned by instance, template, and app; 20 apps, 116 templates; metrics: zero-shot SR over test splits (Gu et al., 8 Mar 2026).
Mobile-R1 Dataset/Benchmark: 28 Chinese apps, 24,521 annotated actions, 500 held-out evaluation trajectories; step and task accuracy, tail success, argument error (Gu et al., 25 Jun 2025).
MEC/VR Resource Scheduling: Evaluated via Pareto hypervolume, minimum/maximum latency/energy across problem instances; normalized hypervolume and dominated-volume metrics for front quality (Yang et al., 2023, Xu et al., 16 Oct 2025).
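
For two minimized objectives such as latency and energy, the hypervolume of a policy set is the area dominated by its outcomes and bounded by a reference point. The sketch below computes it with a sweep over sorted points. The reference point, the example values, and the normalization by the reference box (which assumes an ideal point at the origin) are illustrative assumptions; the cited works may normalize differently.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def hypervolume_2d(points: List[Point], reference: Point) -> float:
    """Area dominated by `points` (both objectives minimized), bounded by `reference`."""
    ref_x, ref_y = reference
    # Only points that improve on the reference in both objectives contribute.
    candidates = sorted(p for p in points if p[0] < ref_x and p[1] < ref_y)
    area = 0.0
    best_y = ref_y
    for x, y in candidates:  # ascending in the first objective
        if y < best_y:  # non-dominated so far: adds a new horizontal strip
            area += (ref_x - x) * (best_y - y)
            best_y = y
    return area


def normalized_hypervolume_2d(points: List[Point], reference: Point) -> float:
    """Hypervolume divided by the reference box area (assumes ideal point at 0,0)."""
    return hypervolume_2d(points, reference) / (reference[0] * reference[1])


# Hypothetical (latency, energy) outcomes of three policies.
front = [(1.0, 3.0), (2.0, 2.0), (3.0, 1.0)]
hv = hypervolume_2d(front, reference=(4.0, 4.0))
hv_norm = normalized_hypervolume_2d(front, reference=(4.0, 4.0))
```

A larger hypervolume means the front is closer to the ideal point, more spread out, or both, which is why it is used as a single-number summary of front quality.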

4. System Infrastructure and Simulation Environments

MobileGym: Browser-hosted, forkable simulation with layered JSON state, deterministic judging, and rapid instance launch (~3 s boot, 400 MB RAM per instance). It can run hundreds of agents in parallel for high rollout throughput. Snapshots encode only mutable overlays and OS state, which allows millisecond-level serialization (Wu et al., 25 May 2026).
AndroidWorld/AndroidLab: Native AVD clusters (up to 1,000 concurrent), containerized, with error recovery, asynchronous execution, and hot-snapshotting (Xu et al., 10 Sep 2025, Gu et al., 8 Mar 2026).
Sim-to-Real Transfer: Simulation-trained policies achieve >95% real-device performance retention on selected benchmarks, with JSON-based judges outperforming VLM-based vision judges in accuracy and determinism (Wu et al., 25 May 2026).

5. Generalization, Exploration, and Sample Efficiency

Zero-Shot and Few-Shot Transfer: RL-based MobileRL methods improve zero-shot SR most for new task instances (+26.1 pp), while gains shrink for unseen templates (+15.7 pp) and unseen apps (+8.3 pp), motivating interest in few-shot adaptation, meta-RL, and inductive bias (Gu et al., 8 Mar 2026, Xu et al., 10 Sep 2025).
Task-Level RL: Training with end-to-end task-level rewards (rather than only local action-level signals) enhances agent exploration, reduces convergence to local optima, and improves recovery from early mistakes ("eureka moves"), as captured in Mobile-R1's multi-turn RL methodology (Gu et al., 25 Jun 2025).
Positive Replay and Curriculum Filtering: AdaPR and FCF strategies improve efficiency in environments with sparse terminal rewards and heavy-tailed difficulty, accelerating convergence and stabilizing training (Xu et al., 10 Sep 2025).

6. Multi-Objective and Resource-Constrained MobileRL

MEC Resource Management: Multi-objective RL optimizes the energy-delay trade-off using scalarized rewards and Pareto-front recovery; the achieved hypervolume (80.7) surpasses the bandit (69.9) and heuristic (63.9) baselines, supporting the practicality of RL for dynamic, user-adaptive scheduling (Yang et al., 2023).
Spatial Computing for VR: RL-enhanced consistency models over GNNs efficiently produce deployment plans for multi-user VR applications with stringent latency/energy requirements, outperforming evolutionary and greedy baselines in normalized hypervolume and inference latency (Xu et al., 16 Oct 2025).
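
Pareto-front recovery amounts to discarding every candidate plan that another plan beats or ties on all objectives while strictly beating it on at least one. The sketch below applies this filter to (latency, energy) pairs; the example values are hypothetical and both objectives are assumed to be minimized.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is no worse than `b` in every objective and strictly
    better in at least one (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (latency, energy) outcomes of four scheduling policies.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 4.0), (11.0, 6.0)]
front = pareto_front(candidates)
```

This quadratic-time filter is adequate for the small candidate sets typical of per-preference evaluation; larger sets would usually call for a sort-based method.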

7. Limitations and Future Directions

Platform/Domain Constraints: Most MobileRL systems are Android-centric, often relying on programmatic GUI state or XML metadata, which does not generalize to pure-vision or cross-platform settings (Xu et al., 10 Sep 2025).
Reward Model Fidelity: VLM-based or rule-based reward models in simulation may introduce mislabels or fail to match real-user judgment, as highlighted by the higher error rates of VLM judges compared to deterministic JSON checks (Wu et al., 25 May 2026).
Experiment Scale: Despite high parallelism, scaling to broader app/device coverage and true multi-node distributed RL infrastructure remains an open engineering challenge (Gu et al., 8 Mar 2026).
Research Directions: Key frontiers include general-purpose reward verifiers, sim-to-real pipelines for diverse device form factors, pure task-level RL, and hybrid algorithms combining offline and online adaptation (Xu et al., 10 Sep 2025, Gu et al., 25 Jun 2025, Gu et al., 8 Mar 2026).

Representative MobileRL Systems and Benchmarks

System/Benchmark	Domain	Algorithmic Features	Metric Highlights
MobileRL-9B (Xu et al., 10 Sep 2025)	AndroidWorld, AndroidLab	ADAGRPO (SPA, AdaPR, FCF), multi-stage SFT+RL	SR 80.2% (AW), 53.6% (AL); top-1 state-of-the-art
MobileGym (Wu et al., 25 May 2026)	Simulated mobile GUIs	Forkable browser-hosted JSON-MDP sim, deterministic rewards	+12.8pp SR sim, 95% transfer retention real
Mobile-R1 (Gu et al., 25 Jun 2025)	Chinese app GUIs	Multi-turn, task-level RL (GRPO), VLM policy	49.4% task success, outperforms best baseline by +17pp
AndroidWorld-Generalization (Gu et al., 8 Mar 2026)	CMDP GUI agents	GRPO, zero/few-shot, async rollouts	+26.1pp instance, +15.7pp template, +8.3pp app generalization
MEC-MORL (Yang et al., 2023)	Edge resource sched.	Multi-objective PPO, scaling-invariant states	+233% hypervolume over random baseline
MO-CMPO (Xu et al., 16 Oct 2025)	VR over MEC	GNN-based consistency+RL, multi-objective	HV_norm 0.6659, 43s inference latency

MobileRL research continues to advance along axes of generalization, multi-objective optimization, and system infrastructure, systematically enabling robust, autonomous decision-making for mobile agentic systems in increasingly realistic and resource-constrained environments.


References (6)

MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research (2026)

MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents (2025)

Generalization in Online Reinforcement Learning for Mobile Agents (2026)

Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards (2025)

Multi-objective Deep Reinforcement Learning for Mobile Edge Computing (2023)

Spatial Computing Communications for Multi-User Virtual Reality in Distributed Mobile Edge Computing Network (2025)
