Multi-platform Reinforcement Policy Optimization

Updated 5 March 2026

Multi-platform Reinforcement Policy Optimization is a reinforcement learning framework that trains unified policies for diverse platforms, balancing domain-specific challenges and gradient interference.
It employs device-conditioned policies, grouped rollout PPO loss, and cyclic device curricula to decouple gradients and ensure stable multi-domain learning.
MRPO demonstrates superior performance over traditional methods in large language model reasoning and GUI automation by optimizing for sample efficiency and robust generalization.

Multi-platform Reinforcement Policy Optimization (MRPO) is a class of reinforcement learning (RL) algorithms designed for training unified agent policies capable of operating across multiple heterogeneous environments or device platforms. MRPO addresses the domain transfer, optimization instability, and sample efficiency challenges arising in multi-platform automated agents, as illustrated in large-scale LLM reasoning (Wang et al., 30 Jan 2026) and general-purpose GUI agent learning (Xu et al., 15 Feb 2026).

1. Definition and Motivation

MRPO refers to RL frameworks that enable a shared policy to function effectively on a family of platforms, each with potentially distinct observation and action spaces, environment dynamics, and reward semantics. The motivation for MRPO emerges in two principal domains: (a) expanding the reasoning and problem-solving capacity of LLMs beyond the low-rank bias manifold induced by their pre-training and fine-tuning procedures (Wang et al., 30 Jan 2026), and (b) scaling GUI agents to reliably execute tasks across mobile, desktop, and web environments with minimal domain-specific engineering or catastrophic interference (Xu et al., 15 Feb 2026).

A core challenge in both domains is "gradient interference," where naïve joint training across platforms or solution modes causes conflicting updates that degrade individual or joint task performance. Furthermore, conventional RL policies are susceptible to collapsing into low-complexity latent regions (in LLMs) or platform-specific solutions, limiting transfer and generalization.
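Gradient interference as described above can be checked numerically by comparing per-platform gradients. The following is a minimal sketch, assuming gradients are available as flattened vectors; the function names and the toy gradient values are illustrative and do not come from either cited paper.

```python
import numpy as np

def gradient_cosine(g_a: np.ndarray, g_b: np.ndarray) -> float:
    """Cosine similarity between two flattened gradient vectors."""
    denom = np.linalg.norm(g_a) * np.linalg.norm(g_b)
    return float(g_a @ g_b / denom) if denom > 0 else 0.0

def interferes(g_a: np.ndarray, g_b: np.ndarray) -> bool:
    """True when the two updates point in conflicting directions."""
    return gradient_cosine(g_a, g_b) < 0.0

# Hypothetical per-platform gradients for a 3-parameter policy.
g_mobile = np.array([1.0, 0.5, 0.0])
g_desktop = np.array([-1.0, 0.5, 0.0])

# Naive joint training averages the two gradients; the conflicting
# first coordinate cancels, so neither platform makes progress on it.
joint = 0.5 * (g_mobile + g_desktop)
```

In this toy case the cosine similarity is negative and the first coordinate of the averaged update is zero, which is the kind of cancellation that joint training across platforms can produce.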

2. Algorithmic Framework

MRPO deploys device- or manifold-conditioned policy optimization combined with architectural and procedural mechanisms to control interference and support high-dimensional solution discovery. The GUI-agent instantiation (Xu et al., 15 Feb 2026) is representative.

The optimization alternates across device families or solution modes, preventing unstable gradient coupling. Key innovations include:

Device-/manifold-conditioned policy: $\pi_\theta(a\mid o, d)$ explicitly conditions on the platform or inferred latent subspace, enabling the backbone to represent device-specific semantics (Xu et al., 15 Feb 2026).
Grouped Rollout PPO (GRPO) loss: Group-averaged surrogate advantage is used per platform/task, along with KL control from the previous policy iterate. The loss is:

$\mathcal L_{\mathrm{GRPO}(G_n)} = -\frac1n\sum_{\tau\in G_n}\big(Z(\tau)-\bar Z\big)\,\sum_{t=1}^T\log\pi_\theta(a_t\mid o_t,d) + \beta\,\mathrm{KL}[\pi_\theta\|\pi_{\theta_{\text{old}}}]$

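where $Z(\tau)$ is the terminal success of rollout $\tau$, $\bar Z$ is the mean of $Z$ over the group, $G_n$ is a subsample of $n$ rollouts, and $\beta$ weights the KL penalty (Xu et al., 15 Feb 2026).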

Oversample-then-select buffer: Online oversampling and unbiased subsampling to avoid collapsed rollout groups and maintain unbiasedness without off-policy data (Xu et al., 15 Feb 2026).
Alternating device curriculum: Cyclically optimize on a single platform at each stage to decouple gradients and stabilize multi-platform learning (Xu et al., 15 Feb 2026).
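The grouped loss above can be evaluated directly from per-rollout quantities. The sketch below only computes the loss value with NumPy. In real training the log-probabilities would come from a differentiable policy, and the function name, argument layout, and default `beta` are assumptions made for illustration.

```python
import numpy as np

def grpo_loss(logp_sums, successes, kl, beta=0.01):
    """Evaluate the grouped loss for one rollout group G_n.

    logp_sums : shape (n,), sum_t log pi_theta(a_t | o_t, d) per rollout
    successes : shape (n,), terminal success Z(tau) per rollout
    kl        : scalar estimate of KL[pi_theta || pi_theta_old]
    beta      : KL weight (hypothetical default, not from the source)
    """
    logp_sums = np.asarray(logp_sums, dtype=float)
    z = np.asarray(successes, dtype=float)
    adv = z - z.mean()                      # Z(tau) - Z_bar
    return float(-(adv * logp_sums).mean() + beta * kl)

# Mixed group: two successes, two failures.
mixed = grpo_loss([-2.0, -4.0, -3.0, -5.0], [1, 0, 1, 0], kl=0.1, beta=0.1)

# Collapsed group: every rollout succeeds, so all advantages are zero
# and only the KL term remains.
collapsed = grpo_loss([-2.0, -4.0], [1, 1], kl=0.1, beta=0.1)
```

The collapsed case shows why groups with identical outcomes contribute no policy-gradient signal, which is the situation the oversample-then-select buffer is meant to avoid.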

3. Addressing Platform and Latent Space Conflicts

MRPO mitigates multi-domain and latent space conflicts through a mixture of architectural, sampling, and optimization measures:

Mechanism	Purpose	Reference
Device-conditioned policy	Disambiguates actions/observations for $d$	(Xu et al., 15 Feb 2026)
Alternating device optimization	Avoids interference between per-device gradients $g_d$	(Xu et al., 15 Feb 2026)
Token-ID transport	Aligns train/infer tokenization	(Xu et al., 15 Feb 2026)
Spectral Orthogonal Exploration	Ejects LLM policy from low-rank manifold	(Wang et al., 30 Jan 2026)
Effective Rank regularizer	Maintains high latent dimensionality	(Wang et al., 30 Jan 2026)

In LLMs, the bias manifold $M_{\text{bias}} = \mathrm{span}\{v_1,\ldots,v_k\}$, with $k \ll d$ (here $d$ is the latent dimension, not the device index used elsewhere), emerges from pre-training and supervised data and restricts exploration. MRPO breaks this ceiling using Spectral Orthogonal Exploration (SOE), which projects synthetic traces into the null space $N = M_{\text{bias}}^\perp$ and fine-tunes on these high-rank trajectories (Wang et al., 30 Jan 2026).
In GUI agents, direct mixing of gradient signals across action sets (e.g., mobile tap vs. desktop click) produces "tug-of-war" instability. Device-conditional optimization and alternating curriculums prevent negative gradient inner products and promote stable, platform-specific learning (Xu et al., 15 Feb 2026).
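The null-space projection mentioned for the LLM case can be illustrated with basic linear algebra. This is a minimal sketch, assuming the bias directions $v_1,\ldots,v_k$ are given as linearly independent columns of a matrix `V`. How SOE actually estimates those directions or applies the projection is not specified here, so the function name and the random test data are illustrative only.

```python
import numpy as np

def null_space_projector(V: np.ndarray) -> np.ndarray:
    """Return P = I - Q Q^T, the projector onto the orthogonal
    complement of span(V). V has shape (d, k) with k <= d and is
    assumed to have full column rank."""
    Q, _ = np.linalg.qr(V)          # reduced QR: Q is (d, k), orthonormal
    return np.eye(V.shape[0]) - Q @ Q.T

rng = np.random.default_rng(0)
V = rng.normal(size=(8, 2))         # hypothetical bias directions, k=2, d=8
h = rng.normal(size=8)              # hypothetical hidden-state vector

P = null_space_projector(V)
h_perp = P @ h                      # component of h outside M_bias
```

By construction `h_perp` has zero inner product with every column of `V`, so it lies in $N = M_{\text{bias}}^\perp$.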

4. Mathematical Formulation

In LLM MRPO (Wang et al., 30 Jan 2026):

For each reasoning trajectory $y$, the total reward is

[equation not recoverable from the source]

in which one term is the correctness indicator; the definition of the remaining term is likewise not recoverable. The effective rank of a trajectory is computed by

[equation not recoverable from the source]

from the normalized eigenvalues of the hidden-state covariance.
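For orientation, a widely used entropy-based definition of effective rank is $\exp\big(-\sum_i p_i \log p_i\big)$ with $p_i = \lambda_i / \sum_j \lambda_j$, where $\lambda_i$ are the covariance eigenvalues. The sketch below assumes that definition; it is not confirmed to be the exact formula used by Wang et al., and the function name and toy inputs are illustrative.

```python
import numpy as np

def effective_rank(hidden, eps=1e-12):
    """Entropy-based effective rank of the covariance of hidden states.

    hidden : array of shape (T, d), one hidden state per step.
    Assumes the exp-of-entropy definition over normalized eigenvalues.
    """
    H = np.asarray(hidden, dtype=float)
    H = H - H.mean(axis=0, keepdims=True)          # center the states
    cov = H.T @ H / max(H.shape[0] - 1, 1)         # (d, d) covariance
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eig.sum()
    if total <= eps:
        return 0.0
    p = eig / total                                # normalized eigenvalues
    p = p[p > eps]                                 # drop numerical zeros
    return float(np.exp(-(p * np.log(p)).sum()))

# All states lie along one direction: effective rank close to 1.
line = np.outer(np.arange(5.0), np.array([1.0, 2.0, 3.0]))

# Equal variance along two orthogonal axes: effective rank close to 2.
cross = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

Under this definition a trajectory whose hidden states collapse onto a single direction scores near 1, while states spread evenly over more directions score proportionally higher.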

In multi-platform GUI MRPO (Xu et al., 15 Feb 2026):

The optimization objective is

[equation not recoverable from the source]

under grouped PPO, with gradient updates restricted to one device at each stage.
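Restricting updates to one device per stage amounts to a cyclic schedule over platforms. The following is a minimal sketch under that reading; the stage granularity, the device order, and the `update_fn` callback are assumptions, not details taken from the source.

```python
from itertools import cycle

def alternating_schedule(devices, stages):
    """Return the device optimized at each stage, cycling in order."""
    order = cycle(devices)
    return [next(order) for _ in range(stages)]

def train(devices, stages, update_fn):
    """Call update_fn(device) once per stage, one device at a time,
    so gradients from different devices are never mixed in a step."""
    return [update_fn(d) for d in alternating_schedule(devices, stages)]

schedule = alternating_schedule(["mobile", "desktop", "web"], 5)
# schedule == ["mobile", "desktop", "web", "mobile", "desktop"]

# Hypothetical update that just records which device was optimized.
log = train(["mobile", "desktop"], 4, lambda d: f"updated:{d}")
```

A dynamic variant could reorder the cycle by task difficulty or instability, but the fixed round-robin shown here is the simplest form consistent with the description.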

Online oversample-and-select guarantees unbiased estimator properties:

[equation not recoverable from the source]
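One ingredient of the unbiasedness claim can be illustrated on its own: a uniformly random subsample of an on-policy pool has, in expectation, the same mean as the pool. The sketch below checks this by exact enumeration and includes a helper for detecting collapsed groups. The source does not state how selection and collapse avoidance are combined, so the two pieces are kept separate, and all names here are hypothetical.

```python
import random
from itertools import combinations

def is_collapsed(successes):
    """A group is collapsed when all outcomes agree, so Z - Z_bar is zero."""
    return len(set(successes)) <= 1

def oversample_then_select(sample_rollout, n, factor, rng):
    """Draw factor * n on-policy rollouts, then keep n uniformly at random."""
    pool = [sample_rollout() for _ in range(factor * n)]
    return pool, rng.sample(pool, n)

def mean_over_all_subsets(pool, n):
    """Average of subset means over every size-n subset of the pool.
    Equals the expected mean of a uniform subsample of size n."""
    subsets = list(combinations(pool, n))
    return sum(sum(s) / n for s in subsets) / len(subsets)

rng = random.Random(0)
pool, group = oversample_then_select(lambda: rng.randint(0, 1), n=4, factor=3, rng=rng)

# Exact check on a fixed toy pool of terminal successes.
toy_pool = [1, 0, 0, 1, 1, 0]
expected_subsample_mean = mean_over_all_subsets(toy_pool, 2)
pool_mean = sum(toy_pool) / len(toy_pool)
```

Here `expected_subsample_mean` equals `pool_mean`, which is the sense in which uniform selection from fresh on-policy rollouts introduces no bias.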

5. Empirical Results and Ablations

Across both domains, MRPO matches or exceeds the listed baselines on most benchmarks. The exception in the tables below is AndroidWorld, where the prior state of the art remains higher (73.3% vs. 71.6%).

LLM Reasoning (Qwen3-4B/MRPO on math):

Model	AIME24	AIME25	MATH-500	Olympiad	Omni-Hard	Mean
Qwen3-4B + GRPO	46.7%	36.7%	87.6%	42.1%	16.8%	46.0%
Qwen3-32B	33.3%	30.0%	79.8%	35.3%	10.8%	37.8%
Qwen3-4B + MRPO	56.7%	43.3%	88.8%	43.0%	17.4%	49.8%

Pass@32 on AIME 2024: Pure GRPO (80.4%), MRPO (89.1%) (Wang et al., 30 Jan 2026).
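The Mean column in the reasoning table can be reproduced from the five benchmark scores. The short check below uses only the numbers reported in the table; the dictionary keys are shorthand labels for the rows.

```python
# Benchmark order: AIME24, AIME25, MATH-500, Olympiad, Omni-Hard.
scores = {
    "Qwen3-4B + GRPO": [46.7, 36.7, 87.6, 42.1, 16.8],
    "Qwen3-32B": [33.3, 30.0, 79.8, 35.3, 10.8],
    "MRPO": [56.7, 43.3, 88.8, 43.0, 17.4],   # the MRPO row
}

# Unweighted mean over the five benchmarks, rounded to one decimal.
means = {name: round(sum(v) / len(v), 1) for name, v in scores.items()}

# Absolute gain of MRPO over GRPO on the same 4B base model.
gain = round(means["MRPO"] - means["Qwen3-4B + GRPO"], 1)
```

The recomputed means are 46.0, 37.8, and 49.8, matching the table, and the gain of MRPO over GRPO on the same base model is 3.8 points.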

GUI Multi-platform Agents (GUI-Owl-1.5):

Model (32B-Instruct)	OSWorld	AndroidWorld	WebArena
MRPO	56.5%	71.6%	46.7%
Prior SOTA	53.1%	73.3%	40.2%

Ablation findings indicate that alternating optimization and unstable-task prioritization converge more rapidly and stably than naïve joint training. This suggests decoupling updates and focusing on high-variance tasks accelerates multi-domain RL convergence (Xu et al., 15 Feb 2026).

6. Limitations and Extensions

Identified constraints include:

Lack of off-policy sample reuse limits data efficiency; for example, Xu et al. (15 Feb 2026) use no replay buffer.
MRPO introduces additional engineering complexity (real-time SVD, student–teacher orchestration, large pool tracking).
Hyperparameter sensitivity (oversample factor, group size) is incompletely characterized.
For LLMs, exploration outside the bias manifold risks violating alignment or safety constraints (Wang et al., 30 Jan 2026).

Proposed extensions:

Integrate learned advantage critics to reduce variance.
Introduce hierarchical decomposition: high-level planning with MRPO-trained executors.
Adapt cyclic curriculum to dynamically prioritize hardest domains or instability regions.
Explore automated certifiers for ethical alignment in the null-space and efficient spectral estimators (Wang et al., 30 Jan 2026, Xu et al., 15 Feb 2026).

7. Conceptual Significance and Impact

MRPO signifies a shift toward geometric and domain-aware RL, making it feasible to train a single agent policy with robust cross-platform generalization. In LLMs, MRPO is presented as evidence for a "Geometric Scaling Law": reasoning capacity depends more on latent state dimensionality than on raw parameter count. In GUI automation, open-sourced MRPO agents provide unified, high-performance baselines for practical cloud-edge environments. MRPO's strategies (alternating device curricula, oversample-then-select, and effective-rank incentives) are offered as principled responses to the optimization pathologies specific to multi-domain and high-dimensional policy learning (Wang et al., 30 Jan 2026, Xu et al., 15 Feb 2026).


References

Beyond Alignment: Expanding Reasoning Capacity via Manifold-Reshaping Policy Optimization (2026)

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents (2026)
