
Multi-platform Reinforcement Policy Optimization

Updated 5 March 2026
  • Multi-platform Reinforcement Policy Optimization is a reinforcement learning framework that trains unified policies for diverse platforms, balancing domain-specific challenges and gradient interference.
  • It employs device-conditioned policies, grouped rollout PPO loss, and cyclic device curricula to decouple gradients and ensure stable multi-domain learning.
  • MRPO demonstrates superior performance over traditional methods in large language model reasoning and GUI automation by optimizing for sample efficiency and robust generalization.

Multi-platform Reinforcement Policy Optimization (MRPO) is a class of reinforcement learning (RL) algorithms designed for training unified agent policies capable of operating across multiple heterogeneous environments or device platforms. MRPO addresses the domain transfer, optimization instability, and sample efficiency challenges arising in multi-platform automated agents, as illustrated in large-scale LLM reasoning (Wang et al., 30 Jan 2026) and general-purpose GUI agent learning (Xu et al., 15 Feb 2026).

1. Definition and Motivation

MRPO refers to RL frameworks that enable a shared policy to function effectively on a family of platforms, each with potentially distinct observation and action spaces, environment dynamics, and reward semantics. The motivation for MRPO emerges in two principal domains: (a) expanding the reasoning and problem-solving capacity of LLMs beyond the low-rank bias manifold induced by their pre-training and fine-tuning procedures (Wang et al., 30 Jan 2026), and (b) scaling GUI agents to reliably execute tasks across mobile, desktop, and web environments with minimal domain-specific engineering or catastrophic interference (Xu et al., 15 Feb 2026).

A core challenge in both domains is "gradient interference," where naïve joint training across platforms or solution modes causes conflicting updates that degrade individual or joint task performance. Furthermore, conventional RL policies are susceptible to collapsing into low-complexity latent regions (in LLMs) or platform-specific solutions, limiting transfer and generalization.

2. Algorithmic Framework

MRPO deploys device- or manifold-conditioned policy optimization combined with architectural and procedural mechanisms to control interference and support high-dimensional solution discovery. A representative MRPO pseudocode structure is as follows (Xu et al., 15 Feb 2026):

[pseudocode listing not recoverable from the source]

The optimization alternates across device families or solution modes, preventing unstable gradient coupling. Key innovations include:

  • Device-/manifold-conditioned policy: $\pi_\theta(a\mid o, d)$ explicitly conditions on the platform or inferred latent subspace, enabling the backbone to represent device-specific semantics (Xu et al., 15 Feb 2026).
  • Grouped Rollout PPO (GRPO) loss: Group-averaged surrogate advantage is used per platform/task, along with KL control from the previous policy iterate. The loss is:

$$\mathcal L_{\mathrm{GRPO}(G_n)} = -\frac{1}{n}\sum_{\tau\in G_n}\big(Z(\tau)-\bar Z\big)\,\sum_{t=1}^{T}\log\pi_\theta(a_t\mid o_t,d) + \beta\,\mathrm{KL}[\pi_\theta\,\|\,\pi_{\theta_{\text{old}}}]$$

where $Z(\tau)$ is the terminal success of rollout $\tau$, $\bar Z$ is the mean of $Z$ over the group, and $G_n$ is a subsample of $n$ rollouts (Xu et al., 15 Feb 2026).

  • Oversample-then-select buffer: Online oversampling and unbiased subsampling to avoid collapsed rollout groups and maintain unbiasedness without off-policy data (Xu et al., 15 Feb 2026).
  • Alternating device curriculum: Cyclically optimize on a single platform at each stage to decouple gradients and stabilize multi-platform learning (Xu et al., 15 Feb 2026).
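As a rough illustration of the grouped loss and the alternating device curriculum listed above, the following minimal sketch evaluates the loss for one rollout group and builds a cyclic device schedule. The function names `grpo_loss` and `device_schedule`, the default `beta`, and the assumption that the KL term arrives as a precomputed scalar are all hypothetical choices made for this example; they are not taken from Xu et al.

```python
import itertools
from typing import List, Sequence


def grpo_loss(returns: Sequence[float],
              logp_sums: Sequence[float],
              kl: float = 0.0,
              beta: float = 0.01) -> float:
    """Grouped loss for one rollout group G_n on a single device.

    returns[i]   -- Z(tau_i), terminal success of rollout i
    logp_sums[i] -- sum over t of log pi_theta(a_t | o_t, d) for rollout i
    kl           -- KL[pi_theta || pi_theta_old], assumed precomputed elsewhere
    beta         -- KL weight (illustrative default, not from the source)
    """
    n = len(returns)
    if n == 0 or n != len(logp_sums):
        raise ValueError("need a non-empty group with one log-prob sum per rollout")
    z_bar = sum(returns) / n  # group-mean baseline
    surrogate = sum((z - z_bar) * lp for z, lp in zip(returns, logp_sums)) / n
    return -surrogate + beta * kl


def device_schedule(devices: Sequence[str], stages: int) -> List[str]:
    """Cyclic curriculum: exactly one device is optimized per stage."""
    return list(itertools.islice(itertools.cycle(devices), stages))


if __name__ == "__main__":
    # One successful and one failed rollout in the group.
    print(grpo_loss([1.0, 0.0], [-1.0, -3.0]))
    print(device_schedule(["mobile", "desktop", "web"], 5))
```

Note that a group in which every rollout has the same outcome yields zero advantage for every trajectory, so the surrogate term vanishes; this is the "collapsed group" case that the oversample-then-select buffer is meant to avoid.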

3. Addressing Platform and Latent Space Conflicts

MRPO mitigates multi-domain and latent space conflicts through a mixture of architectural, sampling, and optimization measures:

| Mechanism | Purpose | Reference |
|---|---|---|
| Device-conditioned policy | Disambiguates actions/observations for device $d$ | (Xu et al., 15 Feb 2026) |
| Alternating device optimization | Avoids interference between per-device gradients $g_d$ | (Xu et al., 15 Feb 2026) |
| Token-ID transport | Aligns train/infer tokenization | (Xu et al., 15 Feb 2026) |
| Spectral Orthogonal Exploration | Ejects LLM policy from low-rank manifold | (Wang et al., 30 Jan 2026) |
| Effective Rank regularizer | Maintains high latent dimensionality | (Wang et al., 30 Jan 2026) |

  • In LLMs, the bias manifold $M_{\text{bias}} = \mathrm{span}\{v_1,\ldots,v_k\}$, with $k \ll d$ (here $d$ denotes the hidden dimension, not the device index), emerges from pre-training/supervised data, restricting exploration. MRPO breaks this ceiling using Spectral Orthogonal Exploration (SOE), which projects synthetic traces into the null space $N = M_{\text{bias}}^\perp$ and fine-tunes on these high-rank trajectories (Wang et al., 30 Jan 2026).
  • In GUI agents, direct mixing of gradient signals across action sets (e.g., mobile tap vs. desktop click) produces "tug-of-war" instability. Device-conditional optimization and alternating curriculums prevent negative gradient inner products and promote stable, platform-specific learning (Xu et al., 15 Feb 2026).
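The projection into the orthogonal complement of the bias manifold described above can be sketched with a standard orthogonal projector. This is a generic linear-algebra illustration, not the SOE procedure of Wang et al.: the function name `null_space_projector` is hypothetical, the basis vectors are assumed to be given as the columns of a full-column-rank matrix, and how the source obtains those vectors or applies the projection to synthetic traces is not specified here.

```python
import numpy as np


def null_space_projector(bias_vectors: np.ndarray) -> np.ndarray:
    """Return P = I - Q Q^T, the orthogonal projector onto the complement of M_bias.

    bias_vectors -- array of shape (d, k) whose columns v_1..v_k span M_bias
                    (assumed to have full column rank, with k <= d).
    """
    # Reduced QR gives an orthonormal basis Q (d x k) for the column space.
    q, _ = np.linalg.qr(bias_vectors)
    d = bias_vectors.shape[0]
    return np.eye(d) - q @ q.T


if __name__ == "__main__":
    # Toy example: M_bias is the x-y plane in three dimensions.
    v = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0]])
    p = null_space_projector(v)
    h = np.array([1.0, 2.0, 3.0])  # stand-in for a hidden-state vector
    print(p @ h)                   # only the component outside M_bias remains
```

Applying `P` twice gives the same result as applying it once, and `P` sends every vector of the bias manifold to zero, which is what "projecting into the null space" requires.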

4. Mathematical Formulation

In LLM MRPO (Wang et al., 30 Jan 2026):

  • For each reasoning trajectory yy, the total reward is

[equation not recoverable from the source]

where the reward includes a correctness indicator (the symbols and their definitions are likewise not recoverable). The effective rank of a trajectory is computed by

[equation not recoverable from the source]

in terms of the normalized eigenvalues of the hidden state covariance.
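The effective-rank formula is not legible in this section. A widely used definition, assumed here rather than confirmed by the source, is the exponential of the Shannon entropy of the normalized eigenvalues $p_i$ of the hidden-state covariance, $\exp\big(-\sum_i p_i \log p_i\big)$. The sketch below computes that quantity; the function name `effective_rank`, the `eps` threshold, and the convention of returning 0 for a constant trajectory are choices made for this example.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a trajectory's hidden states.

    hidden_states -- array of shape (T, d): one d-dimensional state per step.
    Assumed definition: exp(-sum_i p_i log p_i), with p_i the eigenvalues of
    the covariance matrix normalized to sum to one.
    """
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(hidden_states.shape[0] - 1, 1)
    # eigvalsh handles symmetric matrices; clip tiny negative round-off.
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    total = eigvals.sum()
    if total <= eps:
        return 0.0  # constant trajectory: no spread (convention chosen here)
    p = eigvals / total
    p = p[p > eps]  # 0 * log 0 is treated as 0
    return float(np.exp(-(p * np.log(p)).sum()))


if __name__ == "__main__":
    line = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
    cross = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
    print(effective_rank(line), effective_rank(cross))
```

Under this definition, states that vary along a single direction have effective rank 1, while states spread equally over $m$ orthogonal directions have effective rank $m$.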

In multi-platform GUI MRPO (Xu et al., 15 Feb 2026):

  • The optimization objective is

[equation not recoverable from the source]

under grouped PPO, with gradient updates restricted to one device at each stage.

  • Online oversample-and-select guarantees unbiased estimator properties:

[equation not recoverable from the source]
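One possible reading of oversample-then-select is sketched below: collect more on-policy rollouts than the group size, discard the pool if every rollout has the same outcome, and otherwise draw the training group uniformly without replacement. The function name `oversample_then_select`, the `Rollout` tuple, and the rejection rule are assumptions for illustration; the exact selection rule and unbiasedness statement of Xu et al. are not recoverable from this section.

```python
import random
from typing import List, Optional, Sequence, Tuple

Rollout = Tuple[str, float]  # (trajectory id, terminal success Z); hypothetical type


def oversample_then_select(pool: Sequence[Rollout],
                           n: int,
                           rng: random.Random) -> Optional[List[Rollout]]:
    """Pick a group of n rollouts from an oversampled on-policy pool.

    Returns None when the pool is collapsed (all outcomes identical), because
    every group-relative advantage Z - Z_bar would then be zero.
    """
    if n <= 0 or len(pool) < n:
        raise ValueError("pool must contain at least n >= 1 rollouts")
    outcomes = {z for _, z in pool}
    if len(outcomes) < 2:
        return None
    # Uniform sampling without replacement: each rollout is equally likely to
    # be chosen, so the group mean of Z is an unbiased estimate of the pool mean.
    return rng.sample(list(pool), n)


if __name__ == "__main__":
    pool = [("a", 1.0), ("b", 0.0), ("c", 1.0), ("d", 0.0)]
    print(oversample_then_select(pool, 2, random.Random(0)))
```

A uniformly drawn group can still happen to contain identical outcomes; re-drawing until the group is mixed would remove that case but would no longer be a uniform subsample, so this sketch leaves it as is.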

5. Empirical Results and Ablations

Across both domains, MRPO matches or exceeds the reported baselines on most benchmarks. The one exception in the tables below is AndroidWorld, where the prior state of the art is 1.7 points higher.

LLM Reasoning (Qwen3-4B/MRPO on math):

| Model | AIME24 | AIME25 | MATH-500 | Olympiad | Omni-Hard | Mean |
|---|---|---|---|---|---|---|
| Qwen3-4B + GRPO | 46.7% | 36.7% | 87.6% | 42.1% | 16.8% | 46.0% |
| Qwen3-32B | 33.3% | 30.0% | 79.8% | 35.3% | 10.8% | 37.8% |
| MRPO (Ours) | 56.7% | 43.3% | 88.8% | 43.0% | 17.4% | 49.8% |

Pass@32 on AIME 2024: Pure GRPO (80.4%), MRPO (89.1%) (Wang et al., 30 Jan 2026).

GUI Multi-platform Agents (GUI-Owl-1.5):

| Model (32B-Instruct) | OSWorld | AndroidWorld | WebArena |
|---|---|---|---|
| MRPO | 56.5% | 71.6% | 46.7% |
| Prior SOTA | 53.1% | 73.3% | 40.2% |

Ablation findings indicate that alternating optimization and unstable-task prioritization converge more rapidly and stably than naïve joint training. This suggests decoupling updates and focusing on high-variance tasks accelerates multi-domain RL convergence (Xu et al., 15 Feb 2026).
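To make "unstable-task prioritization" concrete, the sketch below weights tasks by the Bernoulli variance $p(1-p)$ of their empirical success rate $p$, so tasks that succeed about half the time are sampled most often. This weighting rule and the name `task_sampling_weights` are assumptions for illustration; the source does not state how instability is measured.

```python
from typing import Dict


def task_sampling_weights(success_rates: Dict[str, float]) -> Dict[str, float]:
    """Sampling weights proportional to the outcome variance p * (1 - p).

    success_rates -- empirical success rate in [0, 1] for each task id.
    Falls back to uniform weights when every task has zero variance.
    """
    if not success_rates:
        raise ValueError("need at least one task")
    variances = {task: p * (1.0 - p) for task, p in success_rates.items()}
    total = sum(variances.values())
    if total == 0.0:
        uniform = 1.0 / len(success_rates)
        return {task: uniform for task in success_rates}
    return {task: v / total for task, v in variances.items()}


if __name__ == "__main__":
    print(task_sampling_weights({"open_app": 0.5, "login": 1.0, "checkout": 0.1}))
```

Tasks that always succeed or always fail receive zero weight under this rule, which is consistent with them contributing no group-relative advantage signal.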

6. Limitations and Extensions

Identified constraints include:

  • Lack of off-policy sample reuse limits data efficiency; for example, Xu et al. (15 Feb 2026) use no replay buffer.
  • MRPO introduces additional engineering complexity (real-time SVD, student–teacher orchestration, large pool tracking).
  • Hyperparameter sensitivity (oversample factor, group size) is incompletely characterized.
  • For LLMs, exploration outside the bias manifold risks violating alignment or safety constraints (Wang et al., 30 Jan 2026).

Proposed extensions:

  • Integrate learned advantage critics to reduce variance.
  • Introduce hierarchical decomposition: high-level planning with MRPO-trained executors.
  • Adapt cyclic curriculum to dynamically prioritize hardest domains or instability regions.
  • Explore automated certifiers for ethical alignment in the null-space and efficient spectral estimators (Wang et al., 30 Jan 2026, Xu et al., 15 Feb 2026).

7. Conceptual Significance and Impact

MRPO signifies a shift toward geometric and domain-aware RL, making it feasible to train a single agent policy with robust cross-platform generalization. In LLMs, MRPO supports a "Geometric Scaling Law": reasoning capacity depends more on latent state dimensionality than on raw parameter count. In GUI automation, open-sourced MRPO agents provide unified, high-performance baselines for practical cloud-edge environments. MRPO's strategies (alternating device curricula, oversample-then-select, and effective-rank incentives) offer principled responses to the optimization pathologies specific to multi-domain and high-dimensional policy learning (Wang et al., 30 Jan 2026, Xu et al., 15 Feb 2026).
