Unified Reinforcement Learning: UniGRPO
- UniGRPO is a unified reinforcement learning framework that systematically integrates training protocols, policy designs, and evaluation benchmarks across various tasks.
- It employs modular and hierarchical policy structures alongside group relative optimization methods to enhance safety, efficiency, and adaptability in real-world applications.
- The framework supports reproducible research through comprehensive benchmarks, open-source libraries, and advanced representation learning techniques.
Unified Reinforcement Learning (UniGRPO) refers to algorithmic frameworks and design principles that systematically unify training procedures, model architectures, evaluation standards, and practical deployment strategies across diverse reinforcement learning (RL) tasks, modalities, and environments. This unification enables seamless cross-task optimization, transfer, and generalization, often by integrating shared representations, modular training pipelines, and coordinated reward modeling. UniGRPO encompasses approaches ranging from universal RL agents for partially observable environments to pipelines for multimodal learning, modular policy structures in control and robotics, and generalized benchmarking ecosystems for RL research.
1. Unified Theoretical Frameworks
Multiple works establish a common formalism for RL agents that minimizes assumptions about the environment, generalizing beyond the Markov Decision Process (MDP) paradigm. In universal reinforcement learning (URL) (Aslanides et al., 2017), histories are formalized as sequences of action–percept pairs, $h_{<t} = a_1 e_1 a_2 e_2 \cdots a_{t-1} e_{t-1}$ with percepts $e_k = (o_k, r_k)$ bundling observation and reward, and agent policies as distributions $\pi(a_t \mid h_{<t})$. Fundamental constructs include the history-dependent value function

$$V^{\pi}_{\mu}(h_{<t}) = \mathbb{E}^{\pi}_{\mu}\!\left[\sum_{k=t}^{\infty} \gamma^{\,k-t}\, r_k \,\middle|\, h_{<t}\right]$$

and the expectimax recursion for the optimal value function

$$V^{*}_{\mu}(h_{<t}) = \max_{a_t} \sum_{e_t} \mu(e_t \mid h_{<t} a_t)\left[r_t + \gamma\, V^{*}_{\mu}(h_{1:t})\right].$$
The central universal Bayesian agent, AIXI, constructs its policy via a Bayes mixture over a class of computable environments and incorporates a complexity prior, enabling principled exploration in arbitrary settings. This unified notation and framework facilitate analysis, empirical comparison, and systematic implementation for both history-based and partially observable RL agents.
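To make the recursion concrete, the following is a minimal Python sketch of a truncated expectimax computation over histories. The `EnvModel` interface and all names are illustrative assumptions, standing in for a (mixture) environment model that exposes percept probabilities conditioned on the full history; this is not the implementation of any specific URL agent.

```python
from typing import Dict, List, Tuple

Percept = Tuple[str, float]                  # (observation, reward)
History = Tuple[Tuple[str, Percept], ...]    # sequence of (action, percept) pairs

class EnvModel:
    """Hypothetical environment model nu; must expose nu(e | h, a)."""
    def percept_probs(self, history: History, action: str) -> Dict[Percept, float]:
        raise NotImplementedError

def expectimax_value(model: EnvModel, history: History,
                     actions: List[str], horizon: int, gamma: float = 0.95) -> float:
    """V*(h) = max_a sum_e nu(e | h a) [r + gamma * V*(h a e)], truncated at `horizon`."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        q = 0.0
        for percept, prob in model.percept_probs(history, a).items():
            _, reward = percept
            q += prob * (reward + gamma * expectimax_value(
                model, history + ((a, percept),), actions, horizon - 1, gamma))
        best = max(best, q)
    return best
```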
2. Modular and Hierarchical Policy Structures
To address challenges in scalable deployment, several studies propose hierarchical or modular RL architectures that unify behavior and control across differing timescales and task granularities. In multi-timescale hierarchical RL for autonomous driving (Jin et al., 30 Jun 2025), the policy is split:
- High-level policy produces long-timescale, hybrid actions (position/lane guidance) mapped to explicit motion trajectories.
- Low-level policy operates on extended state inputs, incrementally updating control commands (steering, acceleration) while referencing high-level guidance.
Motion guidance maps each hybrid high-level action, a discrete behavior paired with a continuous target, to an explicit reference trajectory. Hierarchical safety mechanisms evaluate risk with artificial potential fields and proactively correct guidance or control actions when the risk metric exceeds an adaptive threshold. Safety-aware termination functions ensure coordinated policy switching. This architecture improves action consistency, driving efficiency, and safety metrics over baseline policies.
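The split can be illustrated with a short, purely illustrative sketch; the function names, observation keys, lane width, and thresholds below are assumptions, not the paper's implementation. A high-level hybrid action is mapped to a reference trajectory, screened by a potential-field risk score, and tracked by a low-level controller.

```python
import numpy as np

LANE_WIDTH = 3.5  # assumed lane width in metres

def high_level_policy(obs: dict) -> tuple[int, float]:
    """Long-timescale hybrid action: (discrete lane choice, continuous target speed)."""
    lane = 0 if obs["front_gap"] > 30.0 else 1            # toy rule standing in for a learned policy
    target_speed = min(obs["speed"] + 2.0, obs["speed_limit"])
    return lane, target_speed

def to_trajectory(lane: int, target_speed: float, n: int = 20) -> np.ndarray:
    """Map the hybrid action to an explicit short-horizon reference trajectory."""
    xs = np.linspace(0.0, 2.0 * target_speed, n)          # 2 s look-ahead
    ys = np.full(n, lane * LANE_WIDTH)
    return np.stack([xs, ys], axis=1)

def risk(traj: np.ndarray, obstacles: np.ndarray) -> float:
    """Artificial-potential-field style risk: peaks when the trajectory nears an obstacle."""
    d = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)
    return float(np.max(1.0 / (d ** 2 + 1e-6)))

def low_level_policy(obs: dict, traj: np.ndarray) -> tuple[float, float]:
    """Short-timescale control: steer toward the reference, adjust acceleration."""
    steering = 0.3 * np.arctan2(traj[1, 1], traj[1, 0])
    accel = np.clip(traj[-1, 0] / 2.0 - obs["speed"], -3.0, 2.0)
    return float(steering), float(accel)

def act(obs: dict, obstacles: np.ndarray, risk_threshold: float = 0.2):
    lane, v = high_level_policy(obs)
    traj = to_trajectory(lane, v)
    if risk(traj, obstacles) > risk_threshold:            # proactive correction of guidance
        traj = to_trajectory(0, min(v, 10.0))
    return low_level_policy(obs, traj)
```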
3. Unified Policy Optimization via Group Relative Methods
Unified RL training frameworks increasingly employ Group Relative Policy Optimization (GRPO) and its variants—such as UniGRPO—to optimize policies jointly across multiple tasks, modalities, or candidate outputs. In multimodal large diffusion models (Yang et al., 21 May 2025), UniGRPO:
- Unifies post-training for diffusion foundation models by sharing a policy-gradient scheme with diversified reward modeling across reasoning and generation tasks.
- Implements group-relative advantage estimation, $\hat{A}_i = \dfrac{r_i - \operatorname{mean}(\{r_j\}_{j=1}^{G})}{\operatorname{std}(\{r_j\}_{j=1}^{G})}$, with per-token importance weighting for stable learning.
- Leverages an iteratively varied masking strategy for answer tokens, maintaining full observation of question tokens to reflect real-world inference.
Objective functions use clipped policy gradients with KL regularization:

$$\mathcal{J}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\!\big(\rho_{i,t}\hat{A}_i,\ \operatorname{clip}(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

where $\rho_{i,t}$ is the per-token importance ratio between the current and old policies. By integrating chain-of-thought (CoT) fine-tuning prior to RL, the model is primed for complex tasks, facilitating cross-modal generalization and performance improvements over previous baselines.
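A hedged sketch of such an update step is shown below: it computes group-relative advantages and a clipped, KL-regularized surrogate in PyTorch, abstracting away the masking schedule, reward models, and diffusion-specific likelihood estimates used by UniGRPO. Tensor shapes, coefficients, and the simple KL estimate are assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (B, G) scalar rewards for G sampled completions per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """
    logp_new/logp_old/logp_ref: (B, G, T) per-token log-probs under the current,
    behaviour, and frozen reference policies; rewards: (B, G).
    """
    adv = group_relative_advantages(rewards).unsqueeze(-1)       # (B, G, 1)
    ratio = (logp_new - logp_old).exp()                          # per-token importance weights
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    surrogate = torch.minimum(unclipped, clipped).mean()
    kl = (logp_new - logp_ref).mean()                            # simplified KL estimate (assumption)
    return -(surrogate - kl_coef * kl)
```

Lowering `kl_coef` trades adherence to the reference policy against reward exploitation, mirroring the role of the KL term in the objective above.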
4. Unified RL Benchmarks and Environment Frameworks
Unified RL frameworks increasingly provide common environments, configuration schemas, and evaluation standards for fair comparison. Eden (Chen et al., 2021) introduces standardized subsystems (Action, Buff, Terrain, Item) and customizable configuration files, supporting modular extraction of task-relevant state and action spaces. The framework's evaluation metrics, notably TTMX (task time maximum), TTMN (task time minimum), and policy information capacity (PIC), allow environment-agnostic, algorithm-agnostic performance baselining. This unified design streamlines cross-category benchmarking, accelerates integration of hybrid RL heuristics, and underpins reproducible research.
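As a purely hypothetical illustration of this configuration-driven design (the schema and field names below are illustrative, not Eden's actual API), a task might be assembled from per-subsystem entries and flattened into a task-specific action space:

```python
# Hypothetical config in the spirit of Eden's modular subsystems.
task_config = {
    "action":  {"move": ["north", "south", "east", "west"], "interact": ["gather", "attack"]},
    "buff":    {"hunger_decay": 0.01, "heal_rate": 0.05},
    "terrain": {"size": [64, 64], "obstacle_density": 0.1},
    "item":    {"food": 20, "tools": 5},
    "metrics": ["TTMX", "TTMN", "PIC"],
}

def build_action_space(cfg: dict) -> list[str]:
    """Flatten the configured Action subsystem into a task-specific discrete action space."""
    return [a for group in cfg["action"].values() for a in group]

print(build_action_space(task_config))
# ['north', 'south', 'east', 'west', 'gather', 'attack']
```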
5. Equilibrium Selection and Unified Multi-Agent RL
Recent unified RL frameworks for multi-agent systems emphasize convergence to high-quality Nash equilibria or socially optimal solutions (Zhang et al., 13 Jun 2024). A modular actor–critic design separates Q-function estimation from action selection: the actor applies a general normal-form game learning rule to the instantaneous Q-values, enabling plug-in application of log-linear learning (for potential maximization) or mood-based rules (for Pareto optimality). The stochastic potential of each equilibrium determines its stability, and the limiting distribution favors equilibria that maximize total utility or the game's potential function.
Inductive proofs establish that these equilibrium selection mechanisms persist across Bellman recursion and multi-stage stochastic games, extending classical game-theoretic guarantees to dynamic RL environments.
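As an illustration of the plug-in learning-rule design, the sketch below applies a log-linear (Boltzmann) rule to the critic's instantaneous Q-values; the temperature value and the interface are assumptions rather than the paper's exact formulation.

```python
import numpy as np

def log_linear_rule(q_values: np.ndarray, temperature: float = 0.1) -> np.ndarray:
    """Return a distribution proportional to exp(Q / temperature)."""
    z = (q_values - q_values.max()) / temperature   # subtract max for numerical stability
    probs = np.exp(z)
    return probs / probs.sum()

def actor_step(q_values: np.ndarray, rng: np.random.Generator, temperature: float = 0.1) -> int:
    """Sample an action from the learning rule applied to the critic's Q estimates."""
    return int(rng.choice(len(q_values), p=log_linear_rule(q_values, temperature)))

rng = np.random.default_rng(0)
q = np.array([1.0, 1.2, 0.4])
print(actor_step(q, rng))   # low temperatures concentrate mass on the potential-maximizing action
```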
6. Unified Representation Learning and Robustness
Robust unified RL solutions leverage state representation learning and data augmentation to facilitate policy transfer and domain generalization. USRA (Hearn et al., 2022) disentangles domain-specific and domain-general latent features, uses cycle-consistency and Q-value auxiliary objectives, and applies random-convolution or color-jitter augmentations during both pretraining and fine-tuning. The training loss blends reconstruction, cycle-consistency, and Q-value consistency terms, e.g. $\mathcal{L} = \lambda_{\mathrm{rec}}\mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{cyc}}\mathcal{L}_{\mathrm{cyc}} + \lambda_{Q}\mathcal{L}_{Q}$. This approach yields up to 22.6% improved domain adaptation and enhanced sample efficiency, suggesting that its integration into UniGRPO-style frameworks can deliver robust cross-domain policy learning.
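A minimal sketch of such a blended objective is given below, assuming hypothetical encoder, decoder, and critic modules; the loss weights and interfaces are illustrative, not USRA's exact implementation.

```python
import torch
import torch.nn.functional as F

def blended_representation_loss(encoder, decoder, critic, obs, aug_obs, actions,
                                lambdas=(1.0, 1.0, 0.5)):
    """Weighted sum of reconstruction, cycle-consistency, and Q-value consistency terms."""
    lam_rec, lam_cyc, lam_q = lambdas
    z, z_aug = encoder(obs), encoder(aug_obs)

    # Reconstruction: the latent should retain enough information to rebuild the input.
    rec = F.mse_loss(decoder(z), obs)

    # Cycle consistency: augmented and clean views should map to the same latent.
    cyc = F.mse_loss(z_aug, z.detach())

    # Q-value consistency: the critic should agree across views of the same state.
    q_cons = F.mse_loss(critic(z_aug, actions), critic(z, actions).detach())

    return lam_rec * rec + lam_cyc * cyc + lam_q * q_cons
```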
7. Open-Source Libraries and Practical Implementations
Comprehensive RL libraries such as XuanCe (Liu et al., 2023) and safe-control-gym (Yuan et al., 2021) deliver unified interfaces and modular extensions. XuanCe abstracts DRL and MARL algorithms across PyTorch, TensorFlow, and MindSpore, supports over 40 algorithms, and provides unified modules for representation, policy, and learning. Safe-control-gym augments OpenAI Gym with symbolic dynamics, constraint handling, and disturbance injection, enabling side-by-side quantitative evaluation of learning-based and model-based controllers for robotics.
Open-source availability, extensive documentation, and modularity foster reproducibility and collaborative progress in unified RL systems across domains.
In summary, UniGRPO encapsulates frameworks and methodologies that unify RL policy structures, optimization objectives, evaluation metrics, and implementation pipelines, supporting both theoretical advances and deployment in complex, cross-modal, and multi-agent environments. Whether in universal Bayesian agents, multimodal policy optimization with GRPO, hierarchical driving control, or standardized benchmarks, the unification principle enables scalable, robust, and generalizable RL solutions.