Single-Policy REDI: Universal RL Approach

Updated 26 February 2026
  • Single-Policy REDI is a reinforcement learning framework defined by Realizability and Single-Policy Concentrability, enabling one policy to generalize across multiple domains.
  • It employs modular architectures such as CPG and PF layers in quadruped robots and a primal-dual algorithm in offline RL to ensure robust performance and zero-shot transfer.
  • The paradigm offers strong theoretical guarantees, including near-optimal sample complexity, while addressing issues of policy expressivity and data coverage in diverse control tasks.

Single-Policy REDI refers to reinforcement learning and control architectures where a single universal policy is optimized to perform robustly and efficiently across diverse domains or agents, under the Realizability and single-policy Concentrability (REDI) assumptions. This approach contrasts with traditional strategies that train separate policies per environment, morphology, or agent, or those requiring all-policy coverage or heavy domain randomization. Recent literature demonstrates the single-policy REDI paradigm's effectiveness in both continuous control for robot fleets and in offline RL for Markov Decision Processes (MDPs) and Markov games.

1. Conceptual Foundations: Realizability and Single-Policy Concentrability

Single-Policy REDI is grounded in two main assumptions:

  • Realizability: The value function and (where applicable) density-ratio function classes are assumed to contain the true optimal solution. In formal terms, there exist $v^* \in V$ and $w^* \in W$ satisfying the saddle-point equations of the (regularized) linear program underlying the RL problem.
  • Single-Policy Concentrability (SPC): Instead of requiring that the offline data distribution $d^D$ covers all possible policies (all-policy concentrability), only the optimal policy's occupancy ratio relative to $d^D$ is assumed bounded: $\|d^{\pi^*}/d^D\|_\infty \leq B_w$. This assumption is strictly weaker than all-policy concentrability, is more easily satisfied in practice, and ensures that learning and generalization focus on the optimal (or target) policy rather than all possible behaviors (Zhan et al., 2022).
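
To make the SPC condition concrete, the following minimal sketch computes the smallest constant $B_w$ satisfying $\|d^{\pi^*}/d^D\|_\infty \leq B_w$ for a tabular problem. The function name `occupancy_ratio_bound`, the dictionary representation of occupancies, and all numbers are illustrative assumptions, not taken from the cited paper.

```python
def occupancy_ratio_bound(d_pi_star, d_data):
    """Return max over (s, a) of d_pi_star / d_data, the smallest valid B_w.

    Both arguments map (state, action) pairs to occupancy probabilities.
    """
    bound = 0.0
    for key, p_star in d_pi_star.items():
        if p_star == 0.0:
            continue  # pairs the optimal policy never visits impose no constraint
        p_data = d_data.get(key, 0.0)
        if p_data == 0.0:
            return float("inf")  # the data misses a pair the optimal policy visits
        bound = max(bound, p_star / p_data)
    return bound


# Hypothetical 2-state, 2-action example (all numbers are made up).
d_pi_star = {("s0", "a0"): 0.5, ("s1", "a1"): 0.5}
d_data = {
    ("s0", "a0"): 0.25, ("s0", "a1"): 0.25,
    ("s1", "a0"): 0.25, ("s1", "a1"): 0.25,
}
B_w = occupancy_ratio_bound(d_pi_star, d_data)  # 0.5 / 0.25 = 2.0
```

In this example the dataset is uniform, so SPC holds with $B_w = 2$. If the dataset omitted any pair visited by the optimal policy, the bound would be infinite and SPC would fail, regardless of how well other policies are covered.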

These two foundations enable the derivation of sharp sample-complexity and generalization guarantees in both offline and online RL, under substantially milder conditions than prior frameworks.

2. Policy Architecture and Learning Frameworks

Recent work highlights two illustrative instantiations of Single-Policy REDI for very different RL settings:

A. Universal Locomotion Policy for Quadrupedal Robots

In "ManyQuadrupeds: Learning a Single Locomotion Policy for Diverse Quadruped Robots," Shafiee et al. introduce a bio-inspired, hierarchical control pipeline, featuring:

  • Supraspinal Drive: A multilayer perceptron (MLP) with layers 512, 256, 128, inputting body and CPG state, outputting per-leg amplitude and frequency.
  • Central Pattern Generator (CPG): Each of the four limbs uses an uncoupled nonlinear oscillator with amplitude $r_i$ and phase $\phi_i$, modulated by the MLP:

$$\dot r_i = \alpha(A_i - r_i), \quad \dot\phi_i = \omega_i$$

with $A_i$ and $\omega_i$ in clipped ranges.

  • Pattern Formation (PF) Layer: Transforms oscillator state to foot trajectories via robot-specific scaling parameters (stride, height), and applies inverse kinematics. PF parameters scale heuristically with robot size.

This design ensures a consistent observation–action space (task-space modulation) across 16 robot morphologies, relying on the PF layer alone for robot-dependent adaptation. No proprioceptive signals (e.g., joint angles/torques) enter the policy, enforcing strong morphology invariance (Shafiee et al., 2023).
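
The oscillator dynamics above can be sketched with an explicit Euler integrator. This is a minimal illustration only: the gain `alpha = 50.0`, the target amplitudes, the 2 Hz frequency, and the initial phases are hypothetical values, and the integration scheme is an assumption rather than the authors' implementation. The 1 ms step matches the 1 kHz CPG integration rate reported for training.

```python
import math


def step_cpg(r, phi, A, omega, alpha, dt):
    """One explicit-Euler step of r' = alpha * (A - r), phi' = omega per leg."""
    r_next = [ri + dt * alpha * (Ai - ri) for ri, Ai in zip(r, A)]
    # Wrap the phase into [0, 2*pi) to keep it bounded.
    phi_next = [(pi + dt * wi) % (2.0 * math.pi) for pi, wi in zip(phi, omega)]
    return r_next, phi_next


# Hypothetical settings for four legs (all values are assumptions).
alpha = 50.0                       # amplitude convergence gain
dt = 0.001                         # 1 kHz integration step
A = [1.0, 1.0, 1.5, 1.5]           # target amplitudes set by the policy
omega = [2.0 * math.pi * 2.0] * 4  # 2 Hz for every leg
r = [0.0, 0.0, 0.0, 0.0]
phi = [0.0, math.pi, math.pi, 0.0]  # trot-like initial phase offsets

for _ in range(1000):              # simulate one second
    r, phi = step_cpg(r, phi, A, omega, alpha, dt)
```

Because the oscillators are uncoupled, each amplitude relaxes exponentially to its own target $A_i$ while each phase advances at its own rate $\omega_i$; any inter-leg coordination has to come from the values the policy outputs.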

B. Offline RL under REDI: Primal-Dual Algorithmic Approach

For MDPs, "Offline Reinforcement Learning with Realizability and Single-Policy Concentrability" (PRO-RL) optimizes a regularized saddle-point objective with empirical losses:

  • Empirical RL objective:

$$\hat{L}_\alpha(v, w) = (1-\gamma)\bar{v}_{\mu_0} - \alpha \bar{f}(w) + \overline{w e_v}$$

where $e_v$ are Bellman errors, $f$ is a strongly convex regularizer, $v$ is the primal variable (value), $w$ the density ratio, and expectations are taken over the static dataset.
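
As a minimal sketch of how this objective might be evaluated on a tabular dataset, the function below averages each term over samples. It assumes the single-sample Bellman error $e_v = r + \gamma v(s') - v(s)$, a separate list of initial-state samples for the $\bar{v}_{\mu_0}$ term, and the quadratic regularizer $f(x) = (x-1)^2/2$; these choices, the function names, and the toy data are illustrative assumptions.

```python
def empirical_objective(v, w, f, transitions, init_states, gamma, alpha):
    """Evaluate (1 - gamma) * mean v(s0) - alpha * mean f(w) + mean w * e_v."""
    v_bar = sum(v[s0] for s0 in init_states) / len(init_states)
    f_bar = sum(f(w[(s, a)]) for s, a, _, _ in transitions) / len(transitions)
    # Single-sample Bellman error: r + gamma * v(s') - v(s) (assumed form).
    we_bar = sum(
        w[(s, a)] * (r + gamma * v[s_next] - v[s])
        for s, a, r, s_next in transitions
    ) / len(transitions)
    return (1.0 - gamma) * v_bar - alpha * f_bar + we_bar


def quadratic_regularizer(x):
    """A strongly convex f with minimum at x = 1 (hypothetical choice)."""
    return 0.5 * (x - 1.0) ** 2


# Toy data: s0 yields reward 1 and moves to an absorbing zero-reward state s1.
transitions = [("s0", "a0", 1.0, "s1"), ("s1", "a0", 0.0, "s1")]
init_states = ["s0"]
v = {"s0": 1.0, "s1": 0.0}                  # exact values for this toy chain
w = {("s0", "a0"): 1.0, ("s1", "a0"): 1.0}  # density ratio of 1 everywhere

L_hat = empirical_objective(
    v, w, quadratic_regularizer, transitions, init_states, gamma=0.5, alpha=0.1
)
```

With the exact value function the Bellman errors vanish, and with $w \equiv 1$ the regularizer vanishes, so the objective reduces to $(1-\gamma)\,v(s_0) = 0.5$.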

The algorithm alternates minimizing in $v$ and maximizing in $w$, extracting the final policy as a modification of the data-generating behavior weighted by the learned $w$:

$$\hat{\pi}(a|s) = \frac{\hat{w}(s,a)\,\pi_D(a|s)}{\sum_{a'} \hat{w}(s,a')\,\pi_D(a'|s)}$$

This approach achieves polynomial sample complexity under REDI, dispensing with the need for all-policy coverage or Bellman-completeness (Zhan et al., 2022).
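
The policy-extraction formula can be sketched for a single state as follows. The function name, the dictionary layout, and the uniform fallback when the normalizer is zero are assumptions of this sketch; the formula itself is undefined in that case.

```python
def extract_policy(w_hat, pi_D, actions, state):
    """Reweight the behavior policy pi_D by w_hat and normalize over actions."""
    weights = {a: w_hat[(state, a)] * pi_D[(state, a)] for a in actions}
    total = sum(weights.values())
    if total <= 0.0:
        # Normalizer is zero: fall back to a uniform policy (assumed convention).
        return {a: 1.0 / len(actions) for a in actions}
    return {a: wt / total for a, wt in weights.items()}


# Hypothetical values for one state with two actions.
actions = ["a0", "a1"]
pi_D = {("s", "a0"): 0.5, ("s", "a1"): 0.5}
w_hat = {("s", "a0"): 3.0, ("s", "a1"): 1.0}
pi_hat = extract_policy(w_hat, pi_D, actions, "s")
```

Here the learned ratio upweights `a0` threefold relative to `a1`, so the extracted policy places probability 0.75 on `a0` and 0.25 on `a1`, even though the behavior policy was uniform.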

3. Training, Reward Design, and Implementation Details

Universal Locomotion Policy

  • Training: Proximal Policy Optimization (PPO) in NVIDIA Isaac Gym with 16 parallel environments (1 per robot); policy evaluated and actuated at 100 Hz, CPG integrated at 1 kHz.
  • State: Body orientation (roll, pitch, yaw), linear/angular velocities, four foot contact flags, four foot positions, previous action, CPG states (amplitude/phase, 8D).
  • Action: 8-dimensional (per-leg amplitude and frequency).
  • Reward per timestep $t$:

$$R_t = 8.0\, \min(v_{x,t}, 1.5) - 0.25\, \|\boldsymbol{\theta}_{\text{base},t}\| - 10^{-5}\, \tau_t^\top \dot{q}_t$$

promoting forward velocity, body orientation stability, and energy efficiency.
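
A minimal sketch of this reward, assuming the orientation norm is Euclidean (the text does not specify the norm) and that torques and joint velocities are given as equal-length lists. The function name and input values are hypothetical.

```python
import math


def locomotion_reward(v_x, base_orientation, torques, joint_velocities):
    """R = 8.0 * min(v_x, 1.5) - 0.25 * ||theta|| - 1e-5 * tau . q_dot."""
    velocity_term = 8.0 * min(v_x, 1.5)  # forward speed, capped at 1.5
    # Euclidean norm of the base orientation (assumed choice of norm).
    orientation_term = 0.25 * math.sqrt(sum(th * th for th in base_orientation))
    # Mechanical power: inner product of joint torques and joint velocities.
    power_term = 1e-5 * sum(t * qd for t, qd in zip(torques, joint_velocities))
    return velocity_term - orientation_term - power_term


# Hypothetical sample: 12 joints, speed above the 1.5 cap.
reward = locomotion_reward(
    v_x=2.0,
    base_orientation=[0.3, 0.4, 0.0],
    torques=[10.0] * 12,
    joint_velocities=[2.0] * 12,
)
```

For these inputs the velocity term saturates at $8.0 \times 1.5 = 12.0$, the orientation penalty is $0.25 \times 0.5 = 0.125$, and the power penalty is $10^{-5} \times 240 = 0.0024$. The velocity cap means that exceeding 1.5 earns no extra reward while still costing energy.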

  • Domain Randomization: Not used; training relies solely on natural morphologic diversity and task-space modulation.

Offline RL Algorithm (PRO-RL)

  • Input: Finite dataset, value function class $V$, density-ratio class $W$, strongly convex regularizer $f$, regularization parameter $\alpha$.
  • Optimization: Empirical saddle-point solving, minimizing in $v$ and maximizing in $w$.
  • Policy Extraction: Given the learned $w$, extract a policy via normalized importance weighting over the data-generating behavior.

Both frameworks enforce a single invariant policy structure or exploratory strategy across all instances, with adaptation (if any) only permitted in non-centralized, modular components (e.g., pattern formation, post-processing layers).

4. Theoretical Guarantees and Sample Complexity

Offline RL and REDI

The main theorem for PRO-RL establishes that, under realizability and SPC alone, with strong convexity of $f$ and boundedness of $V, W$, one achieves:

$$J(d^*_\alpha) - J(\hat{\pi}) \leq \frac{4}{1-\gamma} \sqrt{\frac{E_{n_1, n_0, \alpha}}{\alpha M_f}}$$

with $E_{n_1, n_0, \alpha} = O\left((1-\gamma) B_v \sqrt{\frac{\ln |V|}{n_0}} + (\alpha B_f + B_w B_e) \sqrt{\frac{\ln(|V||W|)}{n_1}}\right)$. For sufficiently large $n_1, n_0 = \mathrm{poly}(1/\epsilon)$, $\hat{\pi}$ is $\epsilon$-optimal; only single-policy, not all-policy, concentrability is required (Zhan et al., 2022).
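
To see how the bound scales with data, the sketch below plugs numbers into the two expressions above. It assumes the constant hidden in the $O(\cdot)$ is 1 (exposed as the hypothetical parameter `c`), reads $n_0$ and $n_1$ as the sample counts in the two square-root terms, and uses made-up values for every bound and class size. The outputs are illustrative, not the theorem's actual constants.

```python
import math


def statistical_error(n0, n1, gamma, alpha, B_v, B_f, B_w, B_e,
                      size_V, size_W, c=1.0):
    """E = c * [(1-g) B_v sqrt(ln|V|/n0) + (a B_f + B_w B_e) sqrt(ln(|V||W|)/n1)]."""
    term0 = (1.0 - gamma) * B_v * math.sqrt(math.log(size_V) / n0)
    term1 = (alpha * B_f + B_w * B_e) * math.sqrt(math.log(size_V * size_W) / n1)
    return c * (term0 + term1)


def suboptimality_bound(E, gamma, alpha, M_f):
    """Right-hand side 4 / (1 - gamma) * sqrt(E / (alpha * M_f))."""
    return 4.0 / (1.0 - gamma) * math.sqrt(E / (alpha * M_f))


# Hypothetical problem constants (all values are assumptions).
params = dict(gamma=0.9, alpha=0.1, B_v=10.0, B_f=1.0, B_w=2.0, B_e=1.0,
              size_V=100, size_W=100)

E_small = statistical_error(n0=1000, n1=1000, **params)
E_large = statistical_error(n0=4000, n1=4000, **params)
bound_small = suboptimality_bound(E_small, gamma=0.9, alpha=0.1, M_f=1.0)
bound_large = suboptimality_bound(E_large, gamma=0.9, alpha=0.1, M_f=1.0)
```

Quadrupling both sample sizes halves $E$, but the suboptimality bound shrinks only by a factor of $\sqrt{2}$ because of the outer square root. Under this reading the bound decays as $n^{-1/4}$, which is consistent with the polynomial (rather than $1/\epsilon^2$) sample requirement stated above.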

Parallel Exploration with a Single Policy

In reward-free RL for linear MDPs, running $P$ parallel agents under a shared single policy in each episode yields an almost-linear speedup. The total number of samples required is

$$KP = \tilde{\Omega}(d^2 H^3 / \epsilon^2)$$

matching minimax lower bounds up to logarithmic factors. All convergence and optimism-in-the-face-of-uncertainty lemmas hold with data collected via a single policy per episode, provided all data is pooled (Cisneros-Velarde et al., 2022).
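
As a rough arithmetic illustration of the speedup, the sketch below treats $d^2 H^3 / \epsilon^2$ as the total sample budget and divides it among $P$ agents. It assumes $K$ is the number of episodes per agent, $d$ the feature dimension, and $H$ the horizon, and it ignores the logarithmic factors and constants hidden in $\tilde{\Omega}$ (the parameter `c` is a hypothetical stand-in). It gives an order-of-magnitude reading, not a guarantee.

```python
import math


def episodes_per_agent(d, H, epsilon, P, c=1.0):
    """Episodes K each of P agents runs so that K * P ~ c * d^2 H^3 / eps^2."""
    total = c * d ** 2 * H ** 3 / epsilon ** 2
    return math.ceil(total / P)


# Hypothetical problem size.
K_single = episodes_per_agent(d=10, H=5, epsilon=0.5, P=1)     # one agent
K_parallel = episodes_per_agent(d=10, H=5, epsilon=0.5, P=10)  # ten agents
```

With these numbers the total budget is $100 \times 125 / 0.25 = 50{,}000$ episodes, so ten agents sharing one policy each need about 5,000 episodes, a tenfold reduction in wall-clock episodes per agent.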

5. Empirical Results and Practical Implications

Universal Quadruped Control

  • Diversity: Demonstrated robust trotting across 16 quadruped robots (masses 2–200 kg, body heights 18–100 cm, various morphologies and DoF), with a single policy architecture (Shafiee et al., 2023).
  • Generalization: Withheld morphologies (HYQ, Dog3, B1) not used during training still yielded stable locomotion across those unseen structures, indicating substantial zero-shot transfer.
  • Sim-to-Real Transfer: Direct deployment to commercial robots (Unitree Go1, A1) achieved robust outdoor gaiting and unprecedented load-carrying, without any per-robot fine-tuning or retraining.
  • Efficiency: All robots trained simultaneously in under two hours on a single GPU, with network [512, 256, 128], and no domain randomization.

Parallel Single-Policy Exploration

Prior work on reward-free exploration in linear MDPs and Markov games confirms near-minimax optimality and almost-linear gains in sample complexity with a single-policy strategy, obviating the need for coordinated heterogeneous exploration (Cisneros-Velarde et al., 2022).

6. Limitations and Open Directions

  • Policy Expressivity: For locomotion, all PF parameters were hand-scaled per robot. No automated or learned mechanism for end-to-end PF adaptation was proposed (Shafiee et al., 2023).
  • Coverage: SPC does not guarantee coverage for arbitrary policies; efficacy is limited to domains (or datasets) where the optimal policy's occupancy is adequately represented.
  • Task Diversity: Locomotion policy currently supports only straight-line gaits. Extension to turning, omnidirectional, or terrain-adapted behaviors remains open.
  • Generalization: Zero-shot adaptation to classes outside the training regime (e.g., hexapods, extreme mass distributions) is untested in current studies.
  • Behavioral Inputs: Inductive biases introduced by acting solely in task-space may limit optimality for highly specialized morphologies or control tasks.

A plausible implication is that automating robot-specific PF parameter determination or further reducing hand-engineered adaptation steps could substantially broaden the Single-Policy REDI paradigm.


Key References:

  • "ManyQuadrupeds: Learning a Single Locomotion Policy for Diverse Quadruped Robots" (Shafiee et al., 2023)
  • "Offline Reinforcement Learning with Realizability and Single-policy Concentrability" (Zhan et al., 2022)
  • "One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning" (Cisneros-Velarde et al., 2022)
