Single-Policy REDI: Universal RL Approach
- Single-Policy REDI is a reinforcement learning framework defined by Realizability and Single-Policy Concentrability, enabling one policy to generalize across multiple domains.
- It employs modular architectures such as CPG and PF layers in quadruped robots and a primal-dual algorithm in offline RL to ensure robust performance and zero-shot transfer.
- The paradigm offers strong theoretical guarantees, including near-optimal sample complexity, while addressing issues of policy expressivity and data coverage in diverse control tasks.
Single-Policy REDI refers to reinforcement learning and control architectures where a single universal policy is optimized to perform robustly and efficiently across diverse domains or agents, under the Realizability and single-policy Concentrability (REDI) assumptions. This approach contrasts with traditional strategies that train separate policies per environment, morphology, or agent, or those requiring all-policy coverage or heavy domain randomization. Recent literature demonstrates the single-policy REDI paradigm's effectiveness in both continuous control for robot fleets and in offline RL for Markov Decision Processes (MDPs) and Markov games.
1. Conceptual Foundations: Realizability and Single-Policy Concentrability
Single-Policy REDI is grounded in two main assumptions:
- Realizability: The value function and (where applicable) density-ratio function classes are assumed to contain the true optimal solution. In formal terms, there exist $v^*_\alpha \in \mathcal{V}$ and $w^*_\alpha \in \mathcal{W}$ satisfying the saddle-point equations of the (regularized) linear program underlying the RL problem.
- Single-Policy Concentrability (SPC): Instead of requiring that the offline data distribution covers all possible policies (all-policy concentrability), only the optimal policy's occupancy ratio relative to the data distribution $d^D$ is assumed bounded: $\max_{s,a} d^{\pi^*}(s,a) / d^D(s,a) \le C$ for some finite constant $C$. This assumption is weaker than all-policy concentrability and more easily satisfied in practice, and it ensures that learning and generalization focus on the optimal (or target) policy rather than all possible behaviors (Zhan et al., 2022).
These two foundations enable the derivation of sharp sample-complexity and generalization guarantees in both offline and online RL, under substantially milder conditions than prior frameworks.
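To make the gap between single-policy and all-policy concentrability concrete, the following is a minimal tabular sketch. The occupancy vectors and the helper `concentrability` are hypothetical and chosen only for illustration; they are not taken from Zhan et al. (2022).

```python
import numpy as np

def concentrability(d_pi, d_data):
    """Largest ratio d_pi(s,a) / d_data(s,a) over the pairs that d_pi visits."""
    d_pi = np.asarray(d_pi, dtype=float)
    d_data = np.asarray(d_data, dtype=float)
    visited = d_pi > 0
    if np.any(d_data[visited] == 0):
        return np.inf  # the data never covers a pair that the policy visits
    return float(np.max(d_pi[visited] / d_data[visited]))

# Toy problem: 2 states x 2 actions, flattened to 4 state-action pairs.
d_data = np.array([0.45, 0.05, 0.50, 0.00])   # behavior data never tries the last pair
d_opt = np.array([0.60, 0.00, 0.40, 0.00])    # hypothetical optimal policy occupancy
d_other = np.array([0.00, 0.30, 0.00, 0.70])  # occupancy of some other policy

# Single-policy coefficient: only the optimal policy has to be covered.
spc = concentrability(d_opt, d_data)
# All-policy coefficient: every policy considered has to be covered.
all_policy = max(spc, concentrability(d_other, d_data))
```

In this toy dataset the single-policy coefficient is finite while the all-policy coefficient is infinite, because the data never visits a pair that the second policy relies on.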
2. Policy Architecture and Learning Frameworks
Recent work highlights two illustrative instantiations of Single-Policy REDI for very different RL settings:
A. Universal Locomotion Policy for Quadrupedal Robots
In "ManyQuadrupeds: Learning a Single Locomotion Policy for Diverse Quadruped Robots," Shafiee et al. introduce a bio-inspired, hierarchical control pipeline, featuring:
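- Supraspinal Drive: A multilayer perceptron (MLP) with hidden layers of 512, 256, and 128 units that takes body and CPG state as input and outputs per-leg amplitude and frequency.
- Central Pattern Generator (CPG): Each of the four limbs uses an uncoupled nonlinear oscillator with amplitude $r_i$ and phase $\theta_i$, modulated by the MLP:
$$\ddot{r}_i = a\left(\frac{a}{4}(\mu_i - r_i) - \dot{r}_i\right), \qquad \dot{\theta}_i = \omega_i,$$
with $\mu_i$ and $\omega_i$ in clipped ranges, where $a$ is a positive convergence gain.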
- Pattern Formation (PF) Layer: Transforms oscillator state to foot trajectories via robot-specific scaling parameters (stride, height), and applies inverse kinematics. PF parameters scale heuristically with robot size.
This design ensures a consistent observation–action space (task-space modulation) across 16 robot morphologies, relying on the PF layer alone for robot-dependent adaptation. No joint-level signals (e.g., joint angles or torques) enter the policy, which enforces strong morphology invariance (Shafiee et al., 2023).
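The CPG and PF stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes amplitude dynamics of the form $\ddot{r} = a(\frac{a}{4}(\mu - r) - \dot{r})$ with $\dot{\theta} = \omega$, as is common in CPG-based controllers, and the gain, clip ranges, foot-trajectory mapping, and scaling values are all hypothetical. The constant `mu` and `omega` stand in for the MLP output.

```python
import numpy as np

A_GAIN = 150.0                           # assumed convergence gain a
MU_RANGE = (1.0, 2.0)                    # assumed clip range for amplitude set-points
OMEGA_RANGE = (0.0, 2.0 * np.pi * 4.0)   # assumed clip range for frequencies (rad/s)

def cpg_step(r, r_dot, theta, mu, omega, dt=1e-3):
    """One semi-implicit Euler step of the uncoupled oscillators (one entry per leg)."""
    r_ddot = A_GAIN * (A_GAIN / 4.0 * (mu - r) - r_dot)
    r_dot = r_dot + dt * r_ddot
    r = r + dt * r_dot
    theta = np.mod(theta + dt * omega, 2.0 * np.pi)
    return r, r_dot, theta

def pattern_formation(r, theta, stride, height, clearance, penetration):
    """Hypothetical PF mapping from oscillator state to foot x/z targets per leg."""
    x = -stride * (r - 1.0) * np.cos(theta)
    # Use ground clearance during swing (sin > 0) and penetration during stance.
    lift = np.where(np.sin(theta) > 0.0, clearance, penetration)
    z = -height + lift * np.sin(theta)
    return x, z

r = np.ones(4)
r_dot = np.zeros(4)
theta = np.array([0.0, np.pi, np.pi, 0.0])   # trot-like initial phases

for _ in range(100):                          # 100 policy steps at 100 Hz = 1 s
    mu = np.clip(np.full(4, 1.5), *MU_RANGE)                    # stand-in for MLP output
    omega = np.clip(np.full(4, 2.0 * np.pi * 2.0), *OMEGA_RANGE)
    for _ in range(10):                       # CPG integrated at 1 kHz
        r, r_dot, theta = cpg_step(r, r_dot, theta, mu, omega)

# Robot-specific scaling enters only here (values are illustrative).
x, z = pattern_formation(r, theta, stride=0.1, height=0.3,
                         clearance=0.05, penetration=0.01)
```

Only the arguments of `pattern_formation` would change between robots in this sketch; the oscillator and the policy interface stay identical, which is the property the architecture relies on.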
B. Offline RL under REDI: Primal-Dual Algorithmic Approach
For MDPs, "Offline Reinforcement Learning with Realizability and Single-Policy Concentrability" (PRO-RL) optimizes a regularized saddle-point objective with empirical losses:
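- Empirical RL objective:
$$\hat{L}_\alpha(v, w) = (1-\gamma)\,\frac{1}{n_0}\sum_{j=1}^{n_0} v(s_{0,j}) + \frac{1}{n}\sum_{i=1}^{n}\Big[-\alpha f\big(w(s_i,a_i)\big) + w(s_i,a_i)\,e_v(s_i,a_i,r_i,s'_i)\Big],$$
where $e_v(s,a,r,s') = r + \gamma v(s') - v(s)$ are Bellman errors, $f$ is a strongly convex regularizer, $v$ is the primal variable (value), $w$ the density ratio, and the averages are taken over the static dataset ($n_0$ initial states and $n$ transitions).
The algorithm alternates minimizing in $v$ and maximizing in $w$, extracting the final policy as a modification of the data-generating behavior $\pi_D$ weighted by the learned $\hat{w}$:
$$\hat{\pi}(a \mid s) = \frac{\hat{w}(s,a)\,\pi_D(a \mid s)}{\sum_{a'} \hat{w}(s,a')\,\pi_D(a' \mid s)}.$$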
This approach achieves polynomial sample complexity under REDI, dispensing with the need for all-policy coverage or Bellman-completeness (Zhan et al., 2022).
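As an illustration of the objective, the sketch below evaluates an empirical regularized Lagrangian for tabular $v$ and $w$. It assumes the form $(1-\gamma)\,\text{mean}[v(s_0)] + \text{mean}[-\alpha f(w(s,a)) + w(s,a)(r + \gamma v(s') - v(s))]$; the regularizer `f_reg`, the toy dataset, and all function names are hypothetical choices rather than details from the paper.

```python
import numpy as np

def f_reg(w):
    """Example strongly convex regularizer f(w) = (w - 1)^2 / 2 (an assumption)."""
    return 0.5 * (w - 1.0) ** 2

def empirical_lagrangian(v, w, init_states, transitions, alpha, gamma):
    """Empirical regularized Lagrangian for tabular v (shape S) and w (shape S x A)."""
    s, a, rew, s_next = transitions
    bellman_err = rew + gamma * v[s_next] - v[s]   # sampled Bellman error e_v
    w_sa = w[s, a]                                 # density ratio at the data points
    initial_term = (1.0 - gamma) * np.mean(v[init_states])
    data_term = np.mean(-alpha * f_reg(w_sa) + w_sa * bellman_err)
    return initial_term + data_term

# Toy dataset: 2 states, 2 actions, 4 transitions (s, a, r, s').
transitions = (
    np.array([0, 0, 1, 1]),
    np.array([0, 1, 0, 1]),
    np.array([0.0, 1.0, 0.0, 1.0]),
    np.array([1, 0, 1, 0]),
)
init_states = np.array([0, 0, 1])

w_ones = np.ones((2, 2))
# With v = 0 and w = 1 the objective reduces to the average reward in the data.
value_zero_v = empirical_lagrangian(np.zeros(2), w_ones, init_states,
                                    transitions, alpha=0.1, gamma=0.9)
value_other_v = empirical_lagrangian(np.array([1.0, 2.0]), w_ones, init_states,
                                     transitions, alpha=0.1, gamma=0.9)
```

A solver would minimize this quantity over the value class and maximize it over the density-ratio class; the sketch only evaluates it, since the choice of optimizer is not specified here.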
3. Training, Reward Design, and Implementation Details
Universal Locomotion Policy
- Training: Proximal Policy Optimization (PPO) in NVIDIA Isaac Gym with 16 parallel environments (1 per robot); policy evaluated and actuated at 100 Hz, CPG integrated at 1 kHz.
- State: Body orientation (roll, pitch, yaw), linear/angular velocities, four foot contact flags, four foot positions, previous action, CPG states (amplitude/phase, 8D).
- Action: 8-dimensional (per-leg amplitude and frequency).
- Reward per timestep $t$: combines terms promoting forward velocity, body orientation stability, and energy efficiency.
- Domain Randomization: Not used; training relies solely on natural morphological diversity and task-space modulation.
Offline RL Algorithm (PRO-RL)
- Input: Finite dataset, value function class $\mathcal{V}$, density-ratio class $\mathcal{W}$, strongly convex regularizer $f$, regularization parameter $\alpha$.
- Optimization: Empirical saddle-point solving, minimizing in $v$ and maximizing in $w$.
- Policy Extraction: Given the learned $\hat{w}$, extract a policy via normalized importance weighting over the data-generating behavior.
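The policy-extraction step can be sketched for the tabular case as below. This assumes the behavior policy is known and that states where the learned ratio vanishes for every action fall back to a uniform policy; both are assumptions made for the example, and `extract_policy` is a hypothetical helper name.

```python
import numpy as np

def extract_policy(w_hat, pi_behavior):
    """Reweight the behavior policy by the learned density ratio, then normalize per state."""
    weighted = np.asarray(w_hat, dtype=float) * np.asarray(pi_behavior, dtype=float)
    norm = weighted.sum(axis=1, keepdims=True)
    uniform = np.full_like(weighted, 1.0 / weighted.shape[1])
    safe_norm = np.where(norm > 0, norm, 1.0)   # avoid dividing by zero
    return np.where(norm > 0, weighted / safe_norm, uniform)

# Toy example: 2 states, 2 actions.
w_hat = np.array([[2.0, 0.0],    # the ratio keeps only action 0 in state 0
                  [0.0, 0.0]])   # state 1 has zero weight, so fall back to uniform
pi_behavior = np.array([[0.5, 0.5],
                        [0.5, 0.5]])

pi_hat = extract_policy(w_hat, pi_behavior)
```

Actions the learned ratio down-weights to zero are removed from the extracted policy, so the result never leaves the support of the data-generating behavior in states that carry weight.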
Both frameworks enforce a single invariant policy structure or exploratory strategy across all instances, with adaptation (if any) only permitted in non-centralized, modular components (e.g., pattern formation, post-processing layers).
4. Theoretical Guarantees and Sample Complexity
Offline RL and REDI
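The main theorem for PRO-RL establishes that, under realizability and SPC alone, with strong convexity of $f$ and boundedness of the function classes, the suboptimality of the extracted policy $\hat{\pi}$ is bounded by a term that shrinks polynomially in the sample size $n$. For sufficiently large $n$, $\hat{\pi}$ is $\varepsilon$-optimal; only single-policy, not all-policy, concentrability is required (Zhan et al., 2022).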
Parallel Exploration with a Single Policy
In reward-free RL for linear MDPs, using parallel agents under a shared single policy in each episode yields an almost-linear speedup. The total number of samples required matches minimax lower bounds up to logarithmic factors. All convergence and optimism-in-the-face-of-uncertainty lemmas hold with data collected via a single policy per episode, provided all data is pooled (Cisneros-Velarde et al., 2022).
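The role of pooling can be illustrated with the elliptical exploration bonus that optimistic linear-MDP algorithms typically use. The sketch below is a generic illustration under that assumption, not the exact construction in Cisneros-Velarde et al. (2022); the feature dimension, agent count, `lam`, and `beta` are arbitrary.

```python
import numpy as np

def pooled_gram(features_per_agent, lam=1.0):
    """Regularized Gram matrix built from the features collected by all agents."""
    dim = features_per_agent[0].shape[1]
    gram = lam * np.eye(dim)
    for phi in features_per_agent:
        gram += phi.T @ phi
    return gram

def bonus(phi, gram, beta=1.0):
    """Elliptical uncertainty bonus beta * sqrt(phi^T Gram^{-1} phi)."""
    return float(beta * np.sqrt(phi @ np.linalg.solve(gram, phi)))

rng = np.random.default_rng(0)
n_agents, n_steps, dim = 8, 50, 4
# Each agent follows the same policy, so its features come from the same distribution.
features = [rng.normal(size=(n_steps, dim)) for _ in range(n_agents)]

query = np.ones(dim) / np.sqrt(dim)              # feature vector to score
bonus_single = bonus(query, pooled_gram(features[:1]))
bonus_pooled = bonus(query, pooled_gram(features))
```

Pooling adds a positive semidefinite term to the Gram matrix, so the pooled bonus can never exceed the single-agent one; this is the mechanism by which shared data shrinks uncertainty faster.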
5. Empirical Results and Practical Implications
Universal Quadruped Control
- Diversity: Demonstrated robust trotting across 16 quadruped robots (masses 2–200 kg, body heights 18–100 cm, various morphologies and DoF), with a single policy architecture (Shafiee et al., 2023).
- Generalization: Withheld morphologies (HYQ, Dog3, B1) not used during training still yielded stable locomotion across those unseen structures, indicating substantial zero-shot transfer.
- Sim-to-Real Transfer: Direct deployment to commercial robots (Unitree Go1, A1) achieved robust outdoor gaiting and unprecedented load-carrying, without any per-robot fine-tuning or retraining.
- Efficiency: All robots trained simultaneously in under two hours on a single GPU, with network [512, 256, 128], and no domain randomization.
Parallel Single-Policy Exploration
Prior work on reward-free exploration in linear MDPs and Markov games confirms near-minimax optimality and almost-linear gains in sample complexity with a single-policy strategy, obviating the need for coordinated heterogeneous exploration (Cisneros-Velarde et al., 2022).
6. Limitations and Open Directions
- Policy Expressivity: For locomotion, all PF parameters were hand-scaled per robot. No automated or learned mechanism for end-to-end PF adaptation was proposed (Shafiee et al., 2023).
- Coverage: SPC does not guarantee coverage for arbitrary policies; efficacy is limited to domains (or datasets) where the optimal policy's occupancy is adequately represented.
- Task Diversity: Locomotion policy currently supports only straight-line gaits. Extension to turning, omnidirectional, or terrain-adapted behaviors remains open.
- Generalization: Zero-shot adaptation to classes outside the training regime (e.g., hexapods, extreme mass distributions) is untested in current studies.
- Behavioral Inputs: Inductive biases introduced by acting solely in task-space may limit optimality for highly specialized morphologies or control tasks.
A plausible implication is that automating robot-specific PF parameter determination or further reducing hand-engineered adaptation steps could substantially broaden the Single-Policy REDI paradigm.
Key References:
- "ManyQuadrupeds: Learning a Single Locomotion Policy for Diverse Quadruped Robots" (Shafiee et al., 2023)
- "Offline Reinforcement Learning with Realizability and Single-policy Concentrability" (Zhan et al., 2022)
- "One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning" (Cisneros-Velarde et al., 2022)