
Projected Universal Policy (PUP)

Updated 28 November 2025
  • Projected Universal Policy (PUP) is a sim-to-real transfer framework that leverages two-stage system identification and a low-dimensional policy embedding to adapt control policies for dynamic robotic systems.
  • The framework integrates model parameter randomization, latent space projection, and Bayesian optimization to effectively bridge the gap between simulation and real hardware performance.
  • Experimental evaluations on the Darwin OP2 robot demonstrate that PUP outperforms nominal and robust baselines, achieving robust multi-gait biped locomotion with minimal real-world trials.

The Projected Universal Policy (PUP) is an algorithmic framework for sim-to-real transfer in dynamic robotic control, designed to address domain discrepancy between simulators and real hardware by leveraging two-stage system identification and a low-dimensional task-adaptive policy embedding. Developed in the context of bipedal locomotion for the Darwin OP2 robot, PUP integrates model parameter randomization, projection into a latent space, and Bayesian optimization to enable efficient policy adaptation and robust real-world deployment (Yu et al., 2019).

1. Mathematical Formulation

Let $\mu \in \mathbb{R}^n$ denote the vector of simulation model parameters, including friction, center-of-mass (COM), and actuator/PD control gains. The main components of PUP are:

  • Parameter Embedding: A projection network $f_\phi: \mathbb{R}^n \rightarrow \mathbb{R}^d$ maps $\mu$ to a low-dimensional latent variable $\eta$ ($d \ll n$, with $d=3$ in experiments).
  • Policy Network: $\pi_\theta(a|s, \eta)$ outputs control actions $a \in \mathbb{R}^m$ given system state $s$ and embedding $\eta$.
  • Uniform Sampling: Parameters are sampled from $\mathcal{U}(\mu_{lb}, \mu_{ub})$, the range identified in pre-sysID.
  • Training Objective: The expected simulated return,

$$J_{\text{sim}}(\theta, \phi) = \mathbb{E}_{\mu \sim \mathcal{U},\, s_0 \sim p_0} \left[ \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \right], \quad a_t \sim \pi_\theta(\cdot \mid s_t, f_\phi(\mu)), \quad s_{t+1} = T_\mu(s_t, a_t)$$

is maximized using PPO with the clipped surrogate loss $L^{\text{PPO}}(\theta, \phi)$ over the joint network $\{f_\phi, \pi_\theta\}$.
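The joint network is small enough to summarize in a few lines. Below is a minimal sketch in PyTorch following the layer sizes in Section 3.1; the class names, the exact state split (46 state features plus the 3-d embedding), and the per-joint logits for the discretized actions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ProjectionNet(nn.Module):
    """f_phi: maps model parameters mu in R^n to a latent eta in R^d."""
    def __init__(self, n_params=14, d_latent=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, d_latent),  # final linear layer, no activation
        )

    def forward(self, mu):
        return self.net(mu)

class PUPPolicy(nn.Module):
    """pi_theta: maps [state; eta] to per-joint action logits (11 bins/joint)."""
    def __init__(self, state_dim=46, d_latent=3, n_joints=20, n_bins=11):
        super().__init__()
        self.n_joints, self.n_bins = n_joints, n_bins
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + d_latent, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, n_joints * n_bins),
        )

    def forward(self, state, eta):
        logits = self.trunk(torch.cat([state, eta], dim=-1))
        return logits.view(*logits.shape[:-1], self.n_joints, self.n_bins)
```

Because $f_\phi$ and $\pi_\theta$ are composed differentiably, PPO gradients flow through both networks jointly during training.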

2. Two-Stage System Identification

2.1 Pre-sysID (Generic Data Collection)

Pre-sysID is performed by collecting hardware trajectories $D$ using generic joint exercises and standing/falling sequences. The loss function for parameter fitting is:

$$L(D, \mu) = \frac{1}{|D|} \sum_{(\bar{q}_t, \bar{g}_t) \in D} \| q_t(\mu) - \bar{q}_t \| + \frac{10}{|D_{s,f}|} \sum_{(\bar{q}_t, \bar{g}_t) \in D_{s,f}} \| g_t(\mu) - \bar{g}_t \| + \frac{20}{|D_s|} \sum_{(\bar{q}_t, \cdot) \in D_s} \| \Delta C^{\text{feet}}_t(\mu) \|$$

where $q_t(\mu)$ and $g_t(\mu)$ are the simulated joint positions and torso orientations, $D_{s,f}$ is the standing+falling subset, and $D_s$ is the standing subset.
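As a concrete reading of this loss, the sketch below evaluates its three terms given a hypothetical `simulate(mu, data)` helper that replays the recorded commands in simulation; the helper and the data layout are assumptions made for illustration.

```python
import numpy as np

def presysid_loss(mu, simulate, D, D_sf, D_s):
    """Pre-sysID loss L(D, mu) from Section 2.1.

    D, D_sf, D_s : lists of (q_bar_t, g_bar_t) frames (full data,
    standing+falling subset, standing subset).
    simulate(mu, data) is a hypothetical helper returning simulated joint
    positions q, torso orientations g, and foot displacements dC_feet.
    """
    q, _, _ = simulate(mu, D)
    _, g_sf, _ = simulate(mu, D_sf)
    _, _, dC_s = simulate(mu, D_s)

    joint_term = np.mean([np.linalg.norm(q[t] - q_bar)
                          for t, (q_bar, _) in enumerate(D)])
    orient_term = 10.0 * np.mean([np.linalg.norm(g_sf[t] - g_bar)
                                  for t, (_, g_bar) in enumerate(D_sf)])
    foot_term = 20.0 * np.mean([np.linalg.norm(dC) for dC in dC_s])
    return joint_term + orient_term + foot_term
```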

  • Nominal parameters $\hat{\mu}$ are found via

$$\hat{\mu} = \arg\min_{\mu} L(D, \mu)$$

  • $K$ subsets $\{D_i\}$ are sampled; for each,

$$\mu_i = \arg\min_{\mu} \left[ L(D_i, \mu) + w_{\text{reg}} \| \mu - \hat{\mu} \|^2 \right], \quad w_{\text{reg}} = 0.05$$

  • The parameter bounds are set per-dimension as:

$$\mu_{lb} = \min_i \mu_i - 0.1\,(\max_i \mu_i - \min_i \mu_i), \quad \mu_{ub} = \max_i \mu_i + 0.1\,(\max_i \mu_i - \min_i \mu_i)$$

CMA-ES is used to minimize $L(D, \mu)$, and the bounds $\mu_{lb}, \mu_{ub}$ characterize simulator uncertainty for domain randomization.
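The whole pre-sysID procedure thus reduces to one nominal fit plus $K$ regularized subset fits and a per-dimension bound computation. A sketch using the pycma package (an assumption; any CMA-ES implementation works):

```python
import numpy as np
import cma  # pycma (pip install cma), assumed here as the CMA-ES implementation

def fit_parameters(loss_fn, mu0, sigma0=0.1):
    """Minimize a pre-sysID loss with CMA-ES (ask/tell interface)."""
    es = cma.CMAEvolutionStrategy(list(mu0), sigma0)
    while not es.stop():
        candidates = es.ask()
        es.tell(candidates, [loss_fn(np.asarray(x)) for x in candidates])
    return np.asarray(es.result.xbest)

def parameter_bounds(subset_fits, margin=0.1):
    """Per-dimension bounds mu_lb, mu_ub from the K subset fits."""
    M = np.stack(subset_fits)              # shape (K, n)
    lo, hi = M.min(axis=0), M.max(axis=0)  # per-dimension min/max
    span = hi - lo
    return lo - margin * span, hi + margin * span

# Usage, with L, D, subsets as in the text (left commented, since they
# depend on the data-collection setup):
# mu_hat = fit_parameters(lambda m: L(D, m), mu0)
# fits = [fit_parameters(
#             lambda m, Di=Di: L(Di, m) + 0.05 * np.sum((m - mu_hat) ** 2),
#             mu_hat)
#         for Di in subsets]
# mu_lb, mu_ub = parameter_bounds(fits)
```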

2.2 Post-sysID (Task-Relevant Optimization)

Following PUP training, the latent embedding $\eta^*$ is optimized for real-hardware performance by maximizing the real-world return $J_{\text{real}}(\eta)$, typically distance walked or time to fall, by running $\pi_\theta(\cdot \mid s, \eta)$ directly on hardware.

  • A Gaussian process surrogate $GP(\eta) \approx J_{\text{real}}(\eta)$ is initialized with $N_0 = 5$ uniformly sampled trials, then refined over $T = 20$ Bayesian optimization steps (e.g., with Expected Improvement as the acquisition function). Each hardware trial runs the controller for a full episode.
  • The best observed $\eta^*$ is selected for deployment.

The total post-sysID adaptation is completed within 25 hardware trials (≤ 15 min).
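A compact sketch of this loop is shown below, assuming scikit-learn's Gaussian process and a random-candidate maximization of Expected Improvement; the latent search box `eta_lb, eta_ub` and the `run_trial` hardware interface are placeholders, not details from the paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best, xi=0.01):
    """EI acquisition for a maximization problem."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def post_sysid(run_trial, eta_lb, eta_ub, n_init=5, n_iter=20, seed=0):
    """Post-sysID: BO over the 3-d latent eta. run_trial(eta) runs one
    hardware episode and returns the real-world return J_real(eta)."""
    rng = np.random.default_rng(seed)
    d = len(eta_lb)
    X = rng.uniform(eta_lb, eta_ub, size=(n_init, d))  # N0 random trials
    y = np.array([run_trial(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(eta_lb, eta_ub, size=(1024, d))
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, run_trial(x_next))    # one hardware episode
    return X[np.argmax(y)]                     # best observed eta*
```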

3. Neural Architectures and Training

3.1 Architecture Details

| Component | Architecture | Dimensions |
| --- | --- | --- |
| Projection $f_\phi$ | 2-layer FC (64→32) with tanh, final linear layer to $\mathbb{R}^3$ | $n=14$, $d=3$ |
| Policy $\pi_\theta$ | 2-layer FC (256→256) with tanh; input $[s;\eta]$, output 20-d adjustments | $s \approx 49$ |
| State $s$ | Motor positions (20) and torso Euler angles (3) over 2 time steps | $\approx 49$ |
| Action $a$ | Joint target adjustments, 11 bins per joint | $20$ |

PPO hyperparameters used: learning rate $3\times10^{-4}$, discount $\gamma=0.99$, GAE $\lambda=0.95$, batch size 64 trajectories, horizon $T=1024$ steps. The reward combines forward velocity ($w_v=10.0$), torque regularization ($w_a=0.01$), a torque-velocity penalty ($w_w=0.005$), reference tracking ($w_t=0.2$), and a survival bonus ($E_c=+5$).

Simulation randomization includes control frequency in $[25, 33]$ Hz and sensor noise (position $\pm 0.01$ rad, orientation bias $\pm 0.3$ rad).
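For concreteness, a reward with these weights might be assembled as follows. The summary gives only the weights, so the functional form of each term (negated squared or absolute norms) is an assumption.

```python
import numpy as np

def locomotion_reward(v_forward, torque, joint_vel, q, q_ref, alive):
    """Reward sketch using the weights from Section 3.1; term shapes are
    assumed, not taken from the paper."""
    w_v, w_a, w_w, w_t, E_c = 10.0, 0.01, 0.005, 0.2, 5.0
    r = w_v * v_forward                             # forward progress
    r -= w_a * np.sum(np.square(torque))            # torque regularization
    r -= w_w * np.sum(np.abs(torque * joint_vel))   # torque-velocity penalty
    r -= w_t * np.sum(np.square(q - q_ref))         # reference tracking
    r += E_c if alive else 0.0                      # survival bonus
    return r
```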

3.2 Pipeline Pseudocode

  1. Pre-sysID:
    • Collect data $D$.
    • Find $\hat{\mu} = \arg\min_\mu L(D, \mu)$.
    • For each of $K$ subsets $D_i$, solve for $\mu_i$ with regularization.
    • Compute per-dimension bounds $\mu_{lb}, \mu_{ub}$.
  2. PUP Training:
    • Initialize $\theta, \phi$.
    • Repeat: sample $\mu \sim \mathcal{U}$, compute $\eta = f_\phi(\mu)$, run $\pi_\theta(\cdot \mid s, \eta)$ in simulation, optimize $L^{\text{PPO}}$ (see the sketch after this list).
  3. Post-sysID:
    • Initialize the GP with $5$ random $\eta$.
    • For $T=20$ iterations: propose a new $\eta$ via the acquisition function, run it on hardware, update the GP.
    • Deploy the best $\eta^*$.
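Step 2 condensed to code, with `make_sim`, `collect_rollout`, and `ppo_update` left as placeholders since their internals are standard PPO machinery:

```python
import torch

def train_pup(f_phi, pi_theta, make_sim, collect_rollout, ppo_update,
              mu_lb, mu_ub, n_updates=1000, horizon=1024):
    """PUP training loop (pipeline step 2). mu_lb, mu_ub are float tensors;
    make_sim(mu) builds a simulator at parameters mu, collect_rollout
    gathers an on-policy trajectory, and ppo_update applies the clipped
    PPO loss to theta and phi jointly. All three are placeholders."""
    opt = torch.optim.Adam(
        list(f_phi.parameters()) + list(pi_theta.parameters()), lr=3e-4)
    for _ in range(n_updates):
        # Domain randomization: mu ~ U(mu_lb, mu_ub) from pre-sysID
        mu = torch.rand_like(mu_lb) * (mu_ub - mu_lb) + mu_lb
        eta = f_phi(mu)                       # latent embedding
        env = make_sim(mu)                    # simulator at sampled params
        batch = collect_rollout(env, pi_theta, eta, horizon)
        ppo_update(opt, batch)                # maximize J_sim via PPO
```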

4. Experimental Evaluation on Darwin OP2

4.1 Gait Results

After post-sysID, stable forward, backward, and sideways walking was achieved on the Darwin OP2. With a trip wire placed at $0.8$ m:

| Gait | Mean Distance to Fall (m) | Elapsed Time (s) |
| --- | --- | --- |
| Forward | $0.82 \pm 0.10$ | $2.6 \pm 0.3$ |
| Backward | $0.76 \pm 0.12$ | $2.4 \pm 0.4$ |
| Sideways | $0.79 \pm 0.09$ | $2.5 \pm 0.3$ |

Walking speed was $\approx 0.3$ m/s.

4.2 Baseline Comparison

Three approaches were compared:

| Method | Distance Traveled (m), Mean $\pm$ Std (5 trials) |
| --- | --- |
| Nominal | $0.25 \pm 0.15$ |
| Robust | $0.48 \pm 0.08$ |
| PUP+BO | $0.80 \pm 0.10$ |

PUP with Bayesian optimization substantially outperformed both nominal and robust (domain randomization only) baselines.

4.3 Ablation: NN-PD Actuator Model

Pre-sysID loss $L(D, \mu)$ was compared for two actuator models:

  • PD Only: Fixed $k_p, k_d$.
  • NN-PD: $k_p, k_d$ predicted by a 1-layer NN with 5 units ($\phi \in \mathbb{R}^{27}$).

NN-PD achieved approximately $40\%$ lower $L$ than PD-only over 500 CMA-ES iterations. Hip-position step responses and torso pitch matched the real data more closely under NN-PD.
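A sketch of such an actuator model is below. The input features (position error and joint velocity) and the softplus used to keep the gains positive are assumptions; with the 2-d input shown, the model has exactly 27 trainable parameters, matching $\phi \in \mathbb{R}^{27}$ from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NNPDActuator(nn.Module):
    """NN-PD actuator sketch: one hidden layer of 5 units predicts
    per-step PD gains, which then produce the joint torque."""
    def __init__(self, in_dim=2):  # assumed inputs: (position error, velocity)
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 5), nn.Tanh())
        self.gains = nn.Linear(5, 2)  # predicts (k_p, k_d)

    def forward(self, q_err, q_dot):
        x = torch.stack([q_err, q_dot], dim=-1)
        # softplus keeps the predicted gains positive (an assumption)
        kp, kd = F.softplus(self.gains(self.hidden(x))).unbind(-1)
        return kp * q_err - kd * q_dot  # PD torque with learned gains
```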

4.4 Parameter Range Analysis

Some parameters (e.g., ankle damping $\sigma_{\text{ankle}}$) had tight pre-sysID bounds, indicating high confidence. Others (e.g., torque limit $\tilde{\tau}$, COM$_z$) exhibited wide bounds, reflecting greater simulation–real mismatch. The low-dimensional latent $\eta$ allows PUP to compensate for these ambiguities.

5. Key Principles and Significance

Projected Universal Policy enables robust sim-to-real transfer by decoupling the modeling and control challenges into:

  • Broad system identification (pre-sysID) to span plausible simulation variations.
  • Training a policy family sensitive to a low-dimensional projection of the high-dimensional model uncertainty.
  • Efficient real-world specialization (post-sysID) via Bayesian optimization in the latent space.

A plausible implication is that PUP serves as a general template for sim-to-real adaptation wherever model uncertainty can be characterized and projected, potentially benefiting domains beyond biped locomotion.

6. Conclusions

Projected Universal Policy combines principled domain randomization, differentiable model-to-embedding projection, and sample-efficient hardware adaptation, yielding robust and adaptive controllers on real robotic systems. On low-cost hardware, PUP enabled successful multi-gait biped locomotion using fewer than 25 real-world trials and demonstrated substantially improved performance over nominal or robust baselines (Yu et al., 2019).

References

  1. Yu, W., Kumar, V. C. V., Turk, G., & Liu, C. K. (2019). Sim-to-Real Transfer for Biped Locomotion. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). arXiv:1903.01390.