
Projected Universal Policy (PUP)

Updated 28 November 2025
  • Projected Universal Policy (PUP) is a sim-to-real transfer framework that leverages two-stage system identification and a low-dimensional policy embedding to adapt control policies for dynamic robotic systems.
  • The framework integrates model parameter randomization, latent space projection, and Bayesian optimization to effectively bridge the gap between simulation and real hardware performance.
  • Experimental evaluations on the Darwin OP2 robot demonstrate that PUP outperforms nominal and robust baselines, achieving robust multi-gait biped locomotion with minimal real-world trials.

The Projected Universal Policy (PUP) is an algorithmic framework for sim-to-real transfer in dynamic robotic control, designed to address domain discrepancy between simulators and real hardware by leveraging two-stage system identification and a low-dimensional task-adaptive policy embedding. Developed in the context of bipedal locomotion for the Darwin OP2 robot, PUP integrates model parameter randomization, projection into a latent space, and Bayesian optimization to enable efficient policy adaptation and robust real-world deployment (Yu et al., 2019).

1. Mathematical Formulation

Let $\mu \in \mathbb{R}^n$ denote the vector of simulation model parameters, including friction, center-of-mass (COM), and actuator/PD control gains. The main components of PUP are:

  • Parameter Embedding: A projection network $f_\phi: \mathbb{R}^n \rightarrow \mathbb{R}^d$ maps $\mu$ to a low-dimensional latent variable $\eta$ ($d \ll n$, with $d=3$ in experiments).
  • Policy Network: $\pi_\theta(a|s, \eta)$ outputs control actions $a \in \mathbb{R}^m$ given system state $s$ and embedding $\eta$.
  • Uniform Sampling: Parameters are sampled from $\mathcal{U}(\mu_{lb}, \mu_{ub})$, the range identified in pre-sysID.
  • Training Objective: The expected simulated return,

$$J_{\text{sim}}(\theta, \phi) = \mathbb{E}_{\mu \sim \mathcal{U},\, s_0 \sim p_0} \left[ \sum_{t=0}^{T} \gamma^t r(s_t, a_t) \right], \quad a_t \sim \pi_\theta(\cdot \mid s_t, f_\phi(\mu)), \quad s_{t+1} = T_\mu(s_t, a_t)$$

is maximized using PPO with the clipped surrogate loss $L^{\text{PPO}}(\theta, \phi)$ over the joint network $\{f_\phi, \pi_\theta\}$.
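The joint network is small enough to summarize in a few lines. Below is a minimal sketch in PyTorch following the layer sizes in Section 3.1; the class names, the exact state split (46 state features plus the 3-d embedding), and the per-joint logits for the discretized actions are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class ProjectionNet(nn.Module):
    """f_phi: maps model parameters mu in R^n to a latent eta in R^d."""
    def __init__(self, n_params=14, d_latent=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, d_latent),  # final linear layer, no activation
        )

    def forward(self, mu):
        return self.net(mu)

class PUPPolicy(nn.Module):
    """pi_theta: maps [state; eta] to per-joint action logits (11 bins/joint)."""
    def __init__(self, state_dim=46, d_latent=3, n_joints=20, n_bins=11):
        super().__init__()
        self.n_joints, self.n_bins = n_joints, n_bins
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + d_latent, 256), nn.Tanh(),
            nn.Linear(256, 256), nn.Tanh(),
            nn.Linear(256, n_joints * n_bins),
        )

    def forward(self, state, eta):
        logits = self.trunk(torch.cat([state, eta], dim=-1))
        return logits.view(*logits.shape[:-1], self.n_joints, self.n_bins)
```

Because $f_\phi$ and $\pi_\theta$ are composed differentiably, PPO gradients flow through both networks jointly during training.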

2. Two-Stage System Identification

2.1 Pre-sysID (Generic Data Collection)

Pre-sysID is performed by collecting hardware trajectories $D$ using generic joint exercises and standing/falling sequences. The loss function for parameter fitting is:

$$L(D, \mu) = \frac{1}{|D|} \sum_{(\bar{q}_t, \bar{g}_t) \in D} \| q_t(\mu) - \bar{q}_t \| + \frac{10}{|D_{s,f}|} \sum_{(\bar{q}_t, \bar{g}_t) \in D_{s,f}} \| g_t(\mu) - \bar{g}_t \| + \frac{20}{|D_s|} \sum_{(\bar{q}_t, \cdot) \in D_s} \| \Delta C^{\text{feet}}_t(\mu) \|$$

where $q_t(\mu)$ and $g_t(\mu)$ are the simulated joint positions and torso orientations, $D_{s,f}$ is the standing+falling subset, and $D_s$ is the standing subset.
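As a concrete reading of this loss, the sketch below evaluates its three terms given a hypothetical `simulate(mu, data)` helper that replays the recorded commands in simulation; the helper and the data layout are assumptions made for illustration.

```python
import numpy as np

def presysid_loss(mu, simulate, D, D_sf, D_s):
    """Pre-sysID loss L(D, mu) from Section 2.1.

    D, D_sf, D_s : lists of (q_bar_t, g_bar_t) frames (full data,
    standing+falling subset, standing subset).
    simulate(mu, data) is a hypothetical helper returning simulated joint
    positions q, torso orientations g, and foot displacements dC_feet.
    """
    q, _, _ = simulate(mu, D)
    _, g_sf, _ = simulate(mu, D_sf)
    _, _, dC_s = simulate(mu, D_s)

    joint_term = np.mean([np.linalg.norm(q[t] - q_bar)
                          for t, (q_bar, _) in enumerate(D)])
    orient_term = 10.0 * np.mean([np.linalg.norm(g_sf[t] - g_bar)
                                  for t, (_, g_bar) in enumerate(D_sf)])
    foot_term = 20.0 * np.mean([np.linalg.norm(dC) for dC in dC_s])
    return joint_term + orient_term + foot_term
```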

  • Nominal parameters $\hat{\mu}$ are found via

$$\hat{\mu} = \arg\min_{\mu} L(D, \mu)$$

  • $K$ subsets $\{D_i\}$ are sampled; for each,

$$\mu_i = \arg\min_{\mu} \left[ L(D_i, \mu) + w_{\text{reg}} \| \mu - \hat{\mu} \|^2 \right], \quad w_{\text{reg}} = 0.05$$

  • The parameter bounds are set per-dimension as:

$$\mu_{lb} = \min_i \mu_i - 0.1\,(\max_i \mu_i - \min_i \mu_i), \quad \mu_{ub} = \max_i \mu_i + 0.1\,(\max_i \mu_i - \min_i \mu_i)$$

CMA-ES is used to minimize $L(D, \mu)$, and the bounds $\mu_{lb}, \mu_{ub}$ characterize simulator uncertainty for domain randomization.
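The whole pre-sysID procedure thus reduces to one nominal fit plus $K$ regularized subset fits and a per-dimension bound computation. A sketch using the pycma package (an assumption; any CMA-ES implementation works):

```python
import numpy as np
import cma  # pycma (pip install cma), assumed here as the CMA-ES implementation

def fit_parameters(loss_fn, mu0, sigma0=0.1):
    """Minimize a pre-sysID loss with CMA-ES (ask/tell interface)."""
    es = cma.CMAEvolutionStrategy(list(mu0), sigma0)
    while not es.stop():
        candidates = es.ask()
        es.tell(candidates, [loss_fn(np.asarray(x)) for x in candidates])
    return np.asarray(es.result.xbest)

def parameter_bounds(subset_fits, margin=0.1):
    """Per-dimension bounds mu_lb, mu_ub from the K subset fits."""
    M = np.stack(subset_fits)              # shape (K, n)
    lo, hi = M.min(axis=0), M.max(axis=0)  # per-dimension min/max
    span = hi - lo
    return lo - margin * span, hi + margin * span

# Usage, with L, D, subsets as in the text (left commented, since they
# depend on the data-collection setup):
# mu_hat = fit_parameters(lambda m: L(D, m), mu0)
# fits = [fit_parameters(
#             lambda m, Di=Di: L(Di, m) + 0.05 * np.sum((m - mu_hat) ** 2),
#             mu_hat)
#         for Di in subsets]
# mu_lb, mu_ub = parameter_bounds(fits)
```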

2.2 Post-sysID (Task-Relevant Optimization)

Following PUP training, the latent embedding $\eta^*$ is optimized for real-hardware performance by maximizing the real-world return $J_{\text{real}}(\eta)$, typically distance walked or time to fall, by running $\pi_\theta(\cdot \mid s, \eta)$ directly on hardware.

  • A Gaussian process surrogate $GP(\eta) \approx J_{\text{real}}(\eta)$ is initialized with $N_0 = 5$ uniformly sampled trials, then refined over $T = 20$ Bayesian optimization steps (e.g., with Expected Improvement as the acquisition function). Each hardware trial runs the controller for a full episode.
  • The best observed $\eta^*$ is selected for deployment.

The total post-sysID adaptation is completed within 25 hardware trials (≤ 15 min).
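A compact sketch of this loop is shown below, assuming scikit-learn's Gaussian process and a random-candidate maximization of Expected Improvement; the latent search box `eta_lb, eta_ub` and the `run_trial` hardware interface are placeholders, not details from the paper.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(gp, X_cand, y_best, xi=0.01):
    """EI acquisition for a maximization problem."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def post_sysid(run_trial, eta_lb, eta_ub, n_init=5, n_iter=20, seed=0):
    """Post-sysID: BO over the 3-d latent eta. run_trial(eta) runs one
    hardware episode and returns the real-world return J_real(eta)."""
    rng = np.random.default_rng(seed)
    d = len(eta_lb)
    X = rng.uniform(eta_lb, eta_ub, size=(n_init, d))  # N0 random trials
    y = np.array([run_trial(x) for x in X])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(X, y)
        cand = rng.uniform(eta_lb, eta_ub, size=(1024, d))
        x_next = cand[np.argmax(expected_improvement(gp, cand, y.max()))]
        X = np.vstack([X, x_next])
        y = np.append(y, run_trial(x_next))    # one hardware episode
    return X[np.argmax(y)]                     # best observed eta*
```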

3. Neural Architectures and Training

3.1 Architecture Details

| Component | Architecture | Dimensions |
| --- | --- | --- |
| Projection $f_\phi$ | 2-layer FC (64→32) with tanh, final linear layer to $\mathbb{R}^3$ | $n=14$, $d=3$ |
| Policy $\pi_\theta$ | 2-layer FC (256→256) with tanh; input $[s;\eta]$, output 20-d adjustments | $s \approx 49$ |
| State $s$ | Motor positions (20) and torso Euler angles (3) over 2 time steps | $\approx 49$ |
| Action $a$ | Joint target adjustments, 11 bins per joint | $20$ |

PPO hyperparameters used: learning rate $3\times10^{-4}$, discount $\gamma=0.99$, GAE $\lambda=0.95$, batch size 64 trajectories, horizon $T=1024$ steps. The reward combines forward velocity ($w_v=10.0$), torque regularization ($w_a=0.01$), a torque-velocity penalty ($w_w=0.005$), reference tracking ($w_t=0.2$), and a survival bonus ($E_c=+5$).

Simulation randomization includes control frequency in $[25, 33]$ Hz and sensor noise (position $\pm 0.01$ rad, orientation bias $\pm 0.3$ rad).
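For concreteness, a reward with these weights might be assembled as follows. The summary gives only the weights, so the functional form of each term (negated squared or absolute norms) is an assumption.

```python
import numpy as np

def locomotion_reward(v_forward, torque, joint_vel, q, q_ref, alive):
    """Reward sketch using the weights from Section 3.1; term shapes are
    assumed, not taken from the paper."""
    w_v, w_a, w_w, w_t, E_c = 10.0, 0.01, 0.005, 0.2, 5.0
    r = w_v * v_forward                             # forward progress
    r -= w_a * np.sum(np.square(torque))            # torque regularization
    r -= w_w * np.sum(np.abs(torque * joint_vel))   # torque-velocity penalty
    r -= w_t * np.sum(np.square(q - q_ref))         # reference tracking
    r += E_c if alive else 0.0                      # survival bonus
    return r
```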

3.2 Pipeline Pseudocode

  1. Pre-sysID:
    • Collect data $D$.
    • Find $\hat{\mu} = \arg\min_\mu L(D, \mu)$.
    • For each of $K$ subsets $D_i$, solve for $\mu_i$ with regularization.
    • Compute per-dimension bounds $\mu_{lb}, \mu_{ub}$.
  2. PUP Training:
    • Initialize $\theta, \phi$.
    • Repeat: sample $\mu \sim \mathcal{U}$, compute $\eta = f_\phi(\mu)$, run $\pi_\theta(\cdot \mid s, \eta)$ in simulation, optimize $L^{\text{PPO}}$ (see the sketch after this list).
  3. Post-sysID:
    • Initialize the GP with $5$ random $\eta$.
    • For $T=20$ iterations: propose a new $\eta$ via the acquisition function, run it on hardware, update the GP.
    • Deploy the best $\eta^*$.
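Step 2 condensed to code, with `make_sim`, `collect_rollout`, and `ppo_update` left as placeholders since their internals are standard PPO machinery:

```python
import torch

def train_pup(f_phi, pi_theta, make_sim, collect_rollout, ppo_update,
              mu_lb, mu_ub, n_updates=1000, horizon=1024):
    """PUP training loop (pipeline step 2). mu_lb, mu_ub are float tensors;
    make_sim(mu) builds a simulator at parameters mu, collect_rollout
    gathers an on-policy trajectory, and ppo_update applies the clipped
    PPO loss to theta and phi jointly. All three are placeholders."""
    opt = torch.optim.Adam(
        list(f_phi.parameters()) + list(pi_theta.parameters()), lr=3e-4)
    for _ in range(n_updates):
        # Domain randomization: mu ~ U(mu_lb, mu_ub) from pre-sysID
        mu = torch.rand_like(mu_lb) * (mu_ub - mu_lb) + mu_lb
        eta = f_phi(mu)                       # latent embedding
        env = make_sim(mu)                    # simulator at sampled params
        batch = collect_rollout(env, pi_theta, eta, horizon)
        ppo_update(opt, batch)                # maximize J_sim via PPO
```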

4. Experimental Evaluation on Darwin OP2

4.1 Gait Results

After post-sysID, stable forward, backward, and sideways walking was achieved on the Darwin OP2. With a trip wire placed at $0.8$ m:

| Gait | Mean Distance to Fall (m) | Elapsed Time (s) |
| --- | --- | --- |
| Forward | $0.82 \pm 0.10$ | $2.6 \pm 0.3$ |
| Backward | $0.76 \pm 0.12$ | $2.4 \pm 0.4$ |
| Sideways | $0.79 \pm 0.09$ | $2.5 \pm 0.3$ |

Walking speed was $\approx 0.3$ m/s.

4.2 Baseline Comparison

Three approaches were compared:

| Method | Distance Traveled (m), Mean $\pm$ Std (5 trials) |
| --- | --- |
| Nominal | $0.25 \pm 0.15$ |
| Robust | $0.48 \pm 0.08$ |
| PUP+BO | $0.80 \pm 0.10$ |

PUP with Bayesian optimization substantially outperformed both nominal and robust (domain randomization only) baselines.

4.3 Ablation: NN-PD Actuator Model

Pre-sysID loss $L(D, \mu)$ was compared for two actuator models:

  • PD Only: Fixed $k_p, k_d$.
  • NN-PD: $k_p, k_d$ predicted by a 1-layer NN with 5 units ($\phi \in \mathbb{R}^{27}$).

NN-PD achieved approximately $40\%$ lower $L$ than PD-only over 500 CMA-ES iterations. Hip-position step responses and torso pitch matched the real data more closely under NN-PD.
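A sketch of such an actuator model is below. The input features (position error and joint velocity) and the softplus used to keep the gains positive are assumptions; with the 2-d input shown, the model has exactly 27 trainable parameters, matching $\phi \in \mathbb{R}^{27}$ from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NNPDActuator(nn.Module):
    """NN-PD actuator sketch: one hidden layer of 5 units predicts
    per-step PD gains, which then produce the joint torque."""
    def __init__(self, in_dim=2):  # assumed inputs: (position error, velocity)
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 5), nn.Tanh())
        self.gains = nn.Linear(5, 2)  # predicts (k_p, k_d)

    def forward(self, q_err, q_dot):
        x = torch.stack([q_err, q_dot], dim=-1)
        # softplus keeps the predicted gains positive (an assumption)
        kp, kd = F.softplus(self.gains(self.hidden(x))).unbind(-1)
        return kp * q_err - kd * q_dot  # PD torque with learned gains
```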

4.4 Parameter Range Analysis

Some parameters (e.g., ankle damping $\sigma_{\text{ankle}}$) had tight pre-sysID bounds, indicating high confidence. Others (e.g., torque limit $\tilde{\tau}$, COM$_z$) exhibited wide bounds, reflecting greater simulation–real mismatch. The low-dimensional latent $\eta$ allows PUP to compensate for these ambiguities.

5. Key Principles and Significance

Projected Universal Policy enables robust sim-to-real transfer by decoupling the modeling and control challenges into:

  • Broad system identification (pre-sysID) to span plausible simulation variations.
  • Training a policy family sensitive to a low-dimensional projection of the high-dimensional model uncertainty.
  • Efficient real-world specialization (post-sysID) via Bayesian optimization in the latent space.

A plausible implication is that PUP serves as a general template for sim-to-real adaptation wherever model uncertainty can be characterized and projected, potentially benefiting domains beyond biped locomotion.

6. Conclusions

Projected Universal Policy combines principled domain randomization, differentiable model-to-embedding projection, and sample-efficient hardware adaptation, yielding robust and adaptive controllers on real robotic systems. On low-cost hardware, PUP enabled successful multi-gait biped locomotion using fewer than 25 real-world trials and demonstrated substantially improved performance over nominal or robust baselines (Yu et al., 2019).

References

  1. Yu, W., Kumar, V. C. V., Turk, G., & Liu, C. K. (2019). Sim-to-Real Transfer for Biped Locomotion. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). arXiv:1903.01390.