Projected Universal Policy (PUP)
- Projected Universal Policy (PUP) is a sim-to-real transfer framework that leverages two-stage system identification and a low-dimensional policy embedding to adapt control policies for dynamic robotic systems.
- The framework integrates model parameter randomization, latent space projection, and Bayesian optimization to effectively bridge the gap between simulation and real hardware performance.
- Experimental evaluations on the Darwin OP2 robot demonstrate that PUP outperforms nominal and robust baselines, achieving robust multi-gait biped locomotion with minimal real-world trials.
The Projected Universal Policy (PUP) is an algorithmic framework for sim-to-real transfer in dynamic robotic control, designed to address domain discrepancy between simulators and real hardware by leveraging two-stage system identification and a low-dimensional task-adaptive policy embedding. Developed in the context of bipedal locomotion for the Darwin OP2 robot, PUP integrates model parameter randomization, projection into a latent space, and Bayesian optimization to enable efficient policy adaptation and robust real-world deployment (Yu et al., 2019).
1. Mathematical Formulation
Let $\mu$ denote the vector of simulation model parameters, including friction, center-of-mass (COM) location, and actuator/PD control gains. The main components of PUP are:
- Parameter Embedding: A projection network maps the model parameters $\mu$ to a low-dimensional latent embedding $\eta$.
- Policy Network: The policy $\pi(a \mid s, \eta)$ outputs control actions $a$ given the system state $s$ and the embedding $\eta$.
- Uniform Sampling: During training, parameters $\mu$ are sampled uniformly from $[\mu_{lb}, \mu_{ub}]$, the range identified in pre-sysID.
- Training Objective: The expected simulated return over the sampled parameters is maximized using PPO with the clipped surrogate loss, with the projection and policy networks trained jointly.
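Written out, the objective takes the familiar expected-return form. The notation here is an assumption introduced for illustration rather than taken from the source: $\phi$ and $\theta$ for the projection and policy weights, $f_\phi$ for the projection network, $\gamma$ for the discount factor, $r$ for the per-step reward, and $T$ for the episode horizon.

$$
J(\theta, \phi) = \mathbb{E}_{\mu \sim \mathcal{U}(\mu_{lb},\, \mu_{ub})}\;
\mathbb{E}_{\tau \sim \pi_\theta(\cdot \mid s,\, \eta),\ \eta = f_\phi(\mu)}
\left[ \sum_{t=0}^{T} \gamma^{t}\, r(s_t, a_t) \right]
$$

Because $\eta$ is computed from $\mu$ inside the expectation, gradients of the PPO surrogate flow into both $\theta$ and $\phi$, which is what training the two networks jointly amounts to.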
2. Two-Stage System Identification
2.1 Pre-sysID (Generic Data Collection)
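Pre-sysID collects hardware trajectories from generic joint exercises and standing/falling sequences. The parameter-fitting loss measures the mismatch between the recorded trajectories and their simulated counterparts, namely simulated joint positions and torso orientations, with dedicated terms for the standing+falling subset and for the standing subset of the data.
- Nominal parameters are found by minimizing this loss over the collected data.
- Several data subsets are sampled; for each, the loss is minimized again, with regularization, to obtain a subset-specific parameter vector.
- The parameter bounds $[\mu_{lb}, \mu_{ub}]$ are then set per dimension from the fitted parameter vectors.
CMA-ES is used to minimize the loss, and the resulting bounds characterize simulator uncertainty for domain randomization.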
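The bound computation can be sketched as follows. This assumes the bounds are the per-dimension minimum and maximum over the subset-fitted parameter vectors, which is one natural reading rather than a rule confirmed by the text. The `margin` argument and the toy parameter values are hypothetical.

```python
def parameter_bounds(fitted_params, margin=0.0):
    """Per-dimension [lower, upper] bounds spanning a set of fitted parameter vectors.

    fitted_params: equal-length parameter vectors, one per data subset
                   (the nominal fit can be included as well).
    margin: hypothetical fractional widening of each range (0.0 = plain min/max).
    """
    if not fitted_params:
        raise ValueError("need at least one fitted parameter vector")
    dim = len(fitted_params[0])
    lower, upper = [], []
    for d in range(dim):
        values = [p[d] for p in fitted_params]
        lo, hi = min(values), max(values)
        pad = margin * (hi - lo)
        lower.append(lo - pad)
        upper.append(hi + pad)
    return lower, upper


# Toy example: three subset fits of (friction, COM offset, P gain); values are made up.
fits = [
    [0.8, 0.010, 9.0],
    [1.0, 0.012, 11.0],
    [0.9, 0.008, 10.0],
]
mu_lb, mu_ub = parameter_bounds(fits)
print(mu_lb)  # [0.8, 0.008, 9.0]
print(mu_ub)  # [1.0, 0.012, 11.0]
```

A dimension whose fits agree closely gets a narrow range, while a dimension the data cannot pin down gets a wide one, so the randomization range tracks the identification uncertainty.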
2.2 Post-sysID (Task-Relevant Optimization)
Following PUP training, the latent embedding $\eta$ is optimized for real-hardware performance: the trained policy is run directly on hardware, and $\eta$ is chosen to maximize the real-world return, typically distance walked or time to fall.
- A Gaussian process surrogate of the real-world return is initialized from uniformly sampled trials, then refined over successive Bayesian optimization steps driven by an acquisition function (e.g., Expected Improvement). Each hardware trial runs the controller for a full episode.
- The best observed embedding $\eta$ is selected for deployment.
The total post-sysID adaptation is completed within 25 hardware trials (≤ 15 min).
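A minimal sketch of such a loop is given here, using a small hand-written Gaussian process and Expected Improvement over random candidates. Several things are assumptions made for illustration: the latent range $[-1, 1]$, the 3-dimensional latent, the RBF kernel and its length scale, the split of 5 initial plus 20 guided trials, and `toy_return`, which stands in for a real hardware episode.

```python
import math
import random


def rbf(a, b, length=0.5):
    """Squared-exponential kernel between two latent vectors."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-0.5 * d2 / length ** 2)


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def fit_gp(X, y, noise=1e-4):
    """Factor the kernel matrix once per iteration; constant mean = average return."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(X)]
         for i, a in enumerate(X)]
    L = cholesky(K)
    mean = sum(y) / len(y)
    alpha = solve_upper(L, solve_lower(L, [v - mean for v in y]))
    return L, alpha, mean


def predict(gp, X, x_new):
    """Posterior mean and standard deviation at x_new."""
    L, alpha, mean = gp
    k = [rbf(a, x_new) for a in X]
    mu = mean + sum(ki * ai for ki, ai in zip(k, alpha))
    v = solve_lower(L, k)
    var = max(1.0 - sum(vi * vi for vi in v), 1e-12)
    return mu, math.sqrt(var)


def expected_improvement(mu, sigma, best):
    """EI for maximization."""
    z = (mu - best) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return (mu - best) * cdf + sigma * pdf


def bayes_opt(evaluate, dim, n_init, n_iter, rng, n_candidates=100):
    """Random initial trials, then EI-guided trials; returns the best observation."""
    X = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_init)]
    y = [evaluate(x) for x in X]
    for _ in range(n_iter):
        gp, best = fit_gp(X, y), max(y)
        candidates = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                      for _ in range(n_candidates)]
        x_next = max(candidates,
                     key=lambda c: expected_improvement(*predict(gp, X, c), best))
        X.append(x_next)
        y.append(evaluate(x_next))  # one "hardware" episode per trial
    i = max(range(len(y)), key=y.__getitem__)
    return X[i], y[i], len(y)


def toy_return(eta):
    """Hypothetical stand-in for a hardware trial (e.g., distance walked)."""
    return -sum((e - 0.3) ** 2 for e in eta)


best_eta, best_return, n_trials = bayes_opt(
    toy_return, dim=3, n_init=5, n_iter=20, rng=random.Random(0))
print(n_trials)  # 25
```

Scoring a fixed pool of random candidates is a simplification; a production setup would more likely optimize the acquisition function directly and tune the kernel hyperparameters.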
3. Neural Architectures and Training
3.1 Architecture Details
| Component | Architecture | Dimensions |
|---|---|---|
| Projection | 2-layer FC (64→32) with tanh, final linear layer to the latent embedding $\eta$ | |
| Policy | 2-layer FC (256→256) with tanh; input: state and $\eta$; output: 20-d adjustments | |
| State | Motor positions (20) and torso Euler angles (3), over 2 time-steps | 49 |
| Action | Joint target adjustments, 11 bins per joint | 20 |
PPO training uses a batch size of 64 trajectories, together with the usual learning-rate, discount, GAE, and horizon settings. The reward combines forward velocity, torque regularization, a torque-velocity term, reference tracking, and a survival bonus, each with its own weight.
Simulation randomization also covers the control frequency (in Hz) and sensor noise: joint-position noise and an orientation bias, both in radians.
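To make the layer sizes in the table concrete, here is a small forward-pass sketch in plain Python. The layer widths, state size, joint count, and bin count follow the table. `PARAM_DIM`, `LATENT_DIM`, `MAX_ADJUST`, the random initialization, and the greedy bin selection are all assumptions made for illustration (a stochastic policy would sample a bin rather than take the argmax).

```python
import math
import random

PARAM_DIM = 12     # hypothetical number of simulation parameters
LATENT_DIM = 3     # hypothetical latent size
STATE_DIM = 49     # state dimension from the table
N_JOINTS, N_BINS = 20, 11
MAX_ADJUST = 0.1   # hypothetical adjustment range, in radians


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


def dense(x, layer, tanh=False):
    W, b = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [math.tanh(v) for v in out] if tanh else out


def mlp(x, layers):
    """tanh on the hidden layers, linear final layer."""
    for layer in layers[:-1]:
        x = dense(x, layer, tanh=True)
    return dense(x, layers[-1])


rng = random.Random(0)
projection = [init_layer(rng, PARAM_DIM, 64),
              init_layer(rng, 64, 32),
              init_layer(rng, 32, LATENT_DIM)]
policy = [init_layer(rng, STATE_DIM + LATENT_DIM, 256),
          init_layer(rng, 256, 256),
          init_layer(rng, 256, N_JOINTS * N_BINS)]


def act(state, mu):
    """Project mu to eta, then pick one of 11 bins per joint from the policy logits."""
    eta = mlp(mu, projection)
    logits = mlp(state + eta, policy)  # list concatenation: [state, eta]
    bins = []
    for j in range(N_JOINTS):
        joint_logits = logits[j * N_BINS:(j + 1) * N_BINS]
        bins.append(max(range(N_BINS), key=joint_logits.__getitem__))
    # Map bin index 0..10 linearly onto [-MAX_ADJUST, +MAX_ADJUST].
    adjustments = [MAX_ADJUST * (2 * b / (N_BINS - 1) - 1) for b in bins]
    return eta, bins, adjustments


eta, bins, adjustments = act([0.0] * STATE_DIM, [0.5] * PARAM_DIM)
print(len(eta), len(bins), len(adjustments))  # 3 20 20
```

At deployment the projection network is no longer needed: post-sysID searches over $\eta$ directly, so only the policy half of this sketch runs on the robot.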
3.2 Pipeline Pseudocode
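- Pre-sysID:
  - Collect hardware data.
  - Find the nominal parameters by minimizing the pre-sysID loss.
  - For each sampled data subset, solve for a parameter vector with regularization.
  - Compute per-dimension bounds $[\mu_{lb}, \mu_{ub}]$.
- PUP Training:
  - Initialize the projection and policy networks.
  - Repeat: sample $\mu$ uniformly from the bounds, compute $\eta$ with the projection network, run the policy in simulation, and update both networks with PPO.
- Post-sysID:
  - Initialize the GP with 5 random $\eta$.
  - For each Bayesian optimization iteration: propose a new $\eta$ via the acquisition function, run it on hardware, and update the GP.
  - Deploy the best $\eta$.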
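The data-collection half of the training stage can be sketched as follows. The `project` and `rollout` callables are hypothetical placeholders for the projection network and the simulator, the toy bounds are made up, and the PPO update that would consume each batch is omitted.

```python
import random


def sample_uniform(lower, upper, rng):
    """Draw one parameter vector uniformly within per-dimension bounds."""
    return [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]


def collect_batch(mu_lb, mu_ub, project, rollout, n_episodes, rng):
    """One batch of domain-randomized episodes, each with its own mu and eta."""
    batch = []
    for _ in range(n_episodes):
        mu = sample_uniform(mu_lb, mu_ub, rng)  # a new simulated "robot" per episode
        eta = project(mu)                       # embedding the policy conditions on
        trajectory = rollout(mu, eta)           # simulate with parameters mu
        batch.append({"mu": mu, "eta": eta, "trajectory": trajectory})
    return batch


# Toy stand-ins so the sketch runs end to end.
def toy_project(mu):
    return [sum(mu) / len(mu)]


def toy_rollout(mu, eta):
    return [1.0] * 10  # placeholder per-step rewards


toy_lb, toy_ub = [0.8, 0.008, 9.0], [1.0, 0.012, 11.0]
batch = collect_batch(toy_lb, toy_ub, toy_project, toy_rollout,
                      n_episodes=64, rng=random.Random(0))
print(len(batch))  # 64
```

Resampling $\mu$ for every episode is what exposes the policy to the whole identified range, so that the embedding $\eta$ carries usable information about the dynamics.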
4. Experimental Evaluation on Darwin OP2
4.1 Gait Results
After post-sysID, stable forward, backward, and sideways walking were achieved on the Darwin OP2. For a trip-wire at $0.8$ m:
| Gait | Mean Distance to Fall (m) | Elapsed Time (s) |
|---|---|---|
| Forward | | |
| Backward | | |
| Sideways | | |
Walking speed was m/s.
4.2 Baseline Comparison
Three approaches were compared:
| Method | Distance Traveled (m), mean ± std over 5 trials |
|---|---|
| Nominal | |
| Robust | |
| PUP+BO | |
PUP with Bayesian optimization substantially outperformed both nominal and robust (domain randomization only) baselines.
4.3 Ablation: NN-PD Actuator Model
Pre-sysID loss was compared for two actuator models:
- PD Only: fixed proportional and derivative gains.
- NN-PD: gains predicted by a 1-layer NN with 5 units.
NN-PD reached a lower pre-sysID loss than PD only over 500 CMA-ES iterations. Hip position step responses and torso pitch were also closer to real data for NN-PD.
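A sketch of what an NN-PD actuator might look like is given here. The network input (the position error), the tanh hidden activation, the softplus used to keep gains positive, and all weight values are assumptions for illustration; the text only fixes the single layer of 5 units.

```python
import math


def softplus(x):
    """Smooth positive mapping, used here to keep predicted gains above zero."""
    return math.log1p(math.exp(x))


def nn_pd_torque(q_target, q, q_dot, params):
    """PD torque whose gains come from a 5-unit hidden layer fed the position error."""
    error = q_target - q
    hidden = [math.tanh(w * error + b) for w, b in zip(params["w1"], params["b1"])]
    kp = softplus(sum(w * h for w, h in zip(params["w_kp"], hidden)) + params["b_kp"])
    kd = softplus(sum(w * h for w, h in zip(params["w_kd"], hidden)) + params["b_kd"])
    torque = kp * error - kd * q_dot
    return torque, kp, kd


# Made-up weights; in a sysID setting these would be fitted along with the
# other simulation parameters.
params = {
    "w1": [0.5, -0.5, 1.0, -1.0, 0.2],
    "b1": [0.0] * 5,
    "w_kp": [0.1] * 5,
    "b_kp": 2.0,
    "w_kd": [0.0] * 5,
    "b_kd": -1.0,
}
torque, kp, kd = nn_pd_torque(q_target=0.1, q=0.0, q_dot=0.0, params=params)
```

With all hidden-to-gain weights set to zero this reduces to a fixed-gain PD controller, so the PD-only model is a special case of the sketch.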
4.4 Parameter Range Analysis
Some parameters (e.g., ankle damping) had tight pre-sysID bounds, indicating high confidence in the identified values. Others (e.g., torque limit and COM) exhibited wide bounds, reflecting greater sim-to-real mismatch. The low-dimensional latent embedding $\eta$ allows PUP to compensate for these ambiguities.
5. Key Principles and Significance
Projected Universal Policy enables robust sim-to-real transfer by decoupling the modeling and control challenges into:
- Broad system identification (pre-sysID) to span plausible simulation variations.
- Training a policy family sensitive to a low-dimensional projection of the high-dimensional model uncertainty.
- Efficient real-world specialization (post-sysID) via Bayesian optimization in the latent space.
A plausible implication is that PUP serves as a general template for sim-to-real adaptation wherever model uncertainty can be characterized and projected, potentially benefiting domains beyond biped locomotion.
6. Conclusions
Projected Universal Policy combines principled domain randomization, differentiable model-to-embedding projection, and sample-efficient hardware adaptation, yielding robust and adaptive controllers on real robotic systems. On low-cost hardware, PUP enabled successful multi-gait biped locomotion within 25 real-world trials and demonstrated substantially improved performance over nominal and robust baselines (Yu et al., 2019).