- The paper demonstrates that a two-phase training curriculum with randomized terrains and dynamics significantly enhances humanoid walking on compliant and uneven surfaces.
- The method leverages a feedforward MLP and Proximal Policy Optimization, applying intra-episode dynamics randomization to ensure effective sim-to-real transfer without terrain-specific tuning.
- Real-robot experiments on the HRP-5P validate robust adaptability across various surfaces, though challenges remain with steep slopes and obstacles exceeding 4cm.
This paper explores the use of sim-to-real deep reinforcement learning (RL) to create robust locomotion controllers for humanoid robots, specifically targeting compliant (soft) and uneven terrains using only proprioceptive feedback (i.e., blind locomotion) (Robust Humanoid Walking on Compliant and Uneven Terrain with Deep Reinforcement Learning, 18 Apr 2025). The primary contribution is demonstrating that a relatively simple training curriculum exposing the RL agent to randomized terrains in simulation enables robust walking on the real HRP-5P humanoid across various challenging surfaces without needing parameter tuning between environments.
Problem: Traditional model-based controllers for humanoid walking often rely on precise assumptions about foot contact timing and terrain properties. These assumptions break down on unpredictable surfaces like soft ground or uneven obstacles, leading to instability, especially for heavy humanoids like HRP-5P. Developing a single controller that adapts implicitly to different terrains without exteroceptive sensing (like cameras or LiDAR) is a significant challenge.
Proposed Approach & Implementation:
The core method involves training an end-to-end locomotion policy using Proximal Policy Optimization (PPO) in the MuJoCo simulator and deploying it zero-shot onto the HRP-5P robot.
- Policy Architecture: A feedforward Multilayer Perceptron (MLP) with 2 hidden layers of 256 units each (ReLU activation) is used for both the actor and critic.
- Observations: The policy takes proprioceptive inputs and task commands (a sketch of assembling them follows this list):
- Robot State: Root orientation (roll, pitch), root angular velocity, joint positions, joint velocities, motor currents (scaled to torques).
- External State: Commanded walking mode (one-hot: Standing, Inplace, Forward), mode reference (target forward/turning speed), and a cyclic clock signal (sin(2πϕ/L), cos(2πϕ/L)) indicating gait phase.
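A minimal sketch of how these inputs might be assembled and fed to the 2x256 ReLU actor, assuming PyTorch; the cycle length L, the observation ordering, and the dimensions behind obs_dim are illustrative assumptions, not values taken from the paper:

```python
import numpy as np
import torch.nn as nn

L = 80  # assumed gait-cycle length in policy steps (2 s nominal cycle at 40 Hz)

def build_observation(root_rp, root_angvel, q, dq, tau, mode_onehot, mode_ref, phase):
    """Concatenate proprioception, task commands, and the cyclic clock signal (1-D arrays)."""
    clock = np.array([np.sin(2 * np.pi * phase / L), np.cos(2 * np.pi * phase / L)])
    return np.concatenate([root_rp, root_angvel, q, dq, tau,
                           mode_onehot, mode_ref, clock]).astype(np.float32)

# Actor: feedforward MLP with two 256-unit ReLU hidden layers and 12 leg-joint outputs.
obs_dim = 2 + 3 + 12 + 12 + 12 + 3 + 2 + 2  # assumes only the 12 leg joints are observed
actor = nn.Sequential(
    nn.Linear(obs_dim, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 12),
)
```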
- Actions: The policy outputs target positions for the 12 leg joints. These are added to nominal offsets and tracked using low-gain PD controllers running at 1 kHz. The policy itself runs at 40 Hz.
- PD Gains (Kp,Kd): Specified per joint type (e.g., Hip Yaw: 200, 20; Ankle Roll: 80, 8).
```python
# PD torque for one leg joint (runs at 1 kHz; the policy updates targets at 40 Hz).
# The policy output is added to a nominal joint offset; the target velocity is zero.
target_q = policy_output + nominal_offset
tau_pd = Kp * (target_q - current_joint_position) + Kd * (0.0 - current_joint_velocity)
```
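Because the PD loop runs 25x faster than the policy (1 kHz vs. 40 Hz), each policy action is held and tracked for 25 control ticks; the low gains are presumably what keep tracking compliant enough to absorb contact impacts rather than fight them.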
- Reward Function: A weighted sum (sketched after this list) encourages:
- Bipedal gait patterns (using clock-based foot force/speed terms).
- Tracking commanded velocity (forward/turning).
- Safety and realism (maintaining root height, limiting upper body motion, staying near nominal posture, minimizing joint velocities).
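A hedged sketch of that weighted-sum structure; the term names mirror the bullets above, but the weights and the assumption that each term is pre-mapped into [0, 1] are illustrative, not the paper's actual reward coefficients:

```python
def walking_reward(terms: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of shaping terms, each assumed to be pre-normalized into [0, 1]."""
    return sum(weights[name] * terms[name] for name in weights)

# Illustrative term names and weights only (not the paper's coefficients):
example_weights = {
    "clock_foot_force": 0.2,   # clock-based: swing foot should carry little load
    "clock_foot_speed": 0.2,   # clock-based: stance foot should stay still
    "velocity_tracking": 0.3,  # follow commanded forward / turning speed
    "root_height": 0.1,
    "upper_body_motion": 0.1,
    "nominal_posture": 0.05,
    "joint_velocity": 0.05,
}
```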
- Training Curriculum: A two-phase approach is used:
- Phase 1 (Base Policy): Train on a flat, rigid floor with randomized walking modes (Standing, Inplace, Forward) and target velocities. Includes dynamics randomization.
- Phase 2 (Terrain Finetuning): Fine-tune the base policy while introducing randomized terrains:
- Compliance Simulation: Randomize MuJoCo's `solref` parameter (specifically its time-constant component) for the foot geoms within (0.02, 0.4) every ~0.5s to simulate varying ground softness.
- Unevenness Simulation: Use MuJoCo height fields (a pre-generated 10m×10m grid with 4cm resolution). Randomize the relative z-position of the height field w.r.t. the flat floor within (-4cm, 0cm) every ~5s (disabled during double support) to create unevenness up to 4cm. Also randomize the x,y position slightly. (A minimal sketch of both terrain randomizations follows.)
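A minimal sketch of both terrain randomizations using the official mujoco Python bindings; the geom names, the horizontal-shift magnitude, and calling this on the stated schedule are assumptions for illustration:

```python
import numpy as np
import mujoco

rng = np.random.default_rng(0)

def randomize_terrain(model: mujoco.MjModel, default: mujoco.MjModel) -> None:
    """Resample terrain properties; the paper does this on a fixed schedule during training."""
    # Compliance: resample the solref time constant (solref[:, 0]) of the foot
    # geoms within (0.02, 0.4) -- larger values give softer, spongier contacts.
    for name in ("lfoot", "rfoot"):                                      # hypothetical geom names
        gid = mujoco.mj_name2id(model, mujoco.mjtObj.mjOBJ_GEOM, name)
        model.geom_solref[gid, 0] = rng.uniform(0.02, 0.4)

    # Unevenness: move the height-field geom within (-4cm, 0) of the flat floor
    # so bumps protrude by up to 4cm, plus a small (assumed ±5cm) horizontal shift.
    hid = mujoco.mj_name2id(model, mujoco.mjtObj.mjOBJ_GEOM, "terrain")  # hypothetical
    model.geom_pos[hid, 2] = default.geom_pos[hid, 2] + rng.uniform(-0.04, 0.0)
    model.geom_pos[hid, :2] = default.geom_pos[hid, :2] + rng.uniform(-0.05, 0.05, size=2)
```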
- Dynamics Randomization: Crucially, dynamics parameters are randomized during each episode (every ~0.5s on average) rather than only at the start. This prevents overfitting to specific simulation dynamics and improves sim-to-real transfer (a sketch follows this list). Randomized parameters include:
- Joint damping coefficients: (0.2, 5)
- Joint static friction: (2, 8) Nm
- Link masses: [0.95, 1.05] × default
- Link CoM: ±0.01m from default
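A sketch of the intra-episode dynamics randomization under the same mujoco-bindings assumption; mapping "static friction" to MuJoCo's per-DoF frictionloss and applying the ranges to every DoF and body (rather than only the leg chain) are simplifications made here:

```python
import numpy as np
import mujoco

rng = np.random.default_rng(0)

def randomize_dynamics(model: mujoco.MjModel, default: mujoco.MjModel) -> None:
    """Resample dynamics parameters; call every ~0.5 s of simulated time on average."""
    nv, nb = model.nv, model.nbody
    # Joint damping and static friction (here: MuJoCo's per-DoF frictionloss);
    # in practice these would be restricted to the actuated leg joints.
    model.dof_damping[:] = rng.uniform(0.2, 5.0, size=nv)
    model.dof_frictionloss[:] = rng.uniform(2.0, 8.0, size=nv)
    # Link masses scaled around their defaults, and CoM offsets within ±1cm.
    model.body_mass[:] = default.body_mass * rng.uniform(0.95, 1.05, size=nb)
    model.body_ipos[:] = default.body_ipos + rng.uniform(-0.01, 0.01, size=(nb, 3))
```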
Aperiodic Walking Policy:
The paper argues that the fixed periodicity imposed by the standard clock signal limits robustness. To address this, they propose an augmented policy that also predicts a phase offset, allowing it to modulate the gait cycle duration dynamically.
- Action Space: Increased from 12D to 13D, with the extra action $a_{\delta\phi}$ modifying the phase update (a minimal sketch follows):
$\phi_{t+1} = \phi_t + \mathrm{clip}(a_{\delta\phi}, -5, 5) + 1$
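A small sketch of this phase update, assuming the phase counter is kept in policy steps and wrapped at the cycle length L so the clock signal stays well-defined:

```python
import numpy as np

L = 80  # assumed cycle length in policy steps (2 s nominal cycle at 40 Hz)

def update_phase(phase: float, a_delta_phi: float) -> float:
    """Advance the gait phase by the nominal +1 per policy step plus a clipped, learned offset."""
    phase = phase + np.clip(a_delta_phi, -5.0, 5.0) + 1.0
    return phase % L  # wrap so sin/cos(2*pi*phase/L) stays well-defined
```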
- Simulation Results:
- Achieves higher training rewards.
- Exhibits variable swing/stance durations based on terrain (aperiodic gait).
- Learns to shorten the gait cycle (e.g., ~1.55s vs default 2s when walking forward), potentially allowing faster reaction.
- Shows improved success rates on simulated terrains with unevenness greater than the 4cm seen during training (Table V).
- Limitation: Sim-to-real transfer for this clock-control policy proved difficult and was not demonstrated on the real robot, potentially due to the need for very accurate system identification or increased sensitivity to unmodeled dynamics.
Real-Robot Experiments (HRP-5P):
The standard policy (without clock control) was deployed zero-shot.
- Indoor: Tested on flat floor, soft gym mattress, cushion foam block, and rigid uneven blocks. Achieved a 67% success rate over 9 trials. Failures occurred when conservative safety limits were triggered (later relaxed), when standing still on steep slopes, or when encountering obstacles > 4cm (the training limit).
- Outdoor: Successfully walked ~25m on a paved street (slight incline) and ~30m on an irregular grass lawn (compliant and uneven).
- Ablations: Compared policies trained with different terrain variations:
- Flat-only policy failed quickly on any unevenness.
- Uneven-only policy failed on soft terrain.
- Fixed-compliance policy struggled on the flat, rigid floor.
- The proposed terrain-randomized policy (random compliance + unevenness) performed best across all tested terrains.
Conclusion & Practical Implications:
The work successfully demonstrates that a sim-to-real RL approach with a curated curriculum of randomized terrains and dynamics can produce a robust, unified walking controller for a heavy humanoid like HRP-5P on diverse surfaces, including compliant ones, using only proprioception.
- Implementation: The two-stage training curriculum and intra-episode dynamics randomization are key practical techniques. Simulating compliance via MuJoCo's `solref` and unevenness via height fields provides a blueprint for training.
- Performance: The controller adapts implicitly without needing terrain classification or parameter tuning. It handles moderate unevenness (up to ~4cm) and various compliances.
- Limitations: Blind locomotion limits handling of larger, unexpected obstacles. The 4cm training height directly caps real-world capability. Standing on slopes was challenging. The aperiodic policy shows promise but faces sim-to-real hurdles.
- Resources: Training took ~20 hours on a 32-core CPU. Inference is efficient enough for the robot's onboard PC. Source code is provided.