- The paper introduces HoST, a reinforcement learning framework that learns standing-up control from scratch without predefined trajectories.
- It employs a multi-critic RL approach and a curriculum-based force exploration to achieve smooth, adaptive, and robust motions across diverse terrains.
- Real robot experiments validate the framework with a 100% success rate, demonstrating effective sim-to-real transfer and resilience to disturbances.
The paper introduces HoST (Humanoid Standing-up Control), a reinforcement learning (RL) framework designed to enable humanoid robots to stand up from diverse postures and handle real-world environmental disturbances. The framework aims to bridge the gap between simulation and real-world application by training control policies from scratch without relying on predefined motion trajectories. The trained policies are directly deployed on the Unitree G1 humanoid robot.
The core contributions of this work include:
- Achieving real-world posture-adaptive motions via RL without predefined trajectories or sim-to-real adaptation.
- Demonstrating smoothness, stability, and robustness of the learned control policies under challenging external disturbances.
- Establishing evaluation protocols designed to comprehensively analyze standing-up control.
The standing-up control problem is formulated as a Markov decision process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where:
- $\mathcal{S}$ is the state space. The state $s_t$ includes proprioceptive information from the robot's Inertial Measurement Unit (IMU) and joint encoders: $s_t = [\omega_t, r_t, p_t, q_t, \dot{q}_t, a_{t-1}, \beta]$.
  - $\omega_t$ is the angular velocity of the robot base.
  - $r_t$ and $p_t$ are the roll and pitch of the base.
  - $q_t$ and $\dot{q}_t$ are the joint positions and velocities.
  - $a_{t-1}$ is the previous action.
  - $\beta \in (0, 1]$ is a scalar that scales the action output.
- $\mathcal{A}$ is the action space. The action $a_t$ represents the difference between the current and next-step joint positions; joint torques are computed from the resulting target by a Proportional-Derivative (PD) controller (a minimal sketch follows this list):
  - $\tau_t = K_p \cdot (q^d_t - q_t) - K_d \cdot \dot{q}_t$
  - $\tau_t$ is the torque at timestep $t$.
  - $K_p$ and $K_d$ are the stiffness and damping coefficients of the PD controller.
  - $q^d_t = q_t + \beta a_t$ is the PD target, with each dimension of $a_t$ constrained to $[-1, 1]$.
- $\mathcal{T}$ is the transition function.
- $\mathcal{R}$ is the reward function.
- $\gamma \in [0, 1]$ is the discount factor.
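To make the action interface concrete, here is a minimal Python sketch of mapping a policy action to joint torques through the PD controller above. The joint count matches the 23-DoF Unitree G1, but the gain values and the default $\beta$ are illustrative assumptions, not the paper's settings.

```python
import numpy as np

NUM_JOINTS = 23      # 23-DoF Unitree G1 configuration used in the paper
KP, KD = 40.0, 1.0   # illustrative stiffness/damping gains (assumed, not the paper's values)

def pd_torque(action, q, q_dot, beta=0.25):
    """Map a policy action in [-1, 1]^23 to joint torques via the PD controller.

    action : policy output a_t (clipped to [-1, 1] per dimension)
    q, q_dot : current joint positions and velocities
    beta   : action rescaler in (0, 1] that implicitly bounds the maximal torque
    """
    action = np.clip(action, -1.0, 1.0)
    q_target = q + beta * action              # PD target: q_t^d = q_t + beta * a_t
    return KP * (q_target - q) - KD * q_dot   # tau_t = Kp (q_t^d - q_t) - Kd * q_dot_t

# Example: a zero action holds the current configuration (zero torque at rest).
print(pd_torque(np.zeros(NUM_JOINTS), np.zeros(NUM_JOINTS), np.zeros(NUM_JOINTS)))
```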
The framework addresses key challenges through several components:
- Reward Design and Optimization: The standing-up task is divided into three stages based on the height of the robot base, $h_{\text{base}}$:
  - Righting the body ($h_{\text{base}} < H_{\text{stage1}}$).
  - Kneeling.
  - Rising the body ($h_{\text{base}} > H_{\text{stage2}}$).
Reward functions are classified into four groups: task reward ($r^{\text{task}}$), style reward ($r^{\text{style}}$), regularization reward ($r^{\text{regu}}$), and post-task reward ($r^{\text{post}}$). The overall reward function is:
* $r_t = w_{\text{task}} \cdot r_t^{\text{task}} + w_{\text{style}} \cdot r_t^{\text{style}} + w_{\text{regu}} \cdot r_t^{\text{regu}} + w_{\text{post}} \cdot r_t^{\text{post}}$
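A minimal sketch of how the stage split and the weighted reward combination could look in code; the stage thresholds and group weights are placeholder assumptions, not values from the paper.

```python
# Placeholder group weights and stage thresholds (assumed values, not from the paper).
WEIGHTS = {"task": 2.5, "style": 1.0, "regu": 0.1, "post": 1.0}
H_STAGE1, H_STAGE2 = 0.45, 0.65   # base-height thresholds in metres (assumed)

def stage(h_base):
    """Coarse stage label from the base height: righting, kneeling, or rising."""
    if h_base < H_STAGE1:
        return "righting"
    if h_base > H_STAGE2:
        return "rising"
    return "kneeling"

def total_reward(reward_groups):
    """Weighted sum r_t = sum_i w_i * r_t^i over the four reward groups."""
    return sum(WEIGHTS[name] * reward_groups[name] for name in WEIGHTS)

print(stage(0.3), total_reward({"task": 1.0, "style": 0.2, "regu": -0.05, "post": 0.0}))
```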
To optimize the reward functions, the paper uses multi-critic RL, where each reward group is assigned its own critic $V_{\phi_i}$. The loss function for each critic is (a combined sketch of the multi-critic update follows this list):
* $\mathcal{L}(\phi_i) = \mathbb{E}\big[\,\| r_t^i + \gamma \bar{V}_{\phi_i}(s_{t+1}) - V_{\phi_i}(s_t) \|^2\,\big]$
* $r_t^i$ is the total reward for group $i$.
* $\bar{V}_{\phi_i}$ is the target value function of reward group $i$.
The advantages are aggregated into an overall weighted advantage $A_t = \sum_i w_i \cdot \frac{A_{\phi_i} - \mu_{A_{\phi_i}}}{\sigma_{A_{\phi_i}}}$, where $\mu_{A_{\phi_i}}$ and $\sigma_{A_{\phi_i}}$ are the batch mean and standard deviation of each advantage. The critics are updated simultaneously with the policy network $\pi_\theta$ according to:
* $\mathcal{L}(\theta) = \mathbb{E}\big[\min\big(\alpha_t(\theta) A_t,\ \mathrm{clip}(\alpha_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\big)\big]$
* $\alpha_t(\theta)$ is the probability ratio.
* $\epsilon$ is the clipping hyperparameter.
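The following sketch ties the multi-critic pieces together: one TD(0) value loss per reward group, per-group advantage normalization with batch statistics, weighted aggregation, and the clipped surrogate objective. Tensor layouts, the discount factor, and the group weights are assumptions for illustration.

```python
import torch

def critic_losses(rewards, values, target_values_next, gamma=0.99):
    """One TD(0) value loss per reward group.

    rewards, values, target_values_next: dicts mapping group name -> tensor of shape [B].
    """
    return {
        i: ((rewards[i] + gamma * target_values_next[i] - values[i]) ** 2).mean()
        for i in rewards
    }

def aggregate_advantage(advantages, weights):
    """Normalize each group's advantage by its batch mean/std, then combine with weights."""
    total = 0.0
    for i, adv in advantages.items():
        total = total + weights[i] * (adv - adv.mean()) / (adv.std() + 1e-8)
    return total

def ppo_policy_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss (negated for gradient descent)."""
    ratio = torch.exp(log_prob - old_log_prob)   # alpha_t(theta)
    surrogate = torch.min(
        ratio * advantage,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage,
    )
    return -surrogate.mean()
```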
Exploration Strategy: A curriculum-based vertical pulling force $F$ is applied to the robot base during the initial training stages to facilitate exploration. The force takes effect only when the robot's trunk is near-vertical and is gradually reduced as the robot learns to maintain a target height.
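A hedged sketch of such a force curriculum, assuming a simple multiplicative decay and illustrative thresholds for trunk verticality and target base height (none of these values come from the paper):

```python
class ForceCurriculum:
    """Vertical pulling force on the robot base, decayed as the policy improves."""

    def __init__(self, init_force=200.0, decay=0.9,
                 vertical_thresh_rad=0.35, target_height=0.7):
        self.force = init_force              # initial assist force in N (assumed value)
        self.decay = decay                   # multiplicative decay per curriculum update
        self.vertical_thresh = vertical_thresh_rad
        self.target_height = target_height   # base height to maintain in m (assumed value)

    def pulling_force(self, trunk_tilt_rad):
        """Apply the assist only while the trunk is close to vertical."""
        return self.force if abs(trunk_tilt_rad) < self.vertical_thresh else 0.0

    def update(self, mean_base_height):
        """Shrink the assist once the robot reliably maintains the target height."""
        if mean_base_height >= self.target_height:
            self.force *= self.decay

curriculum = ForceCurriculum()
curriculum.update(mean_base_height=0.72)   # robot held the target height -> weaken the assist
print(curriculum.pulling_force(trunk_tilt_rad=0.1))
```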
Motion Constraints:
- An action rescaler $\beta$ scales the action output to implicitly bound the maximal torque on each actuator.
- Smoothness regularization is incorporated using the L2C2 method, which applies the regularization to both the actor network $\pi_\theta$ and the critics $V_{\phi_i}$ (a sketch follows this list):
  - $\mathcal{L}_{\text{L2C2}} = \lambda_\pi D(\pi_\theta(s_t), \pi_\theta(\bar{s}_t)) + \lambda_V \sum_i D(V_{\phi_i}(s_t), V_{\phi_i}(\bar{s}_t))$
  - $D$ is a distance metric.
  - $\lambda_\pi$ and $\lambda_V$ are weight coefficients.
  - $\bar{s}_t = s_t + (s_{t+1} - s_t) \cdot u$ is the interpolated state, with $u$ drawn from a uniform distribution.
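A minimal sketch of this smoothness regularizer, assuming the distance metric $D$ is a mean squared difference, the actor returns mean actions, and states are batched as `[B, obs_dim]` tensors:

```python
import torch

def l2c2_loss(actor, critics, s_t, s_tp1, lambda_pi=1.0, lambda_v=1.0):
    """Smoothness regularization on interpolated states s_bar = s_t + (s_{t+1} - s_t) * u."""
    u = torch.rand(s_t.shape[0], 1, device=s_t.device)   # u ~ U(0, 1), one sample per batch row
    s_bar = s_t + (s_tp1 - s_t) * u

    # Distance D taken as a mean squared difference here; the actor is assumed to
    # return mean actions (deterministic head) rather than a distribution object.
    actor_term = ((actor(s_t) - actor(s_bar)) ** 2).mean()
    critic_term = sum(((V(s_t) - V(s_bar)) ** 2).mean() for V in critics)
    return lambda_pi * actor_term + lambda_v * critic_term
```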
- Sim-to-Real Transfer:
  - Diverse terrains are designed to simulate real-world starting postures: ground, platform, wall, and slope.
  - Domain randomization is applied to reduce the influence of physical discrepancies between simulation and the real world, including body mass, base center-of-mass (CoM) offset, PD gains, torque offset, and initial pose.
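A sketch of how the listed randomization terms could be sampled at each environment reset; the ranges below are illustrative assumptions rather than the paper's values.

```python
import numpy as np

def sample_domain_randomization(rng=np.random.default_rng()):
    """Sample one set of randomized physical parameters (illustrative ranges only)."""
    return {
        "body_mass_scale": rng.uniform(0.9, 1.1),          # scale applied to link masses
        "base_com_offset_m": rng.uniform(-0.03, 0.03, 3),  # CoM shift of the base (x, y, z)
        "kp_scale": rng.uniform(0.9, 1.1),                 # PD stiffness scaling
        "kd_scale": rng.uniform(0.9, 1.1),                 # PD damping scaling
        "torque_offset_nm": rng.uniform(-1.0, 1.0, 23),    # per-joint torque bias
        "init_pose_noise_rad": rng.uniform(-0.1, 0.1, 23), # perturbation of the initial pose
    }

print(sample_domain_randomization())
```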
The framework is implemented using the Isaac Gym simulator with 4096 parallel environments and the 23-DoF Unitree G1 robot. The actor and critic networks are 3-layer and 2-layer Multilayer Perceptrons (MLPs), respectively.
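A hedged sketch of networks with the stated depths (3-layer actor MLP, 2-layer critic MLPs, one critic per reward group); the observation dimension, hidden width, and activation choice are assumptions.

```python
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 75, 23, 256   # observation size and hidden width are assumed

actor = nn.Sequential(                       # 3-layer MLP policy
    nn.Linear(OBS_DIM, HIDDEN), nn.ELU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
    nn.Linear(HIDDEN, ACT_DIM), nn.Tanh(),   # actions bounded to [-1, 1]
)

def make_critic():
    """2-layer MLP value function."""
    return nn.Sequential(
        nn.Linear(OBS_DIM, HIDDEN), nn.ELU(),
        nn.Linear(HIDDEN, 1),
    )

# One critic per reward group: task, style, regularization, post-task.
critics = {name: make_critic() for name in ["task", "style", "regu", "post"]}
```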
The paper introduces the following evaluation metrics for standing-up control:
- Success rate ($E_{\text{succ}}$).
- Feet movement ($E_{\text{feet}}$).
- Motion smoothness ($E_{\text{smth}}$).
- Energy ($E_{\text{engy}}$).
Ablation studies validate the effect of each component. The results show that multi-critic learning, force curriculum, and motion constraints contribute to successful and smooth standing-up motions. For example, without the proposed force curriculum, the robot fails to stand up on all terrains except the platform.
Trajectory analysis using Uniform Manifold Approximation and Projection (UMAP) visualizes the robot’s motion across diverse terrains, showing distinct and consistent motion patterns.
Robustness analysis shows that the policies remain reliable under all tested disturbances, achieving high success rates and efficient motion energy utilization.
Real-robot experiments validate the simulation results: the method achieves a 100% success rate and high motion smoothness across all test scenes, smoothness regularization visibly improves the motions, and the controllers generalize well to outdoor environments. A sim-to-real analysis further shows that domain randomization reduces the sim-to-real gap.
The controllers also demonstrate emergent properties such as robustness to external disturbances, fall recovery, and dynamic balance. For example, they maintain stability under significant disturbances, such as objects disrupting the robot's center of gravity, and handle payloads of up to 12 kg.