- The paper introduces HoST, a reinforcement learning framework that learns standing-up control from scratch without predefined trajectories.
- It employs a multi-critic RL approach and a curriculum-based force exploration to achieve smooth, adaptive, and robust motions across diverse terrains.
- Real robot experiments validate the framework with a 100% success rate, demonstrating effective sim-to-real transfer and resilience to disturbances.
The paper introduces HoST (Humanoid Standing-up Control), a reinforcement learning (RL) framework designed to enable humanoid robots to stand up from diverse postures and handle real-world environmental disturbances. The framework aims to bridge the gap between simulation and real-world application by training control policies from scratch without relying on predefined motion trajectories. The trained policies are directly deployed on the Unitree G1 humanoid robot.
The core contributions of this work include:
- Achieving real-world posture-adaptive motions via RL without predefined trajectories or sim-to-real adaptation.
- Demonstrating smoothness, stability, and robustness of the learned control policies under challenging external disturbances.
- Establishing evaluation protocols designed to comprehensively analyze standing-up control.
The standing-up control problem is formulated as a Markov decision process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where:
- $\mathcal{S}$ is the state space. The state $s_t$ includes proprioceptive information from the robot's Inertial Measurement Unit (IMU) and joint encoders: $s_t = [\omega_t, r_t, p_t, q_t, \dot{q}_t, a_{t-1}, \beta]$.
  - $\omega_t$ is the angular velocity of the robot base.
  - $r_t$ and $p_t$ are the roll and pitch of the base.
  - $q_t$ and $\dot{q}_t$ are the joint positions and velocities.
  - $a_{t-1}$ is the previous action.
  - $\beta \in (0, 1]$ is a scalar that scales the action output.
- $\mathcal{A}$ is the action space. The action $a_t$ represents the difference between the current and next-step joint positions; joint torques are computed from the resulting target by a Proportional-Derivative (PD) controller (a minimal sketch follows this list):
  - $\tau_t = K_p \cdot (q^d_t - q_t) - K_d \cdot \dot{q}_t$
  - $\tau_t$ is the torque at timestep $t$.
  - $K_p$ and $K_d$ are the stiffness and damping coefficients of the PD controller.
  - $q^d_t = q_t + \beta a_t$ is the PD target, with each dimension of $a_t$ constrained to $[-1, 1]$.
- $\mathcal{T}$ is the transition function.
- $\mathcal{R}$ is the reward function.
- $\gamma \in [0, 1]$ is the discount factor.
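To make the action interface concrete, here is a minimal Python sketch of mapping a policy action to joint torques through the PD controller above. The joint count matches the 23-DoF Unitree G1, but the gain values and the default $\beta$ are illustrative assumptions, not the paper's settings.

```python
import numpy as np

NUM_JOINTS = 23      # 23-DoF Unitree G1 configuration used in the paper
KP, KD = 40.0, 1.0   # illustrative stiffness/damping gains (assumed, not the paper's values)

def pd_torque(action, q, q_dot, beta=0.25):
    """Map a policy action in [-1, 1]^23 to joint torques via the PD controller.

    action : policy output a_t (clipped to [-1, 1] per dimension)
    q, q_dot : current joint positions and velocities
    beta   : action rescaler in (0, 1] that implicitly bounds the maximal torque
    """
    action = np.clip(action, -1.0, 1.0)
    q_target = q + beta * action              # PD target: q_t^d = q_t + beta * a_t
    return KP * (q_target - q) - KD * q_dot   # tau_t = Kp (q_t^d - q_t) - Kd * q_dot_t

# Example: a zero action holds the current configuration (zero torque at rest).
print(pd_torque(np.zeros(NUM_JOINTS), np.zeros(NUM_JOINTS), np.zeros(NUM_JOINTS)))
```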
The framework addresses key challenges through several components:
- Reward Design and Optimization: The standing-up task is divided into three stages based on the height of the robot base, $h_{\text{base}}$:
  - Righting the body ($h_{\text{base}} < H_{\text{stage1}}$).
  - Kneeling.
  - Rising the body ($h_{\text{base}} > H_{\text{stage2}}$).
Reward functions are classified into four groups: task reward ($r^{\text{task}}$), style reward ($r^{\text{style}}$), regularization reward ($r^{\text{regu}}$), and post-task reward ($r^{\text{post}}$). The overall reward function is:
* $r_t = w_{\text{task}} \cdot r_t^{\text{task}} + w_{\text{style}} \cdot r_t^{\text{style}} + w_{\text{regu}} \cdot r_t^{\text{regu}} + w_{\text{post}} \cdot r_t^{\text{post}}$
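A minimal sketch of how the stage split and the weighted reward combination could look in code; the stage thresholds and group weights are placeholder assumptions, not values from the paper.

```python
# Placeholder group weights and stage thresholds (assumed values, not from the paper).
WEIGHTS = {"task": 2.5, "style": 1.0, "regu": 0.1, "post": 1.0}
H_STAGE1, H_STAGE2 = 0.45, 0.65   # base-height thresholds in metres (assumed)

def stage(h_base):
    """Coarse stage label from the base height: righting, kneeling, or rising."""
    if h_base < H_STAGE1:
        return "righting"
    if h_base > H_STAGE2:
        return "rising"
    return "kneeling"

def total_reward(reward_groups):
    """Weighted sum r_t = sum_i w_i * r_t^i over the four reward groups."""
    return sum(WEIGHTS[name] * reward_groups[name] for name in WEIGHTS)

print(stage(0.3), total_reward({"task": 1.0, "style": 0.2, "regu": -0.05, "post": 0.0}))
```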
To optimize the reward functions, the paper uses multi-critic RL, where each reward group is assigned its own critic $V_{\phi_i}$. The loss function for each critic is (a combined sketch of the multi-critic update follows this list):
* $\mathcal{L}(\phi_i) = \mathbb{E}\big[\,\| r_t^i + \gamma \bar{V}_{\phi_i}(s_{t+1}) - V_{\phi_i}(s_t) \|^2\,\big]$
* $r_t^i$ is the total reward for group $i$.
* $\bar{V}_{\phi_i}$ is the target value function of reward group $i$.
The advantages are aggregated into an overall weighted advantage $A_t = \sum_i w_i \cdot \frac{A_{\phi_i} - \mu_{A_{\phi_i}}}{\sigma_{A_{\phi_i}}}$, where $\mu_{A_{\phi_i}}$ and $\sigma_{A_{\phi_i}}$ are the batch mean and standard deviation of each advantage. The critics are updated simultaneously with the policy network $\pi_\theta$ according to:
* $\mathcal{L}(\theta) = \mathbb{E}\big[\min\big(\alpha_t(\theta) A_t,\ \mathrm{clip}(\alpha_t(\theta), 1 - \epsilon, 1 + \epsilon) A_t\big)\big]$
* $\alpha_t(\theta)$ is the probability ratio.
* $\epsilon$ is the clipping hyperparameter.
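The following sketch ties the multi-critic pieces together: one TD(0) value loss per reward group, per-group advantage normalization with batch statistics, weighted aggregation, and the clipped surrogate objective. Tensor layouts, the discount factor, and the group weights are assumptions for illustration.

```python
import torch

def critic_losses(rewards, values, target_values_next, gamma=0.99):
    """One TD(0) value loss per reward group.

    rewards, values, target_values_next: dicts mapping group name -> tensor of shape [B].
    """
    return {
        i: ((rewards[i] + gamma * target_values_next[i] - values[i]) ** 2).mean()
        for i in rewards
    }

def aggregate_advantage(advantages, weights):
    """Normalize each group's advantage by its batch mean/std, then combine with weights."""
    total = 0.0
    for i, adv in advantages.items():
        total = total + weights[i] * (adv - adv.mean()) / (adv.std() + 1e-8)
    return total

def ppo_policy_loss(log_prob, old_log_prob, advantage, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss (negated for gradient descent)."""
    ratio = torch.exp(log_prob - old_log_prob)   # alpha_t(theta)
    surrogate = torch.min(
        ratio * advantage,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage,
    )
    return -surrogate.mean()
```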
Exploration Strategy: A curriculum-based vertical pulling force $F$ is applied to the robot base during the initial training stages to facilitate exploration. The force takes effect only when the robot's trunk is near-vertical and is gradually reduced as the robot learns to maintain a target height.
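A hedged sketch of such a force curriculum, assuming a simple multiplicative decay and illustrative thresholds for trunk verticality and target base height (none of these values come from the paper):

```python
class ForceCurriculum:
    """Vertical pulling force on the robot base, decayed as the policy improves."""

    def __init__(self, init_force=200.0, decay=0.9,
                 vertical_thresh_rad=0.35, target_height=0.7):
        self.force = init_force              # initial assist force in N (assumed value)
        self.decay = decay                   # multiplicative decay per curriculum update
        self.vertical_thresh = vertical_thresh_rad
        self.target_height = target_height   # base height to maintain in m (assumed value)

    def pulling_force(self, trunk_tilt_rad):
        """Apply the assist only while the trunk is close to vertical."""
        return self.force if abs(trunk_tilt_rad) < self.vertical_thresh else 0.0

    def update(self, mean_base_height):
        """Shrink the assist once the robot reliably maintains the target height."""
        if mean_base_height >= self.target_height:
            self.force *= self.decay

curriculum = ForceCurriculum()
curriculum.update(mean_base_height=0.72)   # robot held the target height -> weaken the assist
print(curriculum.pulling_force(trunk_tilt_rad=0.1))
```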
Motion Constraints:
- An action rescaler $\beta$ scales the action output to implicitly bound the maximal torque on each actuator.
- Smoothness regularization is incorporated using the L2C2 method, which applies the regularization to both the actor network $\pi_\theta$ and the critics $V_{\phi_i}$ (a sketch follows this list):
  - $\mathcal{L}_{\text{L2C2}} = \lambda_\pi D(\pi_\theta(s_t), \pi_\theta(\bar{s}_t)) + \lambda_V \sum_i D(V_{\phi_i}(s_t), V_{\phi_i}(\bar{s}_t))$
  - $D$ is a distance metric.
  - $\lambda_\pi$ and $\lambda_V$ are weight coefficients.
  - $\bar{s}_t = s_t + (s_{t+1} - s_t) \cdot u$ is the interpolated state, with $u$ drawn from a uniform distribution.
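A minimal sketch of this smoothness regularizer, assuming the distance metric $D$ is a mean squared difference, the actor returns mean actions, and states are batched as `[B, obs_dim]` tensors:

```python
import torch

def l2c2_loss(actor, critics, s_t, s_tp1, lambda_pi=1.0, lambda_v=1.0):
    """Smoothness regularization on interpolated states s_bar = s_t + (s_{t+1} - s_t) * u."""
    u = torch.rand(s_t.shape[0], 1, device=s_t.device)   # u ~ U(0, 1), one sample per batch row
    s_bar = s_t + (s_tp1 - s_t) * u

    # Distance D taken as a mean squared difference here; the actor is assumed to
    # return mean actions (deterministic head) rather than a distribution object.
    actor_term = ((actor(s_t) - actor(s_bar)) ** 2).mean()
    critic_term = sum(((V(s_t) - V(s_bar)) ** 2).mean() for V in critics)
    return lambda_pi * actor_term + lambda_v * critic_term
```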
- Sim-to-Real Transfer:
  - Diverse terrains are designed to simulate real-world starting postures: ground, platform, wall, and slope.
  - Domain randomization is applied to reduce the influence of physical discrepancies between simulation and the real world, including body mass, base center-of-mass (CoM) offset, PD gains, torque offset, and initial pose.
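A sketch of how the listed randomization terms could be sampled at each environment reset; the ranges below are illustrative assumptions rather than the paper's values.

```python
import numpy as np

def sample_domain_randomization(rng=np.random.default_rng()):
    """Sample one set of randomized physical parameters (illustrative ranges only)."""
    return {
        "body_mass_scale": rng.uniform(0.9, 1.1),          # scale applied to link masses
        "base_com_offset_m": rng.uniform(-0.03, 0.03, 3),  # CoM shift of the base (x, y, z)
        "kp_scale": rng.uniform(0.9, 1.1),                 # PD stiffness scaling
        "kd_scale": rng.uniform(0.9, 1.1),                 # PD damping scaling
        "torque_offset_nm": rng.uniform(-1.0, 1.0, 23),    # per-joint torque bias
        "init_pose_noise_rad": rng.uniform(-0.1, 0.1, 23), # perturbation of the initial pose
    }

print(sample_domain_randomization())
```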
The framework is implemented using the Isaac Gym simulator with 4096 parallel environments and the 23-DoF Unitree G1 robot. The actor and critic networks are 3-layer and 2-layer Multilayer Perceptrons (MLPs), respectively.
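A hedged sketch of networks with the stated depths (3-layer actor MLP, 2-layer critic MLPs, one critic per reward group); the observation dimension, hidden width, and activation choice are assumptions.

```python
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 75, 23, 256   # observation size and hidden width are assumed

actor = nn.Sequential(                       # 3-layer MLP policy
    nn.Linear(OBS_DIM, HIDDEN), nn.ELU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ELU(),
    nn.Linear(HIDDEN, ACT_DIM), nn.Tanh(),   # actions bounded to [-1, 1]
)

def make_critic():
    """2-layer MLP value function."""
    return nn.Sequential(
        nn.Linear(OBS_DIM, HIDDEN), nn.ELU(),
        nn.Linear(HIDDEN, 1),
    )

# One critic per reward group: task, style, regularization, post-task.
critics = {name: make_critic() for name in ["task", "style", "regu", "post"]}
```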
The paper introduces the following evaluation metrics for standing-up control:
- Success rate ($E_{\text{succ}}$).
- Feet movement ($E_{\text{feet}}$).
- Motion smoothness ($E_{\text{smth}}$).
- Energy ($E_{\text{engy}}$).
Ablation studies validate the effect of each component. The results show that multi-critic learning, force curriculum, and motion constraints contribute to successful and smooth standing-up motions. For example, without the proposed force curriculum, the robot fails to stand up on all terrains except the platform.
Trajectory analysis using Uniform Manifold Approximation and Projection (UMAP) visualizes the robot’s motion across diverse terrains, showing distinct and consistent motion patterns.
Robustness analysis shows that the policies remain reliable under all tested disturbances, achieving high success rates and efficient motion energy utilization.
Real-robot experiments validate the simulation results: the method achieves a 100% success rate and high motion smoothness across all test scenes, smoothness regularization visibly improves the motions, and the controllers generalize well to outdoor environments. A sim-to-real analysis further shows that domain randomization reduces the sim-to-real gap.
The controllers also demonstrate emergent properties such as robustness to external disturbances, fall recovery, and dynamic balance. For example, they maintain stability under significant disturbances, such as objects disrupting the robot's center of gravity, and handle payloads of up to 12 kg.