
Learning Humanoid Standing-up Control across Diverse Postures (2502.08378v2)

Published 12 Feb 2025 in cs.RO, cs.AI, and cs.LG

Abstract: Standing-up control is crucial for humanoid robots, with the potential for integration into current locomotion and loco-manipulation systems, such as fall recovery. Existing approaches are either limited to simulations that overlook hardware constraints or rely on predefined ground-specific motion trajectories, failing to enable standing up across postures in real-world scenes. To bridge this gap, we present HoST (Humanoid Standing-up Control), a reinforcement learning framework that learns standing-up control from scratch, enabling robust sim-to-real transfer across diverse postures. HoST effectively learns posture-adaptive motions by leveraging a multi-critic architecture and curriculum-based training on diverse simulated terrains. To ensure successful real-world deployment, we constrain the motion with smoothness regularization and implicit motion speed bound to alleviate oscillatory and violent motions on physical hardware, respectively. After simulation-based training, the learned control policies are directly deployed on the Unitree G1 humanoid robot. Our experimental results demonstrate that the controllers achieve smooth, stable, and robust standing-up motions across a wide range of laboratory and outdoor environments. Videos and code are available at https://taohuang13.github.io/humanoid-standingup.github.io/.

Summary

  • The paper introduces HoST, a reinforcement learning framework that learns standing-up control from scratch without predefined trajectories.
  • It employs a multi-critic RL approach and a curriculum-based force exploration to achieve smooth, adaptive, and robust motions across diverse terrains.
  • Real robot experiments validate the framework with a 100% success rate, demonstrating effective sim-to-real transfer and resilience to disturbances.

The paper introduces HoST (Humanoid Standing-up Control), a reinforcement learning (RL) framework designed to enable humanoid robots to stand up from diverse postures and handle real-world environmental disturbances. The framework aims to bridge the gap between simulation and real-world application by training control policies from scratch without relying on predefined motion trajectories. The trained policies are directly deployed on the Unitree G1 humanoid robot.

The core contributions of this work include:

  • Achieving real-world posture-adaptive motions via RL without predefined trajectories or sim-to-real adaptation.
  • Demonstrating smoothness, stability, and robustness of the learned control policies under challenging external disturbances.
  • Establishing evaluation protocols designed to comprehensively analyze standing-up control.

The standing-up control problem is formulated as a Markov decision process (MDP) $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathcal{T}, \mathcal{R}, \gamma \rangle$, where:

  • $\mathcal{S}$ is the state space. The state $s_t$ includes proprioceptive information from the robot's Inertial Measurement Unit (IMU) and joint encoders: $s_t = [\omega_t, r_t, p_t, \dot{p}_t, a_{t-1}, \beta]$.
    • $\omega_t$ is the angular velocity of the robot base.
    • $r_t$ is the roll and pitch of the robot base.
    • $p_t$ and $\dot{p}_t$ are the joint positions and velocities.
    • $a_{t-1}$ is the previous action.
    • $\beta \in (0, 1]$ is a scalar that scales the action output.
  • $\mathcal{A}$ is the action space. The action $a_t$ represents a scaled offset between the current joint positions and the PD target; joint torques are then computed by a Proportional-Derivative (PD) controller (see the PD sketch after this list):
    • $\tau_t = K_p \cdot (p_t^d - p_t) - K_d \cdot \dot{p}_t$
    • $\tau_t$ is the torque at timestep $t$.
    • $K_p$ and $K_d$ represent the stiffness and damping coefficients of the PD controller.
    • $p_t^d = p_t + \beta a_t$ is the PD target, with each dimension of $a_t$ constrained to $[-1, 1]$.
  • $\mathcal{T}$ is the transition function.
  • $\mathcal{R}$ is the reward function.
  • $\gamma \in [0, 1]$ is the discount factor.
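
To make the action-to-torque pathway concrete, here is a minimal sketch of the PD step above, assuming per-joint NumPy arrays; the gain values, the example call, and the function name `pd_torques` are illustrative rather than taken from the paper's released code.

```python
import numpy as np

def pd_torques(p, p_dot, action, beta, kp, kd):
    """Turn a policy action into joint torques via the PD pathway above.

    p, p_dot : current joint positions and velocities
    action   : raw policy output, each dimension clipped to [-1, 1]
    beta     : action rescaler in (0, 1] that implicitly bounds torques
    kp, kd   : PD stiffness and damping gains (scalar or per-joint)
    """
    action = np.clip(action, -1.0, 1.0)
    p_target = p + beta * action            # PD target: p_t^d = p_t + beta * a_t
    tau = kp * (p_target - p) - kd * p_dot  # tau_t = Kp (p_t^d - p_t) - Kd * p_dot_t
    return tau

# Illustrative call for a 23-DoF robot such as the Unitree G1 (values are placeholders)
tau = pd_torques(p=np.zeros(23), p_dot=np.zeros(23),
                 action=np.random.uniform(-1, 1, size=23),
                 beta=0.25, kp=40.0, kd=1.0)
```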

The framework addresses key challenges through several components:

  1. Reward Design and Optimization: The standing-up task is divided into three stages based on the height of the robot base ($h_\mathrm{base}$):
    • Righting the body ($h_\mathrm{base} < H_{\mathrm{stage1}}$).
    • Kneeling ($H_{\mathrm{stage1}} \le h_\mathrm{base} \le H_{\mathrm{stage2}}$).
    • Rising the body ($h_\mathrm{base} > H_{\mathrm{stage2}}$).

    Reward functions are classified into four groups: task reward ($r^{\mathrm{task}}$), style reward ($r^{\mathrm{style}}$), regularization reward ($r^{\mathrm{regu}}$), and post-task reward ($r^{\mathrm{post}}$). The overall reward function is:
    • $r_t = w^{\mathrm{task}} \cdot r^{\mathrm{task}}_t + w^{\mathrm{style}} \cdot r^{\mathrm{style}}_t + w^{\mathrm{regu}} \cdot r^{\mathrm{regu}}_t + w^{\mathrm{post}} \cdot r^{\mathrm{post}}_t$

    To optimize the reward functions, the paper uses multi-critic RL, where each reward group has its own assigned critic $V_{\phi_i}$. The loss function for each critic is:
    • $\mathcal{L}(\phi_i) = \mathbb{E}\big[ \| r_{t}^i + \gamma \bar{V}_{\phi_i}(s_{t+1}) - V_{\phi_i}(s_t) \|^2 \big]$
    • $r_{t}^i$ is the total reward for group $i$.
    • $\bar{V}_{\phi_i}$ is the target value function of reward group $i$.

    The advantages are aggregated into an overall weighted advantage $A = \sum_{i} w^i \cdot \frac{A_{\phi_i} - \mu_{A_{\phi_i}}}{\sigma_{A_{\phi_i}}}$, where $\mu_{A_{\phi_i}}$ and $\sigma_{A_{\phi_i}}$ are the batch mean and standard deviation of each advantage. The critics are updated simultaneously with the policy network $\pi_\theta$, which is optimized with the clipped surrogate objective (see the first sketch after this list):
    • $\mathcal{L}(\theta) = \mathbb{E} \left[ \min \left( \alpha_t(\theta) A_t, \mathrm{clip}(\alpha_t(\theta), 1-\epsilon, 1+\epsilon) A_t \right) \right]$
    • $\alpha_t(\theta)$ is the probability ratio.
    • $\epsilon$ is the clipping hyperparameter.

  2. Exploration Strategy: A curriculum-based vertical pulling force $\mathcal{F}$ is applied to the robot base during the initial training stages to facilitate exploration. This force takes effect only when the robot's trunk is near-vertical and decreases as the robot maintains a target height (a sketch of this schedule appears after the list).

  3. Motion Constraints:

    • An action rescaler $\beta$ scales the action output, implicitly bounding the maximal torque applied by each actuator.
    • Smoothness regularization is incorporated using the L2C2 method, which regularizes both the actor network $\pi_\theta$ and the critics $V_{\phi_i}$ (see the smoothness sketch after this list):
      • $\mathcal{L}_{\mathrm{L2C2}} = \lambda_\pi D(\pi_\theta(s_t), \pi_\theta(\bar{s}_t)) + \lambda_V \sum_i D(V_{\phi_i}(s_t), V_{\phi_i}(\bar{s}_t))$
      • $D$ is a distance metric.
      • $\lambda_\pi$ and $\lambda_V$ are weight coefficients.
      • $\bar{s}_t = s_t + (s_{t+1} - s_t) \cdot u$ is an interpolated state, with $u \sim \mathcal{U}(\cdot)$.
  4. Sim-to-Real Transfer:
    • Diverse terrains are designed to simulate real-world starting postures: ground, platform, wall, and slope.
    • Domain randomization is applied to reduce the influence of physical discrepancies between simulation and real world, including body mass, base center of mass (CoM) offset, PD gains, torque offset, and initial pose.
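
The sketch below illustrates the multi-critic update from item 1: per-group advantages are batch-normalized, weighted, and summed, and the aggregated advantage feeds a standard PPO clipped objective. It assumes the per-group advantages have already been estimated (e.g., with GAE); the tensor shapes, group weights, and function names are illustrative, not taken from the paper's code.

```python
import torch

def aggregate_advantages(advantages, weights, eps=1e-8):
    """Normalize each reward group's advantage over the batch and combine them.

    advantages : dict mapping group name -> tensor of shape (batch,)
    weights    : dict mapping group name -> scalar weight w^i
    """
    total = 0.0
    for name, adv in advantages.items():
        norm = (adv - adv.mean()) / (adv.std() + eps)   # batch-normalized A_{phi_i}
        total = total + weights[name] * norm
    return total

def ppo_policy_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate (negated for minimization) using the aggregated advantage."""
    ratio = torch.exp(log_prob_new - log_prob_old)       # alpha_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()

# Illustrative usage with the four reward groups from the paper
batch = 4096
advs = {g: torch.randn(batch) for g in ("task", "style", "regu", "post")}
w = {"task": 1.0, "style": 1.0, "regu": 1.0, "post": 1.0}   # placeholder weights
A = aggregate_advantages(advs, w)
loss = ppo_policy_loss(torch.randn(batch), torch.randn(batch), A)  # random stand-ins for log-probs
```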
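
The force curriculum of item 2 can be pictured as follows. The paper states only that the pulling force acts when the trunk is near-vertical and shrinks as the robot reaches the target height, so the thresholds, decay schedule, and names below are assumptions.

```python
def curriculum_pull_force(trunk_pitch, level, num_levels=10,
                          max_force=300.0, pitch_thresh=0.2):
    """Assumed schedule for the vertical pulling force F on the robot base.

    The force is only applied while the trunk is near-vertical and is scaled
    down as the curriculum level rises; all numbers here are placeholders.
    """
    if abs(trunk_pitch) > pitch_thresh:           # trunk not yet near-vertical
        return 0.0
    return max_force * (1.0 - level / num_levels)

def advance_curriculum(level, reached_target_height, num_levels=10):
    """Raise the curriculum level (weakening the force) after successful episodes."""
    return min(num_levels, level + 1) if reached_target_height else level
```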
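
Finally, a sketch of the L2C2-style smoothness term from item 3, treating the policy output as its mean action and using squared error in place of the distance metric $D$; the uniform sampling range and weight values are assumptions.

```python
import torch

def l2c2_loss(policy, critics, s_t, s_tp1, lambda_pi=1.0, lambda_v=1.0):
    """Smoothness regularization in the spirit of L2C2.

    Penalizes output changes between s_t and a state interpolated towards
    s_{t+1}; `policy` and each critic in `critics` are assumed to be
    torch.nn.Module instances taking a batch of states.
    """
    u = torch.rand(s_t.shape[0], 1)              # u sampled uniformly in [0, 1)
    s_bar = s_t + (s_tp1 - s_t) * u              # interpolated state \bar{s}_t
    pi_term = ((policy(s_t) - policy(s_bar)) ** 2).mean()
    v_term = sum(((V(s_t) - V(s_bar)) ** 2).mean() for V in critics)
    return lambda_pi * pi_term + lambda_v * v_term
```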

The framework is implemented using the Isaac Gym simulator with 4096 parallel environments and the 23-DoF Unitree G1 robot. The actor and critic networks are 3-layer and 2-layer Multilayer Perceptrons (MLPs), respectively.
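
A minimal sketch of network shapes consistent with this description follows; the hidden widths, activations, observation dimension, and the bounded Tanh output are assumptions, since the text specifies only the depths and the one-critic-per-reward-group design.

```python
import torch.nn as nn

# Assumed sizes: the paper fixes only the depths (3-layer actor, 2-layer critics).
obs_dim, act_dim, hidden = 64, 23, 256

actor = nn.Sequential(                     # 3-layer MLP policy
    nn.Linear(obs_dim, hidden), nn.ELU(),
    nn.Linear(hidden, hidden), nn.ELU(),
    nn.Linear(hidden, act_dim), nn.Tanh(),  # keeps each action dimension in [-1, 1]
)

critics = {                                # one 2-layer MLP critic per reward group
    name: nn.Sequential(nn.Linear(obs_dim, hidden), nn.ELU(), nn.Linear(hidden, 1))
    for name in ("task", "style", "regu", "post")
}
```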

The paper introduces the following evaluation metrics for standing-up control:

  • Success rate $E_{\mathrm{succ}}$.
  • Feet movement $E_{\mathrm{feet}}$.
  • Motion smoothness $E_{\mathrm{smth}}$.
  • Energy $E_{\mathrm{engy}}$.

Ablation studies validate the effect of each component. The results show that multi-critic learning, force curriculum, and motion constraints contribute to successful and smooth standing-up motions. For example, without the proposed force curriculum, the robot fails to stand up on all terrains except the platform.

Trajectory analysis using Uniform Manifold Approximation and Projection (UMAP) visualizes the robot’s motion across diverse terrains, showing distinct and consistent motion patterns.

Robustness analysis demonstrates that the policies withstand a wide range of external disturbances, achieving high success rates and efficient motion energy utilization.

Real robot experiments validate the simulation results: smoothness regularization improves motion quality, the controllers generalize well to outdoor environments, and the method achieves a 100% success rate with high motion smoothness across all tested scenes. The sim-to-real analysis shows that domain randomization reduces the sim-to-real gap.

The controllers also demonstrate emergent properties such as robustness to external disturbances, fall recovery, and dynamic balance. For example, they maintained stability even under significant disturbances, such as objects disrupting the robot's center of gravity, and handled payloads of up to 12 kg.
