
Learning Getting-Up Policies for Real-World Humanoid Robots (2502.12152v2)

Published 17 Feb 2025 in cs.RO and cs.LG

Abstract: Automatic fall recovery is a crucial prerequisite before humanoid robots can be reliably deployed. Hand-designing controllers for getting up is difficult because of the varied configurations a humanoid can end up in after a fall and the challenging terrains humanoid robots are expected to operate on. This paper develops a learning framework to produce controllers that enable humanoid robots to get up from varying configurations on varying terrains. Unlike previous successful applications of learning to humanoid locomotion, the getting-up task involves complex contact patterns (which necessitates accurately modeling the collision geometry) and sparser rewards. We address these challenges through a two-phase approach that induces a curriculum. The first stage focuses on discovering a good getting-up trajectory under minimal constraints on smoothness or speed / torque limits. The second stage then refines the discovered motions into deployable (i.e. smooth and slow) motions that are robust to variations in initial configuration and terrains. We find these innovations enable a real-world G1 humanoid robot to get up from two main situations that we considered: a) lying face up and b) lying face down, both tested on flat, deformable, slippery surfaces and slopes (e.g., sloppy grass and snowfield). This is one of the first successful demonstrations of learned getting-up policies for human-sized humanoid robots in the real world.

Summary

  • The paper introduces a two-phase RL framework that first discovers and then refines getting-up maneuvers for complex humanoid fall recovery.
  • It overcomes challenges such as high-dimensional kinematics, complex contact dynamics, and sparse rewards using curriculum learning and domain randomization.
  • The approach achieves successful sim-to-real transfer on the Unitree G1, enabling reliable recovery on varied terrains including flat, deformable, slippery, and sloped surfaces.

This paper addresses the critical challenge of automatic fall recovery for humanoid robots, a prerequisite for their reliable deployment in complex, real-world environments (2502.12152). Developing robust getting-up policies is non-trivial due to the high dimensionality of humanoid kinematics, the multitude of possible post-fall configurations, complex contact dynamics, and the variety of terrains encountered. Hand-crafting controllers for such scenarios is extremely difficult. The authors propose a learning framework based on reinforcement learning (RL) and a two-phase curriculum to generate controllers that enable a human-sized humanoid robot (the Unitree G1) to get up from various configurations on diverse terrains, demonstrating successful real-world deployment.

Challenges in Learning Getting-Up Policies

Learning getting-up behaviors presents distinct challenges compared to learning locomotion gaits. While locomotion often involves somewhat periodic motions and predictable contact sequences, getting up requires navigating complex, non-periodic, and often unstable intermediate states. Key difficulties include:

  1. Complex Contact Interactions: The robot interacts extensively and unpredictably with the ground using multiple body parts (limbs, torso). Accurate modeling of the robot's collision geometry and the resulting contact forces is crucial for successful simulation and sim-to-real transfer. This contrasts with locomotion where contact is often primarily through the feet.
  2. Sparse Rewards: Defining a dense reward signal that effectively guides the learning process from an arbitrary fallen state to a standing posture is difficult. Simple rewards based only on final success (e.g., reaching a target torso height) are often too sparse for standard RL algorithms to learn efficiently.
  3. High-Dimensional State and Action Spaces: Humanoid robots possess many degrees of freedom (DoF), leading to large state and action spaces that complicate exploration and policy optimization.
  4. Real-World Constraints: Policies deployed on physical hardware must respect joint torque limits, velocity limits, and produce smooth motions to avoid damaging the robot and ensure stability. They must also be robust to variations in initial configurations and terrain properties (friction, compliance, slope).

Two-Phase Curriculum Learning Framework

To address these challenges, the paper introduces a two-phase learning approach structured as a curriculum:

Phase 1: Discovery Phase

  • Objective: The primary goal of this phase is to discover feasible getting-up trajectories from specified initial fallen configurations (e.g., face up, face down) to a target standing height, without strict adherence to real-world deployability constraints.
  • Methodology: Deep reinforcement learning, specifically Proximal Policy Optimization (PPO), is employed in simulation. The focus is on achieving the goal state (standing) as quickly as possible.
  • Relaxed Constraints: Constraints on motion smoothness, joint velocities, and torques are significantly relaxed or removed. This allows the RL agent to explore aggressive and potentially jerky movements that might otherwise be penalized, facilitating the discovery of effective strategies for overcoming challenging intermediate states (e.g., breaking contact symmetry, leveraging momentum).
  • Reward Function: The reward function primarily incentivizes increasing the height of the robot's torso or head ($z_{torso}$) and reaching a target height ($z_{target}$), potentially with a penalty for excessive simulation time. A representative reward term might look like:

    $$R_{discovery} = w_{height} \cdot \Delta z_{torso} + w_{target} \cdot \mathbb{I}(z_{torso} > z_{target}) - w_{time}$$

    where $\mathbb{I}(\cdot)$ is the indicator function and the $w$ terms are weights. Additional terms might slightly penalize unstable states. (A minimal code sketch of this reward follows the list below.)

  • Simulation: Accurate modeling of collision meshes and contact parameters is emphasized in this phase to ensure the discovered strategies are physically plausible, even if not immediately deployable.
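The discovery reward above could be computed per simulation step roughly as follows. This is a minimal sketch: the weight values, target height, and function signature are illustrative assumptions, not the paper's implementation.

```python
# Illustrative weights and target torso height; the paper's actual values are not given here.
W_HEIGHT, W_TARGET, W_TIME = 1.0, 10.0, 0.01
Z_TARGET = 0.6  # hypothetical target torso height in meters

def discovery_reward(z_torso: float, z_torso_prev: float) -> float:
    """Reward height gain, a bonus for clearing the target height, and a small per-step time penalty."""
    height_gain = z_torso - z_torso_prev          # Δz_torso since the previous step
    reached = 1.0 if z_torso > Z_TARGET else 0.0  # indicator term I(z_torso > z_target)
    return W_HEIGHT * height_gain + W_TARGET * reached - W_TIME
```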

Phase 2: Refinement Phase

  • Objective: This phase refines the trajectories discovered in Phase 1 into smooth, slow, and robust policies suitable for deployment on the real robot.
  • Methodology: The successful trajectories from Phase 1 serve as reference motions or demonstrations. A new policy is trained using PPO, again in simulation, but with a different reward function and the incorporation of domain randomization.
  • Reward Function: The reward function is redesigned to heavily penalize jerky motions and excessive torques/velocities, while rewarding tracking of the reference trajectory from Phase 1 and maintaining stability. Key components include:
    • Smoothness Penalties: Penalties on the action rate ($\|\alpha_t - \alpha_{t-1}\|^2$), torque rate ($\|\tau_t - \tau_{t-1}\|^2$), and potentially joint accelerations.
    • Deployability Incentives: Penalties for exceeding joint torque and velocity limits, and potentially for energy consumption. Incentives for slower, more controlled movements might be added.
    • Reference Tracking: Reward for staying close to the kinematic state sequence ($q_{ref}, \dot{q}_{ref}$) obtained from Phase 1, e.g. $R_{track} = \exp(-\|q_t - q_{ref,t}\|^2 - \|\dot{q}_t - \dot{q}_{ref,t}\|^2)$.
    • Task Success: Reward for reaching and maintaining the target standing posture.
    • The overall reward combines these terms: $R_{refine} = w_{track} R_{track} + w_{smooth} R_{smooth} + w_{deploy} R_{deploy} + w_{task} R_{task}$. (A code sketch of this reward and the randomization sampling appears after this list.)
  • Domain Randomization: To enhance robustness for real-world deployment, various physical and environmental parameters are randomized during training in this phase. This includes:
    • Robot dynamics parameters (masses, inertias, motor friction).
    • Terrain properties (friction coefficient, compliance/stiffness, slope angle).
    • Sensor noise and latency.
    • Initial state variations around the nominal fallen poses.
  • Policy Architecture: A Multi-Layer Perceptron (MLP) is typically used for the policy network, mapping observations to actions.
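Below is a sketch of how the refinement reward components and the per-episode domain randomization described above might be implemented. The weights, torque limits, and randomization ranges are assumptions for illustration and are not values reported in the paper.

```python
import numpy as np

# Illustrative weights; the paper's actual coefficients are not reproduced here.
W_TRACK, W_SMOOTH, W_DEPLOY, W_TASK = 1.0, 0.5, 0.5, 1.0

def refinement_reward(q, dq, q_ref, dq_ref, action, prev_action, tau, tau_limit,
                      z_torso, z_target):
    """Combine tracking, smoothness, deployability, and task-success terms."""
    # Reference tracking: exp(-||q - q_ref||^2 - ||dq - dq_ref||^2)
    r_track = np.exp(-np.sum((q - q_ref) ** 2) - np.sum((dq - dq_ref) ** 2))
    # Smoothness: penalize the action rate (torque-rate/acceleration penalties are analogous)
    r_smooth = -np.sum((action - prev_action) ** 2)
    # Deployability: penalize torques that exceed their limits
    r_deploy = -np.sum(np.maximum(np.abs(tau) - tau_limit, 0.0) ** 2)
    # Task success: bonus for reaching/maintaining the target torso height
    r_task = 1.0 if z_torso > z_target else 0.0
    return (W_TRACK * r_track + W_SMOOTH * r_smooth
            + W_DEPLOY * r_deploy + W_TASK * r_task)

def sample_domain_randomization(rng: np.random.Generator) -> dict:
    """Sample per-episode dynamics and terrain parameters (ranges are placeholders)."""
    return {
        "mass_scale": rng.uniform(0.9, 1.1),       # scale link masses/inertias
        "friction": rng.uniform(0.3, 1.2),         # ground friction coefficient
        "stiffness_scale": rng.uniform(0.7, 1.3),  # terrain compliance/stiffness
        "slope_deg": rng.uniform(-10.0, 10.0),     # terrain slope angle
        "latency_steps": int(rng.integers(0, 3)),  # sensor/actuation latency in control steps
    }
```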

Implementation Details

  • Simulation Environment: The specific simulator used is not explicitly stated in the provided abstract, but platforms like MuJoCo or Isaac Gym are common choices for this type of work due to their handling of complex contacts and potential for GPU acceleration. Accurate collision meshes representing the G1 robot's geometry are essential.
  • State Representation ($O_t$): The observation space likely includes:
    • Joint positions ($q_t$) and velocities ($\dot{q}_t$).
    • Torso orientation (e.g., quaternion) and angular velocity ($\omega_{torso}$).
    • Root position ($p_{root}$) and linear velocity ($v_{root}$), potentially relative to a target frame.
    • Previous action ($\alpha_{t-1}$).
    • Potentially, terrain parameters if adaptive behavior is desired (though the abstract suggests robustness via randomization rather than explicit adaptation).
    • Clock/phase variable if cyclic motions were involved (less likely for getting up).
  • Action Space ($A_t$): The action space typically consists of target positions for the robot's joint PD controllers. The policy outputs target joint angles $\alpha_t$, and low-level controllers compute torques: $\tau = K_p(\alpha_t - q_t) - K_d \dot{q}_t$ (see the sketch after this list).
  • Training: PPO is used with standard hyperparameters, likely involving large batch sizes collected from parallel simulation environments. The two-phase structure implies two distinct training runs. Phase 1 generates reference trajectories, which are then used to initialize or guide the training in Phase 2.
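To make the observation/action interface above concrete, here is a minimal sketch of the low-level PD step and observation assembly. The gains and the exact observation layout are assumptions, not the G1's actual configuration.

```python
import numpy as np

# Hypothetical PD gains; on hardware these would be tuned per joint.
KP, KD = 40.0, 1.0

def pd_torque(alpha_t, q_t, dq_t, kp=KP, kd=KD):
    """Low-level PD control: tau = Kp * (alpha_t - q_t) - Kd * dq_t."""
    return kp * (alpha_t - q_t) - kd * dq_t

def build_observation(q_t, dq_t, torso_quat, torso_ang_vel, prev_action):
    """Concatenate proprioceptive signals into the policy input vector O_t."""
    return np.concatenate([q_t, dq_t, torso_quat, torso_ang_vel, prev_action])
```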

Real-World Deployment and Results

The framework was successfully deployed on a Unitree G1 humanoid robot.

  • Sim-to-Real Transfer: The robustness achieved through domain randomization in Phase 2 proved sufficient for zero-shot or few-shot transfer to the physical robot without extensive real-world fine-tuning. The accurate collision modeling in simulation was likely a key enabler.
  • Experimental Validation: The learned policies enabled the G1 robot to successfully get up from two primary initial configurations: lying face up and lying face down.
  • Terrain Robustness: Crucially, the policies demonstrated robustness across various challenging terrains beyond simple flat ground. Successful trials were conducted on:
    • Flat surfaces
    • Deformable surfaces (e.g., grass)
    • Slippery surfaces (e.g., snowfield)
    • Sloped surfaces (combining slope with potentially deformable/slippery characteristics)
  • Key Claim: The authors state this is one of the first successful demonstrations of learned getting-up policies for human-sized humanoid robots operating in real-world conditions on varied terrains. This signifies a notable advancement in applying RL to complex, contact-rich tasks for full-scale humanoids. While success rates or quantitative metrics comparing terrains are not detailed in the abstract, the accompanying project page (humanoid-getup.github.io) likely provides visual evidence and potentially more detailed results.

Conclusion

The work presents a significant step towards autonomous humanoid operation by tackling the essential problem of fall recovery (2502.12152). The proposed two-phase learning framework effectively decomposes the problem: first discovering dynamically feasible maneuvers by relaxing constraints, and then refining these into smooth, robust, and deployable policies using reference tracking and domain randomization. The successful real-world deployment on a G1 humanoid across challenging terrains underscores the practicality and effectiveness of the approach, paving the way for more resilient humanoid robots capable of operating autonomously in unstructured environments.
