- The paper introduces a two-phase RL framework that first discovers and then refines getting-up maneuvers for complex humanoid fall recovery.
- It overcomes challenges such as high-dimensional kinematics, complex contact dynamics, and sparse rewards through curriculum learning and domain randomization.
- The approach achieves successful sim-to-real transfer on the Unitree G1, enabling reliable recovery on varied terrains including flat, deformable, slippery, and sloped surfaces.
This paper addresses the critical challenge of automatic fall recovery for humanoid robots, a prerequisite for their reliable deployment in complex, real-world environments (2502.12152). Developing robust getting-up policies is non-trivial due to the high dimensionality of humanoid kinematics, the multitude of possible post-fall configurations, complex contact dynamics, and the variety of terrains encountered. Hand-crafting controllers for such scenarios is extremely difficult. The authors propose a learning framework based on reinforcement learning (RL) and a two-phase curriculum to generate controllers that enable a human-sized humanoid robot (the Unitree G1) to get up from various configurations on diverse terrains, demonstrating successful real-world deployment.
Challenges in Learning Getting-Up Policies
Learning getting-up behaviors presents distinct challenges compared to learning locomotion gaits. While locomotion often involves somewhat periodic motions and predictable contact sequences, getting up requires navigating complex, non-periodic, and often unstable intermediate states. Key difficulties include:
- Complex Contact Interactions: The robot interacts extensively and unpredictably with the ground using multiple body parts (limbs, torso). Accurate modeling of the robot's collision geometry and the resulting contact forces is crucial for successful simulation and sim-to-real transfer. This contrasts with locomotion where contact is often primarily through the feet.
- Sparse Rewards: Defining a dense reward signal that effectively guides the learning process from an arbitrary fallen state to a standing posture is difficult. Simple rewards based only on final success (e.g., reaching a target torso height) are often too sparse for standard RL algorithms to learn from efficiently (see the minimal example after this list).
- High-Dimensional State and Action Spaces: Humanoid robots possess many degrees of freedom (DoF), leading to large state and action spaces that complicate exploration and policy optimization.
- Real-World Constraints: Policies deployed on physical hardware must respect joint torque limits, velocity limits, and produce smooth motions to avoid damaging the robot and ensure stability. They must also be robust to variations in initial configurations and terrain properties (friction, compliance, slope).
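To make the sparse-reward difficulty concrete, here is a minimal Python sketch of the kind of success-only reward described above; the 0.55 m target height and the unit bonus are hypothetical values, not taken from the paper.

```python
def sparse_success_reward(z_torso: float, z_target: float = 0.55) -> float:
    """Success-only reward: nonzero only once the torso reaches standing height.

    Starting from a fallen pose, random exploration essentially never triggers
    the bonus, so this signal alone gives a standard RL algorithm nothing to
    improve on; hence the curriculum and shaped rewards used in the paper.
    """
    return 1.0 if z_torso > z_target else 0.0
```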
Two-Phase Curriculum Learning Framework
To address these challenges, the paper introduces a two-phase learning approach structured as a curriculum:
Phase 1: Discovery Phase
- Objective: The primary goal of this phase is to discover feasible getting-up trajectories from specified initial fallen configurations (e.g., face up, face down) to a target standing height, without strict adherence to real-world deployability constraints.
- Methodology: Deep reinforcement learning, specifically Proximal Policy Optimization (PPO), is employed in simulation. The focus is on achieving the goal state (standing) as quickly as possible.
- Relaxed Constraints: Constraints on motion smoothness, joint velocities, and torques are significantly relaxed or removed. This allows the RL agent to explore aggressive and potentially jerky movements that might otherwise be penalized, facilitating the discovery of effective strategies for overcoming challenging intermediate states (e.g., breaking contact symmetry, leveraging momentum).
- Reward Function: The reward function primarily incentivizes increasing the height of the robot's torso or head ($z_{\text{torso}}$) and reaching a target height ($z_{\text{target}}$), potentially with a penalty for excessive simulation time. A representative reward term might look like:
  $$R_{\text{discovery}} = w_{\text{height}} \cdot \Delta z_{\text{torso}} + w_{\text{target}} \cdot \mathbb{I}(z_{\text{torso}} > z_{\text{target}}) - w_{\text{time}}$$
  where $\mathbb{I}(\cdot)$ is the indicator function and the $w$ terms are weights. Additional terms might lightly penalize unstable states. A minimal code sketch of such a reward follows the Phase 1 list.
- Simulation: Accurate modeling of collision meshes and contact parameters is emphasized in this phase to ensure the discovered strategies are physically plausible, even if not immediately deployable.
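As a concrete illustration of the discovery-phase reward sketched above, the following Python function computes the height-progress term, target bonus, and time penalty for a single simulation step. The function name, weights, and 0.55 m target height are assumptions made for illustration, not values reported in the paper.

```python
def discovery_reward(z_torso: float, z_torso_prev: float,
                     z_target: float = 0.55,
                     w_height: float = 5.0,
                     w_target: float = 10.0,
                     w_time: float = 0.1) -> float:
    """Height-driven discovery reward (illustrative weights).

    Rewards upward progress of the torso, adds a bonus once the target
    standing height is exceeded, and applies a small per-step time penalty
    so the policy prefers fast get-up strategies.
    """
    height_gain = z_torso - z_torso_prev          # Delta z_torso
    reached = 1.0 if z_torso > z_target else 0.0  # indicator I(z_torso > z_target)
    return w_height * height_gain + w_target * reached - w_time
```

In training, this term would be accumulated over the episode, with small additional penalties (e.g., for self-collisions or extreme postures) appended as needed.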
Phase 2: Refinement Phase
- Objective: This phase refines the trajectories discovered in Phase 1 into smooth, slow, and robust policies suitable for deployment on the real robot.
- Methodology: The successful trajectories from Phase 1 serve as reference motions or demonstrations. A new policy is trained using PPO, again in simulation, but with a different reward function and the incorporation of domain randomization.
- Reward Function: The reward function is redesigned to heavily penalize jerky motions and excessive torques/velocities, while rewarding tracking of the reference trajectory from Phase 1 and maintaining stability. Key components include:
- Smoothness Penalties: Penalties on the action rate ($\|\alpha_t - \alpha_{t-1}\|^2$), the torque rate ($\|\tau_t - \tau_{t-1}\|^2$), and potentially joint accelerations.
- Deployability Incentives: Penalties for exceeding joint torque and velocity limits, and potentially for energy consumption; incentives for slower, more controlled movements might also be added.
- Reference Tracking: Reward for staying close to the kinematic reference sequence ($q_{\text{ref}}, \dot{q}_{\text{ref}}$) obtained from Phase 1, e.g. $R_{\text{track}} = \exp\left(-\|q_t - q_{\text{ref},t}\|^2 - \|\dot{q}_t - \dot{q}_{\text{ref},t}\|^2\right)$.
- Task Success: A continuing reward for reaching and then maintaining the target standing posture.
- The overall reward combines these terms: $R_{\text{refine}} = w_{\text{track}} R_{\text{track}} + w_{\text{smooth}} R_{\text{smooth}} + w_{\text{deploy}} R_{\text{deploy}} + w_{\text{task}} R_{\text{task}}$. A minimal sketch of such a combined reward, together with example randomization ranges, follows the Phase 2 list.
- Domain Randomization: To enhance robustness for real-world deployment, various physical and environmental parameters are randomized during training in this phase, including:
- Robot dynamics parameters (masses, inertias, motor friction).
- Terrain properties (friction coefficient, compliance/stiffness, slope angle).
- Sensor noise and latency.
- Initial state variations around the nominal fallen poses.
- Policy Architecture: A Multi-Layer Perceptron (MLP) is typically used for the policy network, mapping observations to actions.
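To ground the refinement-phase description, the sketch below combines tracking, smoothness, deployability, and task terms into a single per-step reward and shows one way per-episode domain-randomization parameters could be sampled. All names, weights, limits, and ranges are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np


def refinement_reward(q, qd, q_ref, qd_ref,        # joint pos/vel and Phase-1 reference
                      a, a_prev, tau, tau_prev,    # current/previous actions and torques
                      z_torso, z_target=0.55,
                      tau_limit=88.0, qd_limit=20.0,  # illustrative limits, not G1 specs
                      w_track=1.0, w_smooth=0.5, w_deploy=0.5, w_task=1.0):
    """Per-step refinement reward (illustrative sketch, not the paper's exact terms)."""
    # Reference tracking: stay close to the Phase-1 trajectory.
    r_track = np.exp(-np.sum((q - q_ref) ** 2) - 0.1 * np.sum((qd - qd_ref) ** 2))

    # Smoothness: penalize action-rate and torque-rate (enters as a negative term).
    r_smooth = -np.sum((a - a_prev) ** 2) - 1e-4 * np.sum((tau - tau_prev) ** 2)

    # Deployability: penalize torque and joint-velocity limit violations.
    r_deploy = (-np.sum(np.maximum(np.abs(tau) - tau_limit, 0.0))
                - np.sum(np.maximum(np.abs(qd) - qd_limit, 0.0)))

    # Task success: bonus while the torso is at or above standing height.
    r_task = 1.0 if z_torso > z_target else 0.0

    return w_track * r_track + w_smooth * r_smooth + w_deploy * r_deploy + w_task * r_task


def sample_randomization(rng=None):
    """Example per-episode domain-randomization draw (assumed ranges)."""
    rng = rng or np.random.default_rng()
    return {
        "friction": rng.uniform(0.3, 1.2),        # terrain friction coefficient
        "mass_scale": rng.uniform(0.9, 1.1),      # link-mass scaling factor
        "motor_friction": rng.uniform(0.0, 0.1),  # added joint friction
        "slope_deg": rng.uniform(0.0, 10.0),      # terrain slope
        "obs_latency_s": rng.uniform(0.0, 0.02),  # sensor latency
    }
```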
Implementation Details
- Simulation Environment: The specific simulator used is not explicitly stated in the provided abstract, but platforms like MuJoCo or Isaac Gym are common choices for this type of work due to their handling of complex contacts and potential for GPU acceleration. Accurate collision meshes representing the G1 robot's geometry are essential.
- State Representation ($O_t$): The observation space likely includes:
- Joint positions ($q_t$) and velocities ($\dot{q}_t$).
- Torso orientation (e.g., a quaternion) and angular velocity ($\omega_{\text{torso}}$).
- Root position ($p_{\text{root}}$) and linear velocity ($v_{\text{root}}$), potentially relative to a target frame.
- Previous action ($\alpha_{t-1}$).
- Potentially, terrain parameters if adaptive behavior is desired (though the abstract suggests robustness via randomization rather than explicit adaptation).
- Clock/phase variable if cyclic motions were involved (less likely for getting up).
- Action Space ($A_t$): The action space typically consists of target positions for the robot's joint PD controllers. The policy outputs target joint angles $\alpha_t$, and low-level controllers compute torques as $\tau = K_p(\alpha_t - q_t) - K_d \dot{q}_t$; a minimal control-step sketch follows this list.
- Training: PPO is used with standard hyperparameters, likely involving large batch sizes collected from parallel simulation environments. The two-phase structure implies two distinct training runs. Phase 1 generates reference trajectories, which are then used to initialize or guide the training in Phase 2.
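Under the assumptions listed above, the snippet below sketches how an observation vector might be assembled and how a target-position action is converted into torques by joint-level PD control, $\tau = K_p(\alpha_t - q_t) - K_d \dot{q}_t$. The joint count, gains, and stubbed policy output are illustrative placeholders, not values from the paper.

```python
import numpy as np

NUM_JOINTS = 23       # illustrative DoF count for a G1-class humanoid
KP, KD = 60.0, 2.0    # illustrative PD gains (real deployments use per-joint gains)


def build_observation(q, qd, torso_quat, torso_ang_vel, prev_action):
    """Concatenate the proprioceptive quantities listed above into O_t."""
    return np.concatenate([q, qd, torso_quat, torso_ang_vel, prev_action])


def pd_torques(alpha_t, q, qd, kp=KP, kd=KD):
    """tau = Kp * (alpha_t - q) - Kd * qd for a target-position action space."""
    return kp * (alpha_t - q) - kd * qd


# One control step with a stubbed policy (a trained MLP would map obs -> alpha_t).
q, qd = np.zeros(NUM_JOINTS), np.zeros(NUM_JOINTS)
obs = build_observation(q, qd,
                        torso_quat=np.array([1.0, 0.0, 0.0, 0.0]),
                        torso_ang_vel=np.zeros(3),
                        prev_action=np.zeros(NUM_JOINTS))
alpha_t = np.zeros(NUM_JOINTS)      # placeholder for policy(obs)
tau = pd_torques(alpha_t, q, qd)    # torques sent to the motor controllers
```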
Real-World Deployment and Results
The framework was successfully deployed on a Unitree G1 humanoid robot.
- Sim-to-Real Transfer: The robustness achieved through domain randomization in Phase 2 proved sufficient for zero-shot or few-shot transfer to the physical robot without extensive real-world fine-tuning. The accurate collision modeling in simulation was likely a key enabler.
- Experimental Validation: The learned policies enabled the G1 robot to successfully get up from two primary initial configurations: lying face up and lying face down.
- Terrain Robustness: Crucially, the policies demonstrated robustness across various challenging terrains beyond simple flat ground. Successful trials were conducted on:
- Flat surfaces
- Deformable surfaces (e.g., grass)
- Slippery surfaces (e.g., snowfield)
- Sloped surfaces (combining slope with potentially deformable/slippery characteristics)
- Key Claim: The authors state this is the first demonstration of learned getting-up policies for human-sized humanoid robots successfully operating in real-world conditions on varied terrains. This signifies a notable advancement in applying RL to complex, contact-rich tasks for full-scale humanoids. While success rates or quantitative metrics comparing terrains are not detailed in the abstract, the accompanying project page (humanoid-getup.github.io) likely provides visual evidence and potentially more detailed results.
Conclusion
The work presents a significant step towards autonomous humanoid operation by tackling the essential problem of fall recovery (2502.12152). The proposed two-phase learning framework effectively decomposes the problem: first discovering dynamically feasible maneuvers by relaxing constraints, and then refining these into smooth, robust, and deployable policies using reference tracking and domain randomization. The successful real-world deployment on a G1 humanoid across challenging terrains underscores the practicality and effectiveness of the approach, paving the way for more resilient humanoid robots capable of operating autonomously in unstructured environments.