Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking

Published 19 Apr 2026 in cs.RO | (2604.17335v1)

Abstract: Whole-body humanoid locomotion is challenging due to high-dimensional control, morphological instability, and the need for real-time adaptation to various terrains using onboard perception. Directly applying reinforcement learning (RL) with reward shaping to humanoid locomotion often leads to lower-body-dominated behaviors, whereas imitation-based RL can learn more coordinated whole-body skills but is typically limited to replaying reference motions without a mechanism to adapt them online from perception for terrain-aware locomotion. To address this gap, we propose a whole-body humanoid locomotion framework that combines skills learned from reference motions with terrain-aware adaptation. We first train a diffusion model on retargeted human motions for real-time prediction of terrain-aware reference motions. Concurrently, we train a whole-body reference tracker with RL using this motion data. To improve robustness under imperfectly generated references, we further fine-tune the tracker with a frozen motion generator in a closed-loop setting. The resulting system supports directional goal-reaching control with terrain-aware whole-body adaptation, and can be deployed on a Unitree G1 humanoid robot with onboard perception and computation. The hardware experiments demonstrate successful traversal over boxes, hurdles, stairs, and mixed terrain combinations. Quantitative results further show the benefits of incorporating online motion generation and fine-tuning the motion tracker for improved generalization and robustness.


Authors (10)

Summary

The paper introduces a unified framework that combines diffusion-based motion generation with reinforcement learning to achieve robust whole-body humanoid locomotion.
It employs a staged training process with offline pre-training and RL fine-tuning, effectively bridging the gap between expressive human motions and real-world control.
Hardware experiments on the Unitree G1 robot demonstrate emergent adaptive behaviors on complex, unstructured terrain, and simulation-based ablations report mean success rates above 0.98 when online motion generation is used.

Introduction

This paper addresses whole-body perceptive humanoid locomotion in unstructured environments by proposing an integrated framework that merges diffusion-based motion generation with RL-based motion tracking. Traditional RL approaches with reward shaping often yield lower-body-dominated strategies that lack the coordination required for nontrivial terrain traversal. Motion imitation methods learn more coordinated whole-body skills but typically replay reference motions without a mechanism to adapt them online from perception. The authors instead use retargeted human motions to train a diffusion model for real-time, terrain-aware reference motion prediction. This generative component is coupled with a whole-body RL-based reference tracker, which is then fine-tuned in closed loop with the frozen generator to improve robustness and generalization, bridging the gap between expressive human-derived motion and robust real-world control.

Figure 1: Whole-body Unitree G1 robot locomotion, achieved via diffusion-driven motion generation and RL-based motion tracking, performing box climbing, vaulting, and mixed terrain traversal.

Methodology

Data Curation and Preprocessing

The pipeline begins by building a whole-body motion dataset from both proprietary human motion videos and large-scale public datasets. The data cover key locomotion primitives: climbing, vaulting, jumping down, stair navigation, and omnidirectional walking. A contact-constrained IK solver retargets human motion to the robot’s kinematics, and further motion augmentation increases terrain diversity by perturbing obstacle geometries and adding physically plausible variations. The process is intended to keep the resulting trajectories feasible for execution on the real robot.

Training Framework

The proposed system consists of three sequential stages:

Pre-training: Offline reference motions are used to train, concurrently, a diffusion-based motion generator and a DeepMimic-style RL-based motion tracker. The generator predicts future kinematic reference motions conditioned on target direction, local terrain, and the recent motion history, whereas the tracker learns to output executable joint trajectories that closely track these references while incorporating exteroceptive terrain scans.
RL Fine-tuning: To mitigate the distribution mismatch between the offline reference motions seen in pre-training and the imperfect references the generator produces at deployment, the tracker is further fine-tuned with RL while the generator is frozen. This closed-loop adaptation exposes the controller to generator imperfections and broader environmental variations, improving its robustness and performance in new environments.
Onboard Deployment: The architecture is optimized for deployment using onboard LiDAR, IMU, and GPU-accelerated inference, reconstructing terrain in real time and updating references in a receding-horizon fashion.
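The receding-horizon interaction between generator and tracker described in the stages above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_reference` and `track` are hypothetical stand-ins for the diffusion generator and the RL tracker, and `HORIZON`, `REPLAN_EVERY`, `POSE_DIM`, and the flat terrain patch are made-up values chosen only to keep the example small.

```python
import numpy as np

HORIZON = 16       # assumed number of future frames per generated plan
REPLAN_EVERY = 4   # assumed number of control steps between generator calls
POSE_DIM = 6       # toy pose size, far smaller than a real humanoid state


def generate_reference(direction, terrain, past):
    """Hypothetical stand-in for the diffusion generator.

    Returns HORIZON future poses conditioned on the target direction,
    a local terrain patch, and the recent motion history.
    """
    step = np.zeros(POSE_DIM)
    step[:2] = 0.05 * direction               # advance along the target direction
    frames = past[-1] + np.outer(np.arange(1, HORIZON + 1), step)
    frames[:, 2] = terrain.mean()             # toy height adaptation to the terrain
    return frames


def track(state, target, terrain):
    """Hypothetical stand-in for the RL tracker: a proportional step to the target."""
    return 0.5 * (target - state)


def run(num_steps=20):
    state = np.zeros(POSE_DIM)
    history = [state.copy()]
    direction = np.array([1.0, 0.0])
    plan, plan_age, replans = None, 0, 0
    for _ in range(num_steps):
        terrain = np.full((8, 8), 0.1)        # placeholder for a reconstructed height map
        # Receding horizon: discard the rest of the old plan and regenerate.
        if plan is None or plan_age >= REPLAN_EVERY:
            plan = generate_reference(direction, terrain, np.array(history[-4:]))
            plan_age, replans = 0, replans + 1
        state = state + track(state, plan[plan_age], terrain)
        history.append(state.copy())
        plan_age += 1
    return state, replans
```

Only the first few frames of each plan are consumed before the generator is queried again, so fresh terrain observations keep updating the reference. During the fine-tuning stage the same loop would run in simulation with the generator's weights held fixed while the tracker is updated.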
Figure 2: Schematic of the training pipeline. Motion data is curated, the diffusion generator and RL tracker are pre-trained, and the tracker is subsequently fine-tuned within the closed-loop system.
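For readers unfamiliar with DeepMimic-style tracking, the imitation reward is typically a weighted sum of exponentiated negative squared errors between the robot state and the reference. The sketch below shows that general form only. The chosen terms (joint positions, joint velocities, root position), the weights, and the scales are illustrative assumptions; this summary does not specify the paper's actual reward.

```python
import numpy as np


def tracking_reward(q, q_ref, dq, dq_ref, root, root_ref,
                    weights=(0.6, 0.1, 0.3), scales=(2.0, 0.1, 10.0)):
    """DeepMimic-style imitation reward (illustrative weights and scales).

    Each term is exp(-scale * squared_error), so it equals 1 for perfect
    tracking and decays smoothly toward 0 as the error grows.
    """
    errors = (
        np.sum((q - q_ref) ** 2),        # joint position error
        np.sum((dq - dq_ref) ** 2),      # joint velocity error
        np.sum((root - root_ref) ** 2),  # root position error
    )
    return float(sum(w * np.exp(-k * e)
                     for w, k, e in zip(weights, scales, errors)))
```

Because the weights sum to one, the reward is bounded in (0, 1], which keeps the imitation term on a predictable scale when it is combined with other objectives.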

Experimental Results

Hardware Validation

The integrated system is validated on the Unitree G1 robot in a range of physically challenging scenarios: high box climbing and descent (up to 75 cm), vaulting over obstacles of variable height, continuous stair traversal, and compound terrain sequences not observed during training or fine-tuning. The robot demonstrates emergent adaptive behaviors, utilizing its hands, knees, and dynamic reorientation to negotiate complex transitions. The local navigation capabilities are nontrivial—if confronted with an unfavorable initial state relative to the obstacle and target, the system deviates from the generator’s reference to circumvent the obstacle, illustrating closed-loop task adaptation.

Figure 3: Hardware experiment outcomes. The robot climbs boxes, executes jumps, traverses stairs, clears vaults, and navigates compound terrains, demonstrating versatile whole-body skills.

Quantitative Analysis

Simulation-based ablation studies reveal two decisive findings:

Online Motion Generation: Comparing fixed-reference tracking with terrain-conditioned online reference generation shows that including the motion generator significantly improves success rates on terrain modifications not seen in training. For large perturbations in obstacle height or angle, the fixed-reference tracker’s performance collapses while the generator-augmented system maintains high success rates (mean >0.98), demonstrating improved generalization and adaptability.
RL Fine-tuning Importance: Comparing the tracker with and without fine-tuning shows substantial improvements in robustness once it has adapted to the distribution of generator-produced references. Notably, fine-tuning yields consistent success across all tasks, including those with significant distribution shift and increased kinematic demands.
Figure 4: Success rate comparison with and without fine-tuning across five terrain traversal tasks. RL fine-tuning under frozen generator consistently boosts robustness and task completion.
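A per-task success-rate comparison of this kind reduces to counting successful rollouts per (task, variant) pair. The helper below is a minimal sketch of that bookkeeping; the function name, the tuple layout, and the task and variant labels are hypothetical, and the sample trials in the usage lines are toy placeholders, not results from the paper.

```python
from collections import defaultdict


def success_rates(trials):
    """Aggregate rollout outcomes into a success rate per (task, variant).

    `trials` is an iterable of (task, variant, succeeded) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [successes, total]
    for task, variant, succeeded in trials:
        counts[(task, variant)][0] += int(succeeded)
        counts[(task, variant)][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


# Toy placeholder outcomes, for illustration only.
toy_trials = [
    ("stairs", "fine-tuned", True),
    ("stairs", "fine-tuned", True),
    ("stairs", "pre-trained", True),
    ("stairs", "pre-trained", False),
]
rates = success_rates(toy_trials)
```

Keeping the variant as part of the key makes it straightforward to tabulate the two trackers side by side for each terrain task.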

Theoretical and Practical Implications

This system advances the state of the art in integrating generative models with policy optimization for real robot control. The results indicate that diffusion-based motion generation, when tightly coupled with closed-loop RL fine-tuning, enables robust and expressive whole-body motion across unpredictable and diverse terrain. Notably, the system avoids the need for cumbersome hand-engineered skill composition, relying instead on unified motion priors for rapid adaptation.

On the theoretical front, the framework highlights the importance of closed-loop adaptation for handling perceptual uncertainties and model errors, pointing to a general design principle for deploying expressive generative models in hardware-constrained settings. The observed emergent behaviors underline the value of interleaving policy optimization with structurally rich motion data, a concept that extends naturally to loco-manipulation tasks and scalable skill repertoires.

Limitations and Future Directions

While the presented approach demonstrates broad terrain generalization, it is perceptually bottlenecked by the fidelity of onboard LiDAR-based terrain reconstruction. Failures in perception can severely undermine the motion planning chain, motivating future research toward robust neural or multi-modal environmental representations—similar to methods involving attention-based map encoders or proprioceptive error correction. Additionally, extending the framework toward real-time loco-manipulation and multi-agent whole-body interactions presents a promising avenue.

Conclusion

This work presents a validated and deployable architecture for whole-body humanoid locomotion combining diffusion-based motion generation with reinforcement-learned motion tracking. The system robustly executes diverse, context-sensitive whole-body skills across nontrivial terrains, with extensive analyses underscoring the contribution of both trajectory generation and policy fine-tuning to generalization and robustness. Future exploration of more perceptually robust and general skill frameworks is warranted, aiming for fully autonomous operation in highly unstructured real-world settings.
