- The paper introduces a unified framework that combines diffusion-based motion generation with reinforcement learning to achieve robust whole-body humanoid locomotion.
- It employs a staged training process with offline pre-training and RL fine-tuning, effectively bridging the gap between expressive human motions and real-world control.
- Experimental results on the Unitree G1 robot demonstrate high success rates (>98%) and emergent adaptive behaviors in complex, unstructured terrains.
Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
Introduction
This paper addresses whole-body perceptive humanoid locomotion in unstructured environments by proposing an integrated framework that merges diffusion-based motion generation with RL-based motion tracking. Traditional RL approaches often yield lower-body-dominated strategies that lack the coordination required for nontrivial terrain traversal. Motion imitation methods improve skill transfer but adapt poorly to new and diverse environments. The authors introduce a paradigm in which retargeted human motions are used to train a diffusion model for real-time, terrain-aware reference motion prediction. This generative component is coupled with a whole-body RL-based reference tracker, and the system is subsequently fine-tuned to enhance robustness and generalization, thereby bridging the gap between expressive human-derived motion and robust real-world control.
Figure 1: Whole-body Unitree G1 robot locomotion, achieved via diffusion-driven motion generation and RL-based motion tracking, performing box climbing, vaulting, and mixed terrain traversal.
Methodology
Data Curation and Preprocessing
The pipeline begins with the construction of a whole-body motion dataset derived from both proprietary human motion videos and large-scale public datasets. The data cover key locomotion primitives: climbing, vaulting, jumping down, stair navigation, and omnidirectional walking. A contact-constrained IK solver retargets human motion to the robot’s kinematics, and further motion augmentation increases terrain diversity by perturbing obstacle geometries and adding physically plausible variations. The process is intended to ensure that the resulting trajectories are dynamically feasible for real robot execution.
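As a rough illustration of what "perturbing obstacle geometries" could look like, the sketch below samples bounded random variations of a box obstacle. The `Box` type, the `perturb_box` and `augment` helpers, and the ±10% range are all hypothetical and are not taken from the paper; how the paired reference motion is adjusted to each perturbed obstacle is not sketched here.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """Hypothetical obstacle description (not from the paper)."""
    height_m: float
    depth_m: float


def perturb_box(box: Box, rng: random.Random, max_rel_change: float = 0.1) -> Box:
    """Scale height and depth independently by a factor in [1 - r, 1 + r]."""
    h_scale = 1.0 + rng.uniform(-max_rel_change, max_rel_change)
    d_scale = 1.0 + rng.uniform(-max_rel_change, max_rel_change)
    return replace(box, height_m=box.height_m * h_scale, depth_m=box.depth_m * d_scale)


def augment(box: Box, n: int, seed: int = 0) -> list:
    """Return n perturbed variants of one obstacle, reproducible via the seed."""
    rng = random.Random(seed)
    return [perturb_box(box, rng) for _ in range(n)]


if __name__ == "__main__":
    for variant in augment(Box(height_m=0.5, depth_m=0.4), n=3):
        print(variant)
```

Keeping the perturbation bounded is one simple way to stay close to geometries for which the original motion remains plausible; the actual bounds would depend on the robot and the skill.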
Training Framework
The proposed system consists of three sequential stages:
- Pre-training: Offline reference motions are used to jointly train a diffusion-based motion generator and a DeepMimic-style RL-based motion tracker. The generator produces kinematic futures, conditioned on target direction, local terrain, and the recent past, whereas the tracker learns to output executable joint trajectories that closely track these references while incorporating exteroceptive terrain scans.
- RL Fine-tuning: To mitigate the distribution mismatch between offline-generated references and real deployment scenarios, the tracker is further fine-tuned with RL while the generator is frozen. This closed-loop adaptation exposes the controller to model imperfections and broader environmental variations, substantially elevating its robustness and performance in new environments.
- Onboard Deployment: The architecture is optimized for deployment using onboard LiDAR, IMU, and GPU-accelerated inference, reconstructing terrain in real time and updating references in a receding-horizon fashion.
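The receding-horizon reference update mentioned in the deployment stage can be sketched with a toy one-dimensional example. Everything here is a stand-in: `stub_generator` replaces the diffusion generator with straight-line steps toward a target, `stub_tracker` replaces the RL tracker with an imperfect proportional step, and the horizon and replanning interval are illustrative values rather than the paper's settings.

```python
def stub_generator(state: float, target: float, horizon: int, step: float = 0.1) -> list:
    """Hypothetical generator: a short plan of waypoints from the current state."""
    refs, x = [], state
    for _ in range(horizon):
        x += max(-step, min(step, target - x))  # bounded step toward the target
        refs.append(x)
    return refs


def stub_tracker(state: float, ref: float, gain: float = 0.8) -> float:
    """Hypothetical imperfect tracker: closes only part of the gap to the reference."""
    return state + gain * (ref - state)


def run_receding_horizon(state: float, target: float, total_steps: int,
                         horizon: int = 8, replan_every: int = 4) -> list:
    """Execute only the first part of each plan, then replan from the measured state."""
    assert 0 < replan_every <= horizon
    trajectory, plan = [state], []
    for t in range(total_steps):
        if t % replan_every == 0:
            # Replanning from the actual state absorbs accumulated tracking error.
            plan = stub_generator(state, target, horizon)
        state = stub_tracker(state, plan[t % replan_every])
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    traj = run_receding_horizon(state=0.0, target=1.0, total_steps=40)
    print(f"final position: {traj[-1]:.4f}")
```

The point of the pattern is that the plan is always regenerated from where the system actually is, so the tail of each plan is discarded before tracking errors can compound.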
Figure 2: Schematic of the training pipeline. Motion data is curated, the diffusion generator and RL tracker are pre-trained, and the tracker is subsequently fine-tuned within the closed-loop system.
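For readers unfamiliar with DeepMimic-style tracking objectives, a minimal sketch follows. It shows the general form, a weighted sum of exponentiated negative squared errors between the robot's state and the reference, restricted to joint positions and velocities. The function name, weights, and error scales are illustrative assumptions and are not the values used in the paper, which may also include additional terms.

```python
import math
from typing import Sequence


def tracking_reward(q: Sequence[float], q_ref: Sequence[float],
                    dq: Sequence[float], dq_ref: Sequence[float],
                    w_pose: float = 0.7, w_vel: float = 0.3,
                    k_pose: float = 2.0, k_vel: float = 0.1) -> float:
    """Imitation reward in (0, 1]; weights and scales are assumed, not from the paper."""
    pose_err = sum((a - b) ** 2 for a, b in zip(q, q_ref))    # squared joint-position error
    vel_err = sum((a - b) ** 2 for a, b in zip(dq, dq_ref))   # squared joint-velocity error
    return w_pose * math.exp(-k_pose * pose_err) + w_vel * math.exp(-k_vel * vel_err)


if __name__ == "__main__":
    q_ref, dq_ref = [0.1, -0.3, 0.5], [0.0, 0.2, -0.1]
    print(tracking_reward(q_ref, q_ref, dq_ref, dq_ref))            # perfect tracking
    print(tracking_reward([0.3, -0.1, 0.2], q_ref, dq_ref, dq_ref))  # pose error lowers reward
```

Because each term is bounded and decays smoothly with error, the reward stays informative even when tracking is poor early in training.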
Experimental Results
Hardware Validation
The integrated system is validated on the Unitree G1 robot in a range of physically challenging scenarios: high box climbing and descent (up to 75 cm), vaulting over obstacles of variable height, continuous stair traversal, and compound terrain sequences not observed during training or fine-tuning. The robot demonstrates emergent adaptive behaviors, using its hands, knees, and dynamic reorientation to negotiate complex transitions. Its local navigation capabilities are also nontrivial: when it starts from an unfavorable state relative to the obstacle and target, the system deviates from the generator’s reference to circumvent the obstacle, illustrating closed-loop task adaptation.
Figure 3: Hardware experiment outcomes. The robot climbs boxes, executes jumps, traverses stairs, clears vaults, and navigates compound terrains, demonstrating versatile whole-body skills.
Quantitative Analysis
Simulation-based ablation studies reveal two decisive findings:
Theoretical and Practical Implications
This system advances the state of the art in integrating generative models with policy optimization for real robot control. The results substantiate that diffusion-based motion generation, when tightly coupled with closed-loop RL fine-tuning, enables robust and expressive whole-body motion across unpredictable and diverse terrain. Notably, the system avoids cumbersome hand-engineered skill composition, instead leveraging unified motion priors for rapid adaptation.
On the theoretical front, the framework highlights how critical closed-loop adaptation is for handling perceptual uncertainties and model errors, pointing to a general design principle for deploying expressive generative models in hardware-constrained settings. The observed emergent behaviors underline the importance of interleaving policy optimization with structurally rich motion data, a concept that extends naturally to loco-manipulation tasks and scalable skill repertoires.
Limitations and Future Directions
While the presented approach demonstrates broad terrain generalization, it is bottlenecked by the fidelity of onboard LiDAR-based terrain reconstruction. Failures in perception can severely undermine the motion planning chain, motivating future research toward robust neural or multi-modal environmental representations, similar to methods involving attention-based map encoders or proprioceptive error correction. Additionally, extending the framework toward real-time loco-manipulation and multi-agent whole-body interactions is a promising avenue.
Conclusion
This work presents a validated and deployable architecture for whole-body humanoid locomotion combining diffusion-based motion generation with reinforcement-learned motion tracking. The system robustly executes diverse, context-sensitive whole-body skills across nontrivial terrains, with extensive analyses underscoring the contribution of both trajectory generation and policy fine-tuning to generalization and robustness. Future exploration of more perceptually robust and general skill frameworks is warranted, aiming for fully autonomous operation in highly unstructured real-world settings.