- The paper presents a terrain adaptation approach built on Terrain-Conformal Reference Synthesis (TCRS), which synthesizes terrain-specific motion references from human motion clips.
- It employs a Transformer-based teacher policy and distillation into a vision-conditioned student to achieve robust, real-time terrain-adjusted control.
- Empirical results show that TCRS reduces mean foot penetration depth by 56.6% and swing-clearance violations by 48.3% relative to collision-agnostic baselines, and a single policy is deployed on a real humanoid across diverse terrains.
Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain
Motivation and Problem Statement
Generalist behavior foundation models (BFMs) have propelled whole-body humanoid control by leveraging diverse human motion priors for expressive motion tracking via single-policy architectures. However, most prior approaches assume reference motions are inherently compatible with the robot's immediate terrain context, an assumption that breaks down when demonstrators, operators, and robots inhabit distinct environmental conditions. Because of this mismatch, human-provided commands do not specify the terrain-based constraints (footholds, swing clearance, body posture, and contact timing) needed for environmental feasibility. The challenge addressed is perceptual grounding of human motion priors: adapting them in real time to the robot's local world without altering the raw command interface.
Methodology: PMT Training Pipeline and TCRS Synthesis
The Perceptive Behavior Foundation Model (Perceptive BFM) introduces a modular, four-stage Perceptive Motion Tracking (PMT) procedure:
- Terrain-Conformal Reference Synthesis (TCRS): TCRS synthesizes terrain-consistent references offline from raw human motion clips and sampled height fields through multi-component processing: contact-aware foothold construction, foot-geometry-aware swing optimization using mid-foot framing, support-aware root reconstruction, collision repair, and multi-point inverse kinematics (IK). The process explicitly rewires lower-body contacts for terrain compliance while preserving upper-body motion style.
- Blind Teacher Training: A Transformer-based teacher policy, devoid of terrain perception, is trained on TCRS-adapted references using PPO, acquiring terrain-conformal motion-tracking capabilities within the privileged command context.
- Distillation into Raw-Reference Student: The vision-conditioned student receives raw reference commands and robot-centric terrain perception. Distillation aligns the teacher's terrain-informed action in the student's raw-reference frame via target-frame action alignment, transferring the teacher’s adapted behavior.
- Identity-Gated Residual Fine-Tuning: The student policy employs identity-gated residual pathways, initialized to zero, enabling the terrain branch to learn local corrections progressively without perturbing the inherited motion-tracking prior.
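The identity-gating idea in the last stage can be illustrated with a minimal sketch. The source states only that the residual pathways are initialized to zero; the scalar gate, the additive form `base + gate * correction`, and the names `GatedResidual` and `toy_terrain_branch` are assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]


class GatedResidual:
    """Adds a gated correction to a base output: out = base + gate * correction."""

    def __init__(self, correction_fn: Callable[[Sequence[float]], Vector],
                 gate: float = 0.0) -> None:
        self.correction_fn = correction_fn  # hypothetical terrain branch
        self.gate = gate                    # zero at initialization -> identity map

    def __call__(self, base: Sequence[float],
                 terrain_features: Sequence[float]) -> Vector:
        correction = self.correction_fn(terrain_features)
        return [b + self.gate * c for b, c in zip(base, correction)]


def toy_terrain_branch(terrain_features: Sequence[float]) -> Vector:
    # Stand-in for a learned network: maps the mean height to a 3-D correction.
    mean_height = sum(terrain_features) / len(terrain_features)
    return [mean_height, -mean_height, 0.5 * mean_height]


base_action = [0.10, -0.20, 0.30]   # output of the inherited raw-reference tracker
heights = [0.00, 0.05, 0.10, 0.15]  # toy terrain features

layer = GatedResidual(toy_terrain_branch)
at_init = layer(base_action, heights)       # gate == 0: base passes through unchanged
layer.gate = 0.5                            # pretend fine-tuning has opened the gate
after_tuning = layer(base_action, heights)  # terrain correction now contributes
```

With the gate at zero the layer returns the base action exactly, so the terrain branch cannot disturb the inherited tracker until training moves the gate away from zero.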
Figure 1: Single-policy terrain grounding: the Perceptive BFM adapts flat-ground human motion commands to robot-centric terrains, adjusting footholds, swing clearance, posture, and contact timing online.
TCRS improves reference feasibility, reducing penetration depth and clearance violation relative to collision-agnostic baselines while preserving motion style (see the TCRS Reference Quality section below).
Policy Architecture and Observation Contract
The deployed policy leverages a Transformer backbone to encode proprioceptive histories and raw command windows, integrating terrain perception via a torso-centered ray-cast height map encoder. Terrain features enter the actor via two residual pathways: modulating motion intent and action output with identity-gating, assuring a pure raw-reference tracker at network initialization.
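A torso-centered height map of the kind described above could be sampled roughly as follows. This is a sketch under stated assumptions: vertical ray casts are replaced by queries to a hypothetical `terrain_height(x, y)` function, and the grid extent, resolution, and the choice to express heights relative to the torso are illustrative rather than taken from the paper.

```python
import math
from typing import Callable, List


def sample_height_map(terrain_height: Callable[[float, float], float],
                      torso_x: float, torso_y: float, torso_z: float,
                      torso_yaw: float,
                      half_extent: float = 0.5,
                      resolution: float = 0.25) -> List[List[float]]:
    """Sample terrain heights on a square grid fixed to the torso's yaw frame."""
    n = int(round(2 * half_extent / resolution)) + 1
    cos_y, sin_y = math.cos(torso_yaw), math.sin(torso_yaw)
    grid: List[List[float]] = []
    for i in range(n):
        local_x = -half_extent + i * resolution
        row: List[float] = []
        for j in range(n):
            local_y = -half_extent + j * resolution
            # Rotate the torso-frame offset into the world frame, then translate.
            world_x = torso_x + cos_y * local_x - sin_y * local_y
            world_y = torso_y + sin_y * local_x + cos_y * local_y
            # Store height relative to the torso so the map is robot-centric.
            row.append(terrain_height(world_x, world_y) - torso_z)
        grid.append(row)
    return grid


def step_terrain(x: float, y: float) -> float:
    # Toy terrain: a 15 cm step starting 20 cm ahead of the world origin.
    return 0.15 if x > 0.2 else 0.0


height_map = sample_height_map(step_terrain, 0.0, 0.0, 0.8, 0.0)
```

Because the grid rotates with the torso yaw, the same terrain produces the same map regardless of the robot's heading, which is what makes the observation robot-centric.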
Key differences from prior work are: keeping raw clip commands at deployment and using terrain-conformal adaptations only during training; using residual policies for conservative adaptation; and separating terrain effects from command tracking through explicit architectural gating.
Empirical Evaluation: Reference Quality, Policy Ablations, and Deployment
TCRS Reference Quality
TCRS outperforms the baselines on terrain-grounding metrics. On stepping stones with stairs, it reduces mean foot penetration depth to 2.38 cm (from 5.48 cm with Z-offset projection) and swing-clearance violation to 7.4% (from 14.3% with cubic interpolation + IK), with minimal upper-body deviation. These results indicate that TCRS produces environment-consistent adaptations without sacrificing motion style.
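The summary does not define the two metrics precisely, so the following is only one plausible reading: penetration depth as the per-frame amount by which the foot sits below the terrain, averaged over frames, and clearance violation as the fraction of swing frames whose foot-to-terrain gap falls under a margin. The function names, the averaging over all frames, and the 3 cm margin are assumptions.

```python
from typing import Sequence


def mean_penetration_depth(foot_heights: Sequence[float],
                           terrain_heights: Sequence[float]) -> float:
    """Average depth (same units as inputs) by which the foot is below the terrain."""
    depths = [max(0.0, t - f) for f, t in zip(foot_heights, terrain_heights)]
    return sum(depths) / len(depths)


def clearance_violation_rate(foot_heights: Sequence[float],
                             terrain_heights: Sequence[float],
                             in_swing: Sequence[bool],
                             min_clearance: float) -> float:
    """Fraction of swing frames whose foot-terrain gap is below min_clearance."""
    gaps = [f - t for f, t, s in zip(foot_heights, terrain_heights, in_swing) if s]
    if not gaps:
        return 0.0
    return sum(1 for g in gaps if g < min_clearance) / len(gaps)


# Toy five-frame trajectory (metres); values are made up for illustration.
foot_z = [0.00, 0.02, 0.10, 0.04, 0.00]
terrain_z = [0.00, 0.00, 0.05, 0.05, 0.02]
swing = [False, True, True, True, False]

penetration = mean_penetration_depth(foot_z, terrain_z)
violation = clearance_violation_rate(foot_z, terrain_z, swing, min_clearance=0.03)
```

In this toy trajectory two frames penetrate the terrain and two of the three swing frames fall below the margin.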
In relative terms, TCRS cuts penetration depth by 56.6% and clearance violation by 48.3%, a notable improvement in collision-sensitive metrics.
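The relative reductions follow directly from the absolute figures reported for stepping stones with stairs. A short check, where the helper name `relative_reduction_pct` is introduced only for this illustration:

```python
def relative_reduction_pct(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# Mean foot penetration depth in cm: Z-offset projection vs. TCRS.
penetration_cut = relative_reduction_pct(5.48, 2.38)
# Swing-clearance violation in %: cubic interpolation + IK vs. TCRS.
clearance_cut = relative_reduction_pct(14.3, 7.4)

print(f"penetration: {penetration_cut:.1f}%")  # penetration: 56.6%
print(f"clearance:   {clearance_cut:.1f}%")    # clearance:   48.3%
```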
Training-Time Ablations
Training reward diagnostics show that terrain perception is essential for successful grounding: removing vision drops reward from 54.6 to 3.6. Replacing the Transformer backbone causes much smaller drops (mean reward reductions of 5–8 points), which suggests that capacity and architecture matter less than perceptual input. Distillation via target-frame alignment also contributes: removing it reduces reward by 4.5 points.
Real-Robot Deployment
Deployment on a Unitree G1 humanoid demonstrates a single policy tracking locomotion, acrobatics, and teleoperation motions across random terrains—stairs, slopes, sparse supports, grass, and indoor/outdoor obstacles. The policy adapts both foot placement and whole-body execution in response to local perception, and effectively handles operator–environment mismatch, with human mocap commands captured on flat surfaces successfully executed by the robot on complex terrain.
Figure 1: A single Perceptive BFM policy adapts diverse human motion commands to randomly placed terrain layouts, including acrobatic and expressive behaviors.
Practical and Theoretical Implications
The Perceptive BFM architecture advances the paradigm of terrain-aware robot control by:
- Maintaining the raw motion command as the behavioral interface, facilitating broad compatibility and controller reuse.
- Enabling real-time adaptation to arbitrary robot-side terrain without requiring environment-specific controllers or command retargeting at deployment.
- Providing a scalable TCRS mechanism for offline supervision, improving reference feasibility in diverse environments, critical for future large-scale humanoid deployment.
- Introducing identity-gated residuals as a general architectural pattern that could be applied to other perception-conditioned control domains.
Theoretically, the model decouples motion intent from environmental realization, enabling principled analysis of generalization and residual learning in the context of exteroception-conditioned robotic behavior.
Limitations and Future Directions
Current adaptation centers on lower-body contacts and leaves upper-body motion unchanged; this can lead to collision failures, since the arms or torso may strike obstacles in the environment. TCRS assumes static, rigid, and observable height fields, and does not adapt to deformable, granular, or slippery media. Future directions include collision-aware upper-body adaptation, systematic evaluation of deployed rollouts, and expansion to more complex terrain types.
Conclusion
Perceptive BFM establishes a single-policy, terrain-aware humanoid motion tracking approach, leveraging raw human motion priors augmented by robot-centric perception. TCRS yields style-preserving, terrain-conformal reference supervision, and PMT transfers adapted behavior into a vision-conditioned student policy with identity-gated residual correction. Results indicate that robot-centric perception suffices for environment compatibility while maintaining a broad behavioral interface, opening new avenues for scalable foundation models in physically diverse deployments.
Reference: "Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain" (2606.08059)