Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain

Published 6 Jun 2026 in cs.RO | (2606.08059v1)

Abstract: Humanoid behavior foundation models aim to acquire reusable whole-body control policies from broad human motion priors, enabling a single controller to produce diverse and expressive behaviors. However, existing motion-centric foundation policies largely assume that the reference motion is already physically compatible with the robot's surroundings. This assumption breaks when the demonstrator, operator, and robot inhabit different environments: a human motion may specify the intended behavior, but not the footholds, clearance, body height, or contact timing required by the robot's local terrain. We introduce Perceptive Behavior Foundation Model (Perceptive BFM), a terrain-aware humanoid control framework that grounds human motion priors in robot-centric perception. The model preserves raw kinematic motion references as the behavioral interface, while using local terrain observations to adapt contacts, posture, and timing. To provide scalable terrain supervision, we develop terrain-conformal reference synthesis (TCRS), which converts locomotion-oriented human motion clips into terrain-consistent references through contact-aware foothold construction, foot-geometry-aware swing optimization, support-aware root reconstruction, collision repair, and multi-point inverse kinematics. We then train a blind adapted-reference teacher and transfer its terrain-conformal behavior to a deployed raw-reference student through target-frame action alignment. The student is an identity-gated Transformer tracker whose terrain features enter through residual pathways initialized to preserve the motion-tracking prior and trained to produce local corrections only when needed.


Authors (8)

Summary

The paper presents a terrain adaptation approach built on terrain-conformal reference synthesis (TCRS), which synthesizes terrain-consistent motion references from human motion clips.
It employs a Transformer-based teacher policy and distillation into a vision-conditioned student to achieve robust, real-time terrain-adjusted control.
On stepping stones with stairs, TCRS reduces mean foot penetration depth by 56.6% and swing-clearance violations by 48.3% relative to the respective baselines, while keeping upper-body deviation minimal.

Motivation and Problem Statement

Generalist behavior foundation models (BFMs) have advanced whole-body humanoid control by using diverse human motion priors for expressive motion tracking with a single policy. However, most prior approaches assume that reference motions are already compatible with the robot's immediate terrain, an assumption that fails when demonstrators, operators, and robots occupy different environments. Under this mismatch, human-provided commands do not specify the terrain-dependent constraints needed for feasibility: footholds, swing clearance, body posture, and contact timing. The problem addressed is the perceptual grounding of human motion priors: adapting them in real time to the robot's local surroundings without altering the raw command interface.

Methodology: PMT Training Pipeline and TCRS Synthesis

The Perceptive Behavior Foundation Model (Perceptive BFM) introduces a modular, four-stage Perceptive Motion Tracking (PMT) procedure:

Terrain-Conformal Reference Synthesis (TCRS): TCRS synthesizes terrain-consistent references offline from raw human motion clips and sampled height fields through multi-component processing: contact-aware foothold construction, foot-geometry-aware swing optimization using mid-foot framing, support-aware root reconstruction, collision repair, and multi-point inverse kinematics (IK). The process explicitly rewires lower-body contacts for terrain compliance while preserving upper-body motion style.
Blind Teacher Training: A Transformer-based teacher policy with no terrain perception is trained with PPO to track TCRS-adapted references. Terrain information reaches the teacher only through these privileged, already-adapted commands, from which it acquires terrain-conformal motion tracking.
Distillation into Raw-Reference Student: The vision-conditioned student receives raw reference commands and robot-centric terrain perception. Distillation uses target-frame action alignment to express the teacher's terrain-conformal action in the student's raw-reference frame, transferring the teacher's adapted behavior.
Identity-Gated Residual Fine-Tuning: The student policy employs identity-gated residual pathways, initialized to zero, enabling the terrain branch to learn local corrections progressively without perturbing the inherited motion-tracking prior.
Figure 1: Single-policy terrain grounding: the Perceptive BFM adapts flat-ground human motion commands to robot-centric terrains, adjusting footholds, swing clearance, posture, and contact timing online.

TCRS improves reference feasibility relative to collision-agnostic baselines, reducing mean foot penetration depth by 56.6% and swing-clearance violation by 48.3% while preserving motion style (see the TCRS Reference Quality section).

Policy Architecture and Observation Contract

The deployed policy uses a Transformer backbone to encode proprioceptive histories and raw command windows, and integrates terrain perception through a torso-centered ray-cast height map encoder. Terrain features enter the actor via two identity-gated residual pathways, one modulating the motion intent and one modulating the action output, so that the network behaves as a pure raw-reference tracker at initialization.
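As a rough illustration of what a torso-centered height scan could look like, the sketch below samples terrain heights on a yaw-aligned grid around the torso and expresses them relative to the torso height. The function name `torso_height_scan`, the grid size, the extent, and the use of an analytic `terrain_fn` with vertical rays are all assumptions made for illustration; the summary does not specify the sensor model or the encoder input layout.

```python
import numpy as np

def torso_height_scan(terrain_fn, torso_xy, torso_z, yaw, half_extent=0.5, n=5):
    """Sample an n x n grid of terrain heights around the torso.

    Hypothetical helper: grid points are laid out in the torso frame,
    rotated by the torso yaw into the world frame, and looked up with a
    vertical ray (here simply a call to terrain_fn). Heights are returned
    relative to the torso height.
    """
    offsets = np.linspace(-half_extent, half_extent, n)
    gx, gy = np.meshgrid(offsets, offsets, indexing="ij")  # torso-frame grid
    c, s = np.cos(yaw), np.sin(yaw)
    wx = torso_xy[0] + c * gx - s * gy  # rotate and translate into world frame
    wy = torso_xy[1] + s * gx + c * gy
    return terrain_fn(wx, wy) - torso_z

# Example terrain: a 10 cm step for world x > 0.2 m (assumed, for illustration).
def step_terrain(x, y):
    return np.where(x > 0.2, 0.1, 0.0)

scan = torso_height_scan(step_terrain, torso_xy=(0.0, 0.0), torso_z=0.8, yaw=0.0)
```

Expressing heights relative to the torso keeps the observation robot-centric, so the same terrain shape produces the same scan regardless of where the robot stands in the world.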

Key differences from prior work are that raw clip commands remain the deployment interface, with terrain-conformal adaptations used only during training; that residual pathways make the adaptation conservative; and that explicit architectural gating separates terrain effects from command tracking.
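The identity-gating idea can be sketched in a few lines: a terrain-conditioned correction is added to a base signal through a gate that starts at zero, so the pathway is an identity map at initialization and only departs from it as the gate is trained. The class name `GatedResidual`, the per-channel gate, the single linear layer, and the `tanh` nonlinearity are assumptions for illustration, not the architecture reported in the paper.

```python
import numpy as np

class GatedResidual:
    """y = base + gate * f(terrain), with the gate initialized to zero.

    Hypothetical sketch: at initialization the output equals `base`, so the
    terrain branch cannot perturb the inherited tracking behavior until the
    gate is moved away from zero during training.
    """

    def __init__(self, terrain_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(terrain_dim, out_dim))
        self.b = np.zeros(out_dim)
        self.gate = np.zeros(out_dim)  # zero-initialized, learnable in practice

    def __call__(self, base, terrain_feat):
        correction = np.tanh(terrain_feat @ self.W + self.b)  # bounded correction
        return base + self.gate * correction

# Two pathways, mirroring the intent and action residuals (sizes are assumed).
intent_path = GatedResidual(terrain_dim=8, out_dim=16, seed=0)
action_path = GatedResidual(terrain_dim=8, out_dim=4, seed=1)

terrain_feat = np.linspace(-1.0, 1.0, 8)
base_action = np.ones(4)
action_at_init = action_path(base_action, terrain_feat)  # equals base_action
```

Because the correction is bounded and scaled by the gate, the magnitude of any terrain-driven change is limited by the gate value, which is one way to keep adaptation local and conservative.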

Empirical Evaluation: Reference Quality, Policy Ablations, and Deployment

TCRS Reference Quality

TCRS improves terrain-grounding metrics over the baselines. On stepping stones with stairs, it reduces mean foot penetration depth to 2.38 cm (from 5.48 cm with Z-offset projection) and swing-clearance violation to 7.4% (from 14.3% with cubic interpolation + IK), with minimal upper-body deviation. These results indicate that TCRS achieves environment-consistent adaptations without sacrificing motion style.
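To make the two metrics concrete, here is one plausible way to compute them from sampled foot and terrain heights. The exact definitions used in the paper are not given in this summary, so the averaging over all samples, the `min_clearance` threshold, and both function names are assumptions; the toy data below is invented purely to exercise the code.

```python
import numpy as np

def mean_penetration_depth(foot_z, terrain_z):
    """Mean depth by which foot samples lie below the terrain surface.

    Assumed definition: samples above the surface contribute zero, and the
    mean is taken over all samples. Units follow the inputs.
    """
    depth = np.maximum(0.0, np.asarray(terrain_z) - np.asarray(foot_z))
    return float(depth.mean())

def swing_clearance_violation_rate(foot_z, terrain_z, in_swing, min_clearance=0.02):
    """Fraction of swing-phase samples with clearance below min_clearance.

    Assumed definition: clearance is foot height minus terrain height, and
    only samples flagged as swing phase are counted.
    """
    clearance = np.asarray(foot_z) - np.asarray(terrain_z)
    swing = np.asarray(in_swing, dtype=bool)
    if not swing.any():
        return 0.0
    return float(np.mean(clearance[swing] < min_clearance))

# Toy trajectory (heights in meters, invented for illustration).
foot_z = [0.00, 0.05, 0.10, 0.01, 0.00]
terrain_z = [0.02, 0.00, 0.09, 0.00, 0.00]
in_swing = [False, True, True, True, False]

penetration = mean_penetration_depth(foot_z, terrain_z)
violation = swing_clearance_violation_rate(foot_z, terrain_z, in_swing)
```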

In relative terms, TCRS cuts mean penetration depth by 56.6% (versus Z-offset projection) and clearance violation by 48.3% (versus cubic interpolation + IK).
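The relative reductions follow directly from the absolute figures reported above. The short check below recomputes them; `relative_reduction` is a helper name introduced here for illustration.

```python
def relative_reduction(baseline, value):
    """Percentage reduction of `value` relative to `baseline`."""
    return 100.0 * (baseline - value) / baseline

# Absolute figures reported for stepping stones with stairs.
penetration_cut = relative_reduction(5.48, 2.38)  # cm, vs. Z-offset projection
clearance_cut = relative_reduction(14.3, 7.4)     # %, vs. cubic interpolation + IK

print(f"penetration: {penetration_cut:.1f}%  clearance: {clearance_cut:.1f}%")
# prints "penetration: 56.6%  clearance: 48.3%"
```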

Training-Time Ablations

Training reward diagnostics indicate that terrain perception is essential for grounding: removing vision drops reward from 54.6 to 3.6, which suggests that capacity and architecture matter less than perceptual input. Replacing the Transformer backbone yields smaller drops (mean reward reductions of 5–8 points). Distillation via target-frame alignment also contributes: removing it reduces reward by 4.5 points.

Real-Robot Deployment

Deployment on a Unitree G1 humanoid demonstrates a single policy tracking locomotion, acrobatics, and teleoperation motions across randomly placed terrains, including stairs, slopes, sparse supports, grass, and indoor and outdoor obstacles. The policy adapts both foot placement and whole-body execution in response to local perception and handles operator-environment mismatch: human mocap commands captured on flat surfaces are executed by the robot on complex terrain.

Figure 1: A single Perceptive BFM policy adapts diverse human motion commands to randomly placed terrain layouts, including acrobatic and expressive behaviors.

Practical and Theoretical Implications

The Perceptive BFM architecture advances the paradigm of terrain-aware robot control by:

Maintaining the raw motion command as the behavioral interface, facilitating broad compatibility and controller reuse.
Enabling real-time adaptation to arbitrary robot-side terrain without requiring environment-specific controllers or command retargeting at deployment.
Providing a scalable TCRS mechanism for offline supervision, improving reference feasibility in diverse environments, critical for future large-scale humanoid deployment.
Introducing identity-gated residuals as an architectural pattern that could be applied to other perception-conditioned control domains.

Theoretically, the model decouples motion intent from environmental realization, enabling principled analysis of generalization and residual learning in the context of exteroception-conditioned robotic behavior.

Limitations and Future Directions

Current adaptation centers on lower-body contacts and leaves upper-body motion unchanged; this can lead to collision failures, since the arms or torso may strike obstacles. TCRS assumes static, rigid, and observable height fields and does not address deformable, granular, or slippery media. Future directions include collision-aware upper-body adaptation, systematic evaluation of deployed rollouts, and expansion to more complex terrain types.

Conclusion

Perceptive BFM establishes a single-policy, terrain-aware humanoid motion tracking approach, leveraging raw human motion priors augmented by robot-centric perception. TCRS yields style-preserving, terrain-conformal reference supervision, and PMT transfers adapted behavior into a vision-conditioned student policy with identity-gated residual correction. Results indicate that robot-centric perception suffices for environment compatibility while maintaining a broad behavioral interface, opening new avenues for scalable foundation models in physically diverse deployments.

Reference: "Perceptive Behavior Foundation Model: Adapting Human Motion Priors to Robot-Centric Terrain" (2606.08059)
