Behavior Foundation Model (BFM)

Updated 18 September 2025
  • Behavior Foundation Model (BFM) is a unified generative framework that extracts and generalizes whole-body behavioral patterns from large-scale datasets for humanoid robotics.
  • It integrates goal-conditioned reinforcement learning with a conditional variational autoencoder (CVAE) and masked online distillation to transform diverse control signals into coherent motion commands.
  • Empirical results show robust cross-task generalization, zero-shot behavior acquisition, and versatile application in teleoperation, motion tracking, and dynamic locomotion.

Behavior Foundation Model (BFM) is a generative framework designed to unify and generalize behavioral control across diverse whole-body tasks for humanoid robots. Distinguished from conventional, task-specialist models that rely on extensive reward engineering and isolated policy definitions, BFM is pretrained on large-scale behavioral datasets to extract broad, reusable behavioral knowledge. This enables robust cross-task generalization, rapid adaptation to novel behaviors, and flexible operation under arbitrary control modes (e.g., velocity commands, motion tracking, teleoperation), positioning BFM as a candidate for universal humanoid control systems (Zeng et al., 17 Sep 2025).

1. Unification of Whole-Body Control via Behavioral Representation

Central to BFM is the recognition that all control modalities—be they direct velocity commands, VR-based teleoperation signals, or motion-tracking references—are ultimately instantiated as the generation of context-appropriate behaviors that guide the robot toward task goals. Rather than partitioning the control space by modality or task, BFM models the distribution $P(\tau)$ over full-body behavioral trajectories, treating goal states and proprioceptive signals as conditioning inputs.

This unified behavioral representation is achieved through generative modeling, which decouples the interpretation of control signals from the policy execution layer. Consequently, novel control interfaces and unseen tasks can be abstracted as instantiations within the learned behavioral manifold.
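
As a concrete illustration of this abstraction, the sketch below packs different control modalities into a single goal vector with a validity mask. The slot layout, names, and dimensions are assumptions for illustration only, not the interface defined in the paper.

```python
import numpy as np

# Hypothetical slot layout of a unified goal vector; the paper's actual
# interface is not reproduced here, so names and dimensions are illustrative.
GOAL_SLOTS = {
    "root_velocity": slice(0, 3),        # vx, vy, yaw rate (velocity commands)
    "keypoint_targets": slice(3, 27),    # 8 tracked keypoints x 3D (motion tracking)
    "hand_poses": slice(27, 41),         # wrist poses from a VR teleoperation device
}
GOAL_DIM = 41

def make_goal(active: dict) -> tuple[np.ndarray, np.ndarray]:
    """Pack whichever control signals are available into one goal vector,
    plus a binary mask marking which slots are actually commanded."""
    goal = np.zeros(GOAL_DIM, dtype=np.float32)
    mask = np.zeros(GOAL_DIM, dtype=np.float32)
    for name, values in active.items():
        goal[GOAL_SLOTS[name]] = values
        mask[GOAL_SLOTS[name]] = 1.0
    return goal, mask

# A pure velocity command and a keypoint-tracking reference are just two
# instantiations of the same conditioning input seen by the policy.
vel_goal, vel_mask = make_goal({"root_velocity": [0.5, 0.0, 0.1]})
```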

2. Technical Framework: Masked Online Distillation and CVAE Integration

BFM is realized using a goal-conditioned reinforcement learning (RL) paradigm and is architecturally built upon a Conditional Variational Autoencoder (CVAE). Behavioral trajectories are encoded as state–action pairs $(s_t^{(p,\text{real})}, a_t)$, with goal states $s_t^{(g,\text{real})}$ introduced as latent variables.

  • Objective Function: The pretraining objective seeks to maximize the expected log-likelihood:

$$\max_{\theta}\ \mathbb{E}_{(s_t^{(p,\text{real})},\, a_t) \in \mathcal{D}} \left[\log \pi_\theta\left(a_t \mid s_t^{(p,\text{real})}\right)\right]$$

where $\pi_\theta$ is the policy parameterized by $\theta$.
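
A minimal PyTorch sketch of this pretraining objective, assuming a Gaussian policy head; the state/action dimensions and network sizes are illustrative rather than the paper's architecture.

```python
import torch
import torch.nn as nn

# Behavior-cloning form of the objective: maximize log pi_theta(a_t | s_t^(p,real))
# over the dataset of state-action pairs.
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def log_prob(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        h = self.backbone(state)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        return dist.log_prob(action).sum(-1)

policy = GaussianPolicy(state_dim=69, action_dim=29)   # dimensions illustrative
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

def pretrain_step(states: torch.Tensor, actions: torch.Tensor) -> float:
    loss = -policy.log_prob(states, actions).mean()    # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```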

  • Goal Marginalization: The likelihood is further conditioned on goal states via marginalization:

$$\log \pi_\theta\left(a_t \mid s_t^{(p,\text{real})}\right) = \log \mathbb{E}_{s_t^{(g,\text{real})} \sim p\left(s_t^{(g,\text{real})} \mid s_t^{(p,\text{real})}\right)} \left[\pi_\theta\left(a_t \mid s_t^{(p,\text{real})}, s_t^{(g,\text{real})}\right)\right]$$

Applying Jensen’s inequality (moving the logarithm inside the expectation) yields a lower bound suitable for optimization.

  • CVAE Evidence Lower Bound (ELBO): The model maximizes

$$\mathbb{E}_{q\left(z \mid s_t^{(p,\text{sim})},\, s_t^{(g,\text{sim})}\right)} \Big[ \log P\left(a_t \mid s_t^{(p,\text{real})}, s_t^{(g,\text{real})}, z\right) - D_{\mathrm{KL}}\Big(q\left(z \mid s_t^{(p,\text{sim})}, s_t^{(g,\text{sim})}\right) \,\Big\|\, P\left(z \mid s_t^{(p,\text{real})}, s_t^{(g,\text{real})}\right)\Big) \Big]$$

with all CVAE components (prior, encoder, decoder) modeled as Gaussian distributions.
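
Below is a schematic PyTorch sketch of this ELBO with Gaussian prior, posterior (encoder), and decoder. The conditioning inputs are flattened into single vectors, and all module names and sizes are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.distributions as D

# Schematic CVAE conditioned on concatenated proprioceptive + goal states,
# with the sim/real conditioning split following the ELBO above.
class BehaviorCVAE(nn.Module):
    def __init__(self, cond_dim: int, action_dim: int, latent_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ELU(),
                                     nn.Linear(hidden, 2 * latent_dim))
        self.prior = nn.Sequential(nn.Linear(cond_dim, hidden), nn.ELU(),
                                   nn.Linear(hidden, 2 * latent_dim))
        self.decoder = nn.Sequential(nn.Linear(cond_dim + latent_dim, hidden), nn.ELU(),
                                     nn.Linear(hidden, 2 * action_dim))

    @staticmethod
    def _gaussian(params: torch.Tensor) -> D.Normal:
        mean, log_std = params.chunk(2, dim=-1)
        return D.Normal(mean, log_std.clamp(-5.0, 2.0).exp())

    def elbo(self, cond_sim: torch.Tensor, cond_real: torch.Tensor,
             action: torch.Tensor) -> torch.Tensor:
        q = self._gaussian(self.encoder(cond_sim))    # q(z | s_sim_p, s_sim_g)
        p = self._gaussian(self.prior(cond_real))     # P(z | s_real_p, s_real_g)
        z = q.rsample()                               # reparameterized latent sample
        dec = self._gaussian(self.decoder(torch.cat([cond_real, z], dim=-1)))
        recon = dec.log_prob(action).sum(-1)          # log P(a_t | s_real, z)
        kl = D.kl_divergence(q, p).sum(-1)            # D_KL(q || P)
        return (recon - kl).mean()                    # maximize this quantity
```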

  • Masked Online Distillation: To accommodate diverse control modes (joint commands, velocities, etc.), BFM applies a bit-wise binary mask to a unified control interface. Masks are sampled from a Bernoulli distribution with an annealed probability, supporting arbitrary combinations of control signals and eliminating the need for task-specific mask templates (a sampling sketch follows).
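
A minimal sketch of the mask sampling, assuming a flat unified goal vector and a linear annealing schedule; both are assumptions, as the paper's exact schedule is not reproduced here.

```python
import torch

# Bit-wise Bernoulli masking over the unified control interface. The keep
# probability is annealed over training so that arbitrary subsets of control
# signals are eventually covered.
def sample_control_mask(batch_size: int, goal_dim: int,
                        step: int, total_steps: int) -> torch.Tensor:
    progress = min(step / max(total_steps, 1), 1.0)
    keep_prob = 1.0 - 0.5 * progress        # anneal away from fully specified goals
    return torch.bernoulli(torch.full((batch_size, goal_dim), keep_prob))

# During online distillation the student policy only sees the masked goal;
# unspecified entries are zeroed (or replaced by a learned "unknown" token).
goal = torch.randn(64, 41)
mask = sample_control_mask(64, 41, step=10_000, total_steps=100_000)
masked_goal = goal * mask
```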

3. Dataset Preparation and Behavioral Knowledge Extraction

Training data is derived from high-fidelity human motion datasets (e.g., the AMASS corpus). The retargeting from SMPL human mesh to humanoid morphology is accomplished in two optimization stages: matching rest-pose geometry and refining full-body motion parameters (translation, orientation, joint angles).

Rather than using pre-filtered trajectories, BFM employs a proxy agent trained via imitation learning in simulation, which generates online behavior samples. This approach incorporates:

  • Reward design via weighted sums of task objectives, penalties, and regularization terms (a schematic sketch follows this list);
  • Curriculum learning that initially prioritizes imitation rewards and gradually phases in penalty terms;
  • Domain randomization (e.g., variable dynamics, perturbations) and hard negative mining to maximize robustness and skill diversity;
  • Motion filtering to prevent destabilization from persistently unsuccessful segments.
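
The following sketch shows how such a weighted-sum reward with a curriculum over penalty weights might look; all term names, weights, and the ramp length are hypothetical.

```python
# Schematic proxy-agent reward: a weighted sum of imitation objectives, task
# penalties, and regularizers, with penalty weights ramped in by a curriculum.
def proxy_reward(terms: dict, step: int, curriculum_steps: int = 50_000) -> float:
    ramp = min(step / curriculum_steps, 1.0)      # 0 -> 1 over the curriculum
    weights = {
        "keypoint_imitation": 1.0,                # imitation prioritized from the start
        "joint_imitation": 0.5,
        "action_rate_penalty": -0.05 * ramp,      # penalties integrated gradually
        "torque_penalty": -0.01 * ramp,
        "feet_slip_penalty": -0.10 * ramp,
    }
    return sum(weights[name] * value for name, value in terms.items())
```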

4. Empirical Performance and Generalization

BFM demonstrates strong cross-task generalization in simulation environments (IsaacGym, Mujoco) and in deployment on physical platforms (Unitree G1). Evaluated tasks include whole-body motion tracking, VR teleoperation, and multimodal locomotion.

Quantitative measures (mean per-keypoint error, per-joint error, velocity tracking accuracy) indicate that BFM matches or surpasses specialist controllers and prior baselines (e.g., HOVER) even when control mode varies or novel behaviors are required.

  • Zero-shot Behavior Acquisition: The latent behavioral space learned by BFM supports extrapolation—such as generating forward rolls, side saltos, or butterfly kicks—without retraining from scratch. Challenging maneuvers that would traditionally lead to instability (e.g., a butterfly kick with off-balance dynamics) are accommodated through latent space modulation, as illustrated by the sketch below.
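
As a rough illustration of latent-space modulation, the sketch blends the latent codes of two known behaviors and decodes the result with the pretrained policy; `policy_decoder` and the 32-dimensional latent are hypothetical stand-ins for BFM's actual components.

```python
import torch

# Illustrative latent-space modulation: interpolate between two behavior
# latents to synthesize an intermediate or novel maneuver without retraining.
def blend_behaviors(z_a: torch.Tensor, z_b: torch.Tensor, alpha: float) -> torch.Tensor:
    """Linear interpolation in the learned behavioral latent space."""
    return (1.0 - alpha) * z_a + alpha * z_b

z_roll = torch.randn(32)    # e.g., latent code inferred for a forward roll
z_kick = torch.randn(32)    # e.g., latent code inferred for a kicking motion
z_new = blend_behaviors(z_roll, z_kick, alpha=0.3)
# actions = policy_decoder(proprioceptive_state, z_new)  # hypothetical decode call
```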

5. Real-World Applications

BFM’s flexible control interface supports diverse deployment scenarios:

  • Teleoperation: Human operators issue arbitrary control commands (e.g., from VR devices), which BFM translates into valid whole-body motions.
  • Motion Tracking: Real-time pose inputs from motion capture systems drive complex humanoid execution.
  • Dynamic Locomotion: Robust adaptation to challenging terrains and task configurations (search and rescue, entertainment robotics, HRI) is facilitated by the learned behavioral priors.

One practical limitation is sim-to-real transfer: despite domain randomization, discrepancies between simulated and real-world system dynamics can necessitate fine-tuning for platform-specific behaviors, especially where sensor noise and actuation delay are significant.

6. Future Directions and Research Opportunities

Plans for advancing BFM include:

  • Extension of the control interface to encompass higher-level modalities, e.g., natural language or symbolic instructions;
  • Enhanced masking strategies and latent space regularization to further improve composition and modulation of behaviors;
  • Exploration of rich latent representations for superior interpolation/extrapolation and residual learning;
  • Continued development in sim-to-real adaptation—potentially by incorporating additional proprioceptive constraints or advanced domain adaptation techniques.

The integration of BFM as a general behavior engine not only addresses scalability and general-purpose control, but also opens pathways for unified cognitive–physical architectures in next-generation humanoid robotic systems.

Summary Table: BFM Capabilities and Features

| Feature | Technical Realization | Empirical Outcome |
| --- | --- | --- |
| Unification of control modes | Masked online distillation + CVAE | Robust cross-task generalization |
| Behavioral knowledge extraction | Retargeted AMASS data + proxy agents | Zero-shot acquisition of novel skills |
| Flexibility in real-world settings | Bit-wise sampled control mask | Adaptation to teleoperation, motion tracking, locomotion |
| Latent space for behavior modulation | Goal-conditioned RL + structured latent variables | Efficient extrapolation, behavior composition |

The Behavior Foundation Model exemplifies the transition from task-specific humanoid control to a foundation-style paradigm grounded in behavioral generative modeling, highlighting both immediate capabilities and avenues for future research in flexible, scalable whole-body robot intelligence (Zeng et al., 17 Sep 2025).
