Cross-Embodied Foundation Models

Updated 21 November 2025
  • Cross-embodied foundation models are unified architectures that generalize across various physical and virtual embodiments by aligning diverse sensor, perception, and control spaces.
  • They leverage shared representations and transfer mechanisms to facilitate zero-shot or few-shot adaptation, effectively mitigating data fragmentation across systems.
  • Empirical results demonstrate state-of-the-art performance in tasks like navigation and manipulation, highlighting robust generalization and efficient skill transfer.

A cross-embodied foundation model is a unified machine learning architecture and training paradigm that enables a single model to represent, control, or reason across multiple physical or virtual embodiments—such as diverse robots, agents, or sensor modalities—while generalizing effectively across tasks and domains. These models leverage embodiment-agnostic representations and transfer mechanisms to align and bridge data, perception, and control spaces, eliminating or reducing the need for per-embodiment or per-task model tuning. Cross-embodied foundation modeling is motivated by the desire to achieve truly generalist agency, supporting transfer and coordination between heterogeneous systems, and maximizing data efficiency and generalization.

1. Foundations and Motivations

The principal goal of cross-embodied foundation models is to develop a unified model and interface that operate across diverse embodiments, tasks, and domains. This is driven by the need to overcome the limitations of standard large-scale vision-language models (VLMs) and embodied agents, which typically specialize in either simulation or a single robot platform and cannot handle cross-embodiment transfer or generalize to new morphologies without explicit re-training or heavy adaptation. In this context, embodiment refers to the physical or simulated structures (robots, agents, sensors) that perceive, act, or reason in their respective environments.

The motivation is threefold:

  • Data fragmentation: Robotic data is fragmented by embodiment-specific action and observation spaces, impeding shared learning and transfer.
  • Efficient transfer: A common representation space, capable of mapping across morphologies, is essential for zero-shot or few-shot transfer, rapid adaptation, and efficient use of large cross-domain datasets.
  • Generalist reasoning: Beyond perception and control, these models also aim to unify reasoning across tasks (e.g., navigation, manipulation, planning) and environments (digital-physical, navigation-driving, manipulation-autonomous driving) (Hao et al., 20 Nov 2025, Tan et al., 28 Oct 2025, Zheng et al., 17 Jan 2025, He et al., 3 Nov 2025).

2. Core Architectural and Representational Principles

Cross-embodied foundation models instantiate several key principles to achieve embodiment invariance and flexible adaptation: a shared backbone operating over embodiment-agnostic representations, universal or abstracted action spaces, lightweight per-embodiment encoders, heads, or adapters for grounding, and embodiment-identity conditioning at the input. A minimal architectural sketch follows.
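
As a concrete illustration of these principles, the following is a minimal PyTorch sketch; the module names, dimensions, and the generic transformer trunk are invented for illustration and do not reproduce the architecture of any cited paper. Heterogeneous observations pass through lightweight per-embodiment encoders into a shared token space, an embodiment-identity embedding conditions a common trunk, and small per-embodiment heads ground shared features into native action spaces.

```python
import torch
import torch.nn as nn

class CrossEmbodimentPolicy(nn.Module):
    """Illustrative sketch: shared trunk + per-embodiment I/O heads."""

    def __init__(self, embodiments, d_model=256, n_layers=4, n_heads=8):
        super().__init__()
        # Per-embodiment encoders map heterogeneous observations into a common token space.
        self.obs_encoders = nn.ModuleDict({
            name: nn.Linear(spec["obs_dim"], d_model) for name, spec in embodiments.items()
        })
        # Learned embodiment-identity embeddings condition the shared trunk.
        self.embodiment_emb = nn.ParameterDict({
            name: nn.Parameter(torch.randn(1, 1, d_model)) for name in embodiments
        })
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=n_layers)  # shared across embodiments
        # Lightweight per-embodiment heads ground shared features into native action spaces.
        self.action_heads = nn.ModuleDict({
            name: nn.Linear(d_model, spec["act_dim"]) for name, spec in embodiments.items()
        })

    def forward(self, embodiment, obs_tokens):
        # obs_tokens: (batch, seq, obs_dim) for this embodiment
        x = self.obs_encoders[embodiment](obs_tokens)
        x = torch.cat([self.embodiment_emb[embodiment].expand(x.size(0), -1, -1), x], dim=1)
        h = self.trunk(x)
        return self.action_heads[embodiment](h[:, 0])  # predict from the identity token


policy = CrossEmbodimentPolicy({
    "quadruped": {"obs_dim": 48, "act_dim": 12},
    "arm7dof": {"obs_dim": 64, "act_dim": 7},
})
action = policy("arm7dof", torch.randn(2, 10, 64))  # -> shape (2, 7)
```

The point of this split is that only the thin encoder and head need to change when a new embodiment is added, while the shared trunk and its learned structure are reused.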

3. Model Training Paradigms and Objectives

Cross-embodied foundation models employ multi-stage or joint training to propagate shared structure across embodiments, tasks, and domains.

Optimization objectives are selected to encourage cross-embodiment equivalence and compositionality, e.g., via per-embodiment imitation or control losses on a shared backbone, discrete action-codebook (VQ-style) objectives, world-model prediction losses, and affordance-equivalence alignment terms. A schematic joint training step is sketched below.
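
A joint training step over mixed-embodiment batches might look as follows; the behavior-cloning term, the feature-alignment penalty and its weighting, and the hypothetical `policy.encode` interface are all assumptions made for illustration rather than the objective of a specific cited model.

```python
import torch
import torch.nn.functional as F

def joint_training_step(policy, batches, optimizer, align_weight=0.1):
    """One joint optimization step over a mixture of embodiments (illustrative sketch).

    Assumes `policy(name, obs)` returns actions in that embodiment's native space and
    `policy.encode(name, obs)` returns shared-trunk features (a hypothetical interface).
    batches: dict mapping embodiment name -> (obs_tokens, expert_actions)
    """
    bc_loss = 0.0
    feature_means = []
    for name, (obs, expert_actions) in batches.items():
        pred = policy(name, obs)
        bc_loss = bc_loss + F.mse_loss(pred, expert_actions)              # per-embodiment imitation
        feature_means.append(policy.encode(name, obs).mean(dim=(0, 1)))   # pooled trunk features

    # Assumed alignment term: pull pooled features of different embodiments toward their mean,
    # encouraging an embodiment-invariant representation for task-matched batches.
    feats = torch.stack(feature_means)                                     # (num_embodiments, d_model)
    align_loss = ((feats - feats.mean(dim=0, keepdim=True)) ** 2).mean()

    loss = bc_loss + align_weight * align_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```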

4. Transfer Mechanisms and Alignment Strategies

A critical challenge is bridging heterogeneity in action, observation, and control spaces. Several mechanisms have been operationalized:

  • Universal action spaces: Establishing a shared, discrete VQ-VAE-style codebook of atomic behaviors (denoted U) enables policy transfer and data aggregation across embodiments; per-embodiment lightweight MLP heads suffice for grounding (Zheng et al., 17 Jan 2025). A minimal sketch appears after this list.
  • Particle-based world modeling: Particle graphs with message-passing GNNs provide a kinematics-agnostic dynamics interface, abstracting away joint or link specifics. Actions as particle displacements enable control of both human and robot hands (He et al., 3 Nov 2025).
  • Affordance space equivariance: A latent vector fusing object, action, and effect provides an "interlingua" for cross-embodiment transfer, where equivalent effects across distinct actions or agents are mapped to a shared affordance embedding (Aktas et al., 24 Apr 2024).
  • Fine-tuning adapters: Low-rank adaptation (LoRA) inside backbone layers, often amounting to only a small fraction of the full parameter count, allows specialization for new tasks or sensor modalities without catastrophic interference (Guo et al., 22 Aug 2024, Zhang et al., 30 May 2024); see the LoRA sketch after this list.
  • Prompt or tokenized context: Embodiment identity and time/view tokens injected at the perception stage enable zero-shot handling of varying sensor arrays and temporal horizons (Zhang et al., 15 Sep 2025).
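
The universal-action-space idea from the first bullet can be sketched with generic VQ-VAE-style components; the codebook size, nearest-neighbour quantization with a straight-through estimator, and MLP grounding heads below are standard building blocks, not the exact UniAct implementation.

```python
import torch
import torch.nn as nn

class UniversalActionCodebook(nn.Module):
    """Generic sketch of a shared codebook of atomic behaviors U with
    per-embodiment grounding heads (not the cited paper's exact model)."""

    def __init__(self, embodiments, num_codes=256, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)        # shared atomic behaviors
        self.heads = nn.ModuleDict({                             # per-embodiment grounding
            name: nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                nn.Linear(128, act_dim))
            for name, act_dim in embodiments.items()
        })

    def quantize(self, z):
        # Nearest-neighbour lookup with a straight-through gradient estimator.
        dists = torch.cdist(z, self.codebook.weight)             # (batch, num_codes)
        codes = dists.argmin(dim=-1)
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()                             # straight-through
        return z_q, codes

    def decode(self, embodiment, z):
        z_q, codes = self.quantize(z)
        return self.heads[embodiment](z_q), codes


model = UniversalActionCodebook({"ur5_arm": 6, "quadruped": 12})
latent = torch.randn(4, 64)                                      # encoder output (assumed)
ur5_actions, codes = model.decode("ur5_arm", latent)             # the same codes can drive either head
```

During training one would add the usual VQ codebook/commitment and reconstruction losses; deploying to a new robot then amounts to fitting a fresh grounding head against the frozen codebook, consistent with the lightweight-head adaptation described above.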

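The adapter route from the fourth bullet reduces to a trainable low-rank residual on top of frozen pretrained weights. The wrapper below is a generic LoRA sketch; the rank, scaling, and initialization follow common conventions rather than the cited papers' configurations.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank residual (generic LoRA sketch)."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)        # start as an identity-preserving residual
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")   # only the low-rank factors are trained (~3%)
```
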
5. Empirical Results and Generalization Benchmarks

Cross-embodied foundation models have demonstrated strong empirical performance:

  • Outperforming specialized models: Architectures such as MiMo-Embodied and BLM-1 attain state-of-the-art or superior results across 17+ embodied AI and 12+ autonomous driving/public digital benchmarks, with mean gains of 3–20 percentage points depending on the task (Hao et al., 20 Nov 2025, Tan et al., 28 Oct 2025).
  • Zero-shot and few-shot transfer: Both navigation (NavFoM) and manipulation (UniAct, Affordance Blending, GenRL) models outperform baselines in zero-shot cross-embodiment settings—e.g., transferring from human to robot hands (He et al., 3 Nov 2025), or new mobile robots after only lightweight head fine-tuning (Zheng et al., 17 Jan 2025).
  • Adaptation with minimal data: LoRA adapters or universal codebook fine-tuning require as few as 115 labeled samples for superior performance in geoscience imaging (Guo et al., 22 Aug 2024) or rapid deployment to new robots (Zheng et al., 17 Jan 2025).
  • Unified modeling: MiMo-Embodied and NavFoM show that a single backbone can handle planning, perception, affordance, and language reasoning without per-task or per-embodiment heads (Hao et al., 20 Nov 2025, Zhang et al., 15 Sep 2025).

Table: Quantitative Cross-Embodiment Results (selected)

| Model | Task Domain(s) | Cross-Embodiment Generalization | SOTA Comparison |
|---|---|---|---|
| UniAct-0.5B | Multi-robot manipulation | 65% LIBERO sim SR | +17 pp over OpenVLA (7B) |
| MiMo-Embodied | Embodied AI + driving | +10–20 pp on embodied AI | SOTA on 29 benchmarks |
| BLM-1 | Digital & physical robot tasks | +6% / +3% over prior | Superior in digital + physical |
| NavFoM | Navigation, multi-robot/drone | Zero-shot, 4–12% SR | SOTA on navigation tasks |
| AffordanceBN | Object–action–effect transfer | Direct, 90% success | – |

6. Practical Considerations and Limitations

While cross-embodied foundation models advance generalization and data efficiency, specific constraints remain:

  • Perceptual bottlenecks: Particle-based or affordance representations require dense, high-fidelity perception; single-view or low-resolution settings can degrade model input and prediction.
  • Forward model assumptions: Control transfer depends on known forward kinematics or on-the-fly adaptation modules, especially for novel robot morphologies (He et al., 3 Nov 2025).
  • Action abstraction granularity: Universal action codebooks must balance coverage of embodied behaviors with manageable codebook size. Excessive abstraction risks omitting control-relevant nuances, while overly fine discretization impedes transfer (Zheng et al., 17 Jan 2025).
  • Planning horizon: World-model-based controllers often require short planning horizons to mitigate error compounding; extending these with uncertainty quantification or hierarchical planning is ongoing work (He et al., 3 Nov 2025). A short-horizon planning sketch follows this list.
  • Scaling and domain gaps: Full cross-domain generalization (e.g., from manipulation to navigation or driving) is facilitated by multi-stage or chain-of-thought RL pipelines and requires extremely broad, carefully curated pre-training corpora (Hao et al., 20 Nov 2025, Tan et al., 28 Oct 2025).
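
The planning-horizon point can be made concrete with a random-shooting model-predictive control sketch; the `world_model` and `cost_fn` interfaces, horizon, and sample count are illustrative assumptions, not the controller of the cited work. Keeping the horizon short limits how far one-step prediction errors can compound.

```python
import torch

def short_horizon_plan(world_model, cost_fn, state, act_dim, horizon=5, n_samples=256):
    """Random-shooting MPC sketch with a deliberately short horizon (assumed interfaces).

    world_model(states, actions) -> next_states, for a batch of states/actions.
    cost_fn(states) -> per-sample cost; lower is better.
    """
    states = state.unsqueeze(0).expand(n_samples, -1).clone()     # (n_samples, state_dim)
    action_seqs = torch.randn(n_samples, horizon, act_dim)        # candidate action sequences
    total_cost = torch.zeros(n_samples)
    for t in range(horizon):
        states = world_model(states, action_seqs[:, t])           # roll the learned dynamics forward
        total_cost += cost_fn(states)
    best = total_cost.argmin()
    return action_seqs[best, 0]                                   # execute only the first action (MPC)
```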

7. Prospects and Future Directions

Several research directions are highlighted:

  • End-to-end embodied perception: Direct learning of particle/object-based state representations from raw images, potentially with in-the-loop RL or contrastive objectives (He et al., 3 Nov 2025).
  • Action-conditional model uncertainty: Integration of uncertainty estimation into world models to enable longer-horizon, safer planning (He et al., 3 Nov 2025).
  • Hierarchical and modular adapters: Combining universal action spaces or affordance latents with per-embodiment adapters for non-anthropomorphic or weakly structured morphologies (Zheng et al., 17 Jan 2025, Aktas et al., 24 Apr 2024).
  • Language grounding and planning: Formalizing mappings from affordance or action codebooks to natural-language descriptors, and composing sequential skills for multi-step embodied planning (Aktas et al., 24 Apr 2024, Tan et al., 28 Oct 2025).
  • Scaling up datasets and tasks: Building larger, more diverse cross-embodiment datasets to push the limits of zero-shot generalization, especially in complex manipulation and multi-agent settings (He et al., 3 Nov 2025, Hao et al., 20 Nov 2025).

The convergence of vision-language backbones, universal action abstractions, and world modeling architectures in cross-embodied foundation models marks a decisive step toward lifelong, generalist agent architectures capable of robust, data-efficient skill acquisition, transfer, and reasoning across heterogeneous embodiments and real-world domains.
