Cross-Embodied Foundation Models
- Cross-embodied foundation models are unified architectures that generalize across various physical and virtual embodiments by aligning diverse sensor, perception, and control spaces.
- They leverage shared representations and transfer mechanisms to facilitate zero-shot or few-shot adaptation, effectively mitigating data fragmentation across systems.
- Empirical results demonstrate state-of-the-art performance in tasks like navigation and manipulation, highlighting robust generalization and efficient skill transfer.
A cross-embodied foundation model is a unified machine learning architecture and training paradigm that enables a single model to represent, control, or reason across multiple physical or virtual embodiments—such as diverse robots, agents, or sensor modalities—while generalizing effectively across tasks and domains. These models leverage embodiment-agnostic representations and transfer mechanisms to align and bridge data, perception, and control spaces, eliminating or reducing the need for per-embodiment or per-task model tuning. Cross-embodied foundation modeling is motivated by the desire to achieve truly generalist agency, supporting transfer and coordination between heterogeneous systems, and maximizing data efficiency and generalization.
1. Foundations and Motivations
The principal goal of cross-embodied foundation models is to develop a unified model and interface operating across diverse embodiments, tasks, and domains. This is driven by the need to overcome the limitations of standard large-scale vision-language models (VLMs) and embodied agents, which often specialize in either simulation or a single robot platform and cannot handle cross-embodiment transfer or generalize to new morphologies without explicit re-training or heavy adaptation. In this context, embodiment refers to the physical or simulated structures (robots, agents, sensors) that perceive, act, or reason in their respective environments.
The motivation is multifold:
- Data fragmentation: Robotic data is fragmented by embodiment-specific action and observation spaces, impeding shared learning and transfer.
- Efficient transfer: A common representation space, capable of mapping across morphologies, is essential for zero-shot or few-shot transfer, rapid adaptation, and efficient use of large cross-domain datasets.
- Generalist reasoning: Beyond perception and control, these models also aim to unify reasoning across tasks (e.g., navigation, manipulation, planning) and across environments and domains (e.g., digital and physical settings, navigation and driving, manipulation and autonomous driving) (Hao et al., 20 Nov 2025, Tan et al., 28 Oct 2025, Zheng et al., 17 Jan 2025, He et al., 3 Nov 2025).
2. Core Architectural and Representational Principles
Cross-embodied foundation models instantiate several key principles to facilitate embodiment-invariance and flexible adaptation:
- Unified state/action encoding: Representing states and actions in an abstract, shared space—such as 3D particle sets for hands and objects (He et al., 3 Nov 2025), universal discrete action codebooks (Zheng et al., 17 Jan 2025), or latent affordance vectors (Aktas et al., 24 Apr 2024)—enables commensurate representation across morphologies.
- Backbone generalization: Foundation backbones, be they transformer-based VLMs (Tan et al., 28 Oct 2025, Hao et al., 20 Nov 2025, Zhang et al., 15 Sep 2025), large ViT encoders (Guo et al., 22 Aug 2024), or promptable segmentation models (Zhang et al., 30 May 2024), are parameterized and pre-trained to facilitate transfer, often with minimal adaptation.
- Alignment/bridging interfaces: Adaptation modules such as LoRA adapters (Guo et al., 22 Aug 2024, Zhang et al., 30 May 2024), Perceiver compressions (Tan et al., 28 Oct 2025), or connector/aligner layers (Mazzaglia et al., 26 Jun 2024) bind embodiment-agnostic representations to embodiment-specific policy or decoder heads.
- Task, space, and embodiment tags: Explicitly encoding embodiment, camera, temporal horizon, or spatial context via indicator tokens (Zhang et al., 15 Sep 2025), tags, or group labels helps the backbone disambiguate between embodiments and contexts while maintaining a shared representation; a minimal sketch combining indicator tokens with per-embodiment heads follows this list.
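To make these principles concrete, the following is a minimal PyTorch sketch of a shared transformer trunk combined with learned embodiment indicator tokens and lightweight per-embodiment action heads. The module names, dimensions, and the two example embodiments are illustrative assumptions, not the architecture of any cited system.

```python
# Minimal sketch (assumed names/dimensions; not taken from any cited system).
import torch
import torch.nn as nn

class CrossEmbodimentBackbone(nn.Module):
    """Shared transformer trunk with embodiment indicator tokens and
    per-embodiment action heads."""

    def __init__(self, d_model=256, n_layers=4, n_heads=8,
                 embodiments=("arm_7dof", "quadruped_12dof"),
                 action_dims=(7, 12)):
        super().__init__()
        # One learned indicator token per embodiment (the "tags" of Section 2).
        self.embodiment_tokens = nn.ParameterDict({
            name: nn.Parameter(torch.randn(1, 1, d_model) * 0.02)
            for name in embodiments
        })
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)  # shared across embodiments
        # Lightweight per-embodiment heads map shared features to native controls.
        self.heads = nn.ModuleDict({
            name: nn.Linear(d_model, dim)
            for name, dim in zip(embodiments, action_dims)
        })

    def forward(self, obs_tokens, embodiment):
        # obs_tokens: (batch, seq, d_model) embodiment-agnostic observation features.
        tag = self.embodiment_tokens[embodiment].expand(obs_tokens.size(0), -1, -1)
        feats = self.trunk(torch.cat([tag, obs_tokens], dim=1))
        return self.heads[embodiment](feats[:, 0])  # read out action at the tag position

# The same trunk serves both embodiments; only the tag and head differ.
model = CrossEmbodimentBackbone()
obs = torch.randn(2, 16, 256)
arm_action = model(obs, "arm_7dof")          # (2, 7)
quad_action = model(obs, "quadruped_12dof")  # (2, 12)
```

The design point worth noting is that all trunk parameters are shared; only the indicator token and the final head are embodiment-specific, which is what allows heterogeneous data to be aggregated in a single representation space.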
3. Model Training Paradigms and Objectives
Cross-embodied foundation models employ multi-stage or joint training to propagate shared structure across embodiments, tasks, and domains.
- Stage separation: Digital-only instruction-following pre-training is refined by injecting physical/embodied knowledge, followed by policy-module training with the generalist backbone frozen (Tan et al., 28 Oct 2025, Hao et al., 20 Nov 2025).
- Multi-stage adaptation: For example, MiMo-Embodied applies sequential supervised tuning on embodied AI, driving, chain-of-thought, and RL objectives, with each stage building cross-domain and cross-task generalization (Hao et al., 20 Nov 2025).
- Behavior cloning and reconstruction: Universal action or affordance spaces are learned via behavior-cloning losses, with per-embodiment decoding heads translating back to native control commands (Zheng et al., 17 Jan 2025, Aktas et al., 24 Apr 2024, Zhang et al., 30 May 2024); see the codebook sketch after this list.
- World-model imagination: Generative world models, such as recurrent state-space models (RSSMs) or graph message-passing architectures, are coupled with reinforcement learning inside the learned dynamics space for "imagination-based" policy optimization, sidestepping the need for explicit reward engineering or real-world samples during policy learning (Mazzaglia et al., 26 Jun 2024, He et al., 3 Nov 2025).
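As a concrete illustration of behavior cloning into a universal action space, the sketch below quantizes embodiment-agnostic latents against a shared discrete codebook (via a straight-through estimator) and decodes them with lightweight per-embodiment MLP heads. The codebook size, network widths, and the embodiment names are illustrative assumptions, not the cited designs.

```python
# Minimal sketch of behavior cloning into a shared discrete action codebook
# with per-embodiment decoder heads (all sizes and names are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class UniversalActionCodebook(nn.Module):
    def __init__(self, n_codes=64, d_code=128):
        super().__init__()
        self.codes = nn.Embedding(n_codes, d_code)  # shared vocabulary of atomic behaviors

    def quantize(self, z):
        # Nearest-code lookup with a straight-through gradient estimator.
        dists = torch.cdist(z, self.codes.weight)   # (batch, n_codes)
        idx = dists.argmin(dim=-1)
        z_q = self.codes(idx)
        return z + (z_q - z).detach(), idx

class CrossEmbodimentPolicy(nn.Module):
    def __init__(self, action_dims, d_obs=64, d_code=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_obs, 256), nn.ReLU(),
                                     nn.Linear(256, d_code))   # embodiment-agnostic
        self.codebook = UniversalActionCodebook(d_code=d_code)
        # Lightweight per-embodiment heads ground codes in native control commands.
        self.decoders = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(d_code, 128), nn.ReLU(), nn.Linear(128, dim))
            for name, dim in action_dims.items()
        })

    def forward(self, obs, embodiment):
        z_q, _ = self.codebook.quantize(self.encoder(obs))
        return self.decoders[embodiment](z_q)

# Behavior-cloning step: regress decoded actions onto demonstrations.
policy = CrossEmbodimentPolicy({"arm_7dof": 7, "mobile_base": 2})
obs, demo_action = torch.randn(8, 64), torch.randn(8, 7)
loss = F.mse_loss(policy(obs, "arm_7dof"), demo_action)
loss.backward()
```

Because the codebook is shared, demonstrations from any embodiment update the same discrete action vocabulary; grounding a new robot then only requires fitting (or fine-tuning) its small decoder head.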
Optimization objectives are selected to encourage cross-embodiment equivalence and compositionality (a schematic composite objective is sketched after the list), e.g., via:
- Latent reconstruction or matching losses for effect, action, and object trajectories (Aktas et al., 24 Apr 2024).
- Kullback-Leibler (KL) or cosine similarity matching between alignment modules (Mazzaglia et al., 26 Jun 2024).
- Cross-entropy and L1/L2 regression for perception, spatial, and planning tasks (Hao et al., 20 Nov 2025).
- Policy gradient methods for RL fine-tuning, including group-based or imagination-based training (Hao et al., 20 Nov 2025, Mazzaglia et al., 26 Jun 2024).
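Schematically, these terms are often combined into a single weighted objective; the composite below is an illustrative sketch (the weights $\lambda$, encoder $q_\phi$, and the embodiment pairing $e_1, e_2$ are placeholder notation, not a formula from any cited paper):

$$
\mathcal{L}_{\text{total}}
= \mathcal{L}_{\mathrm{BC}}
+ \lambda_{\mathrm{rec}}\,\bigl\lVert \hat{z} - z \bigr\rVert_2^2
+ \lambda_{\mathrm{align}}\, D_{\mathrm{KL}}\!\bigl(q_\phi(z \mid x_{e_1}) \,\big\Vert\, q_\phi(z \mid x_{e_2})\bigr)
- \lambda_{\mathrm{RL}}\, \mathbb{E}_{\pi_\theta}\!\Bigl[\textstyle\sum_t \gamma^t r_t\Bigr]
$$

Here $\mathcal{L}_{\mathrm{BC}}$ is a cross-entropy or L1/L2 behavior-cloning term, the reconstruction term matches latent effect, action, or object trajectories, the KL term aligns encodings of corresponding observations under embodiments $e_1$ and $e_2$, and the final term is the expected return (maximized) during RL fine-tuning.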
4. Transfer Mechanisms and Alignment Strategies
A critical challenge is bridging heterogeneity in action, observation, and control spaces. Several mechanisms have been operationalized:
- Universal action spaces: Establishing a shared, discrete VQ-VAE-style codebook of atomic behaviors enables policy transfer and data aggregation across embodiments; lightweight per-embodiment MLP heads suffice for grounding (Zheng et al., 17 Jan 2025).
- Particle-based world modeling: Particle graphs with message-passing GNNs provide a kinematics-agnostic dynamics interface, abstracting away joint or link specifics. Actions as particle displacements enable control of both human and robot hands (He et al., 3 Nov 2025).
- Affordance space equivariance: A latent vector fusing object, action, and effect provides an "interlingua" for cross-embodiment transfer, where equivalent effects across distinct actions or agents are mapped to a shared affordance embedding (Aktas et al., 24 Apr 2024).
- Fine-tuning adapters: Low-rank LoRA adaptation inside backbone layers, often touching only a small fraction of the full parameter count, allows specialization for new tasks or sensor modalities without catastrophic interference (Guo et al., 22 Aug 2024, Zhang et al., 30 May 2024); a minimal LoRA sketch follows this list.
- Prompt or tokenized context: Embodiment identity and time/view tokens injected at the perception stage enable zero-shot handling of varying sensor arrays and temporal horizons (Zhang et al., 15 Sep 2025).
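As an illustration of the adapter mechanism, the following minimal PyTorch sketch wraps a frozen linear layer with a trainable low-rank (LoRA) update; the rank, scaling, and choice of wrapped layer are illustrative assumptions, not the adapter code of any cited system.

```python
# Minimal LoRA sketch (rank, scaling, and wrapped layer are illustrative).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update W + (alpha/r) * B @ A."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # backbone weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: identity at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T) @ self.lora_B.T

# Only the low-rank factors are trained when specializing the frozen backbone
# to a new embodiment or sensor modality.
layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.3%}")
```

With this wrapping, the trainable parameters amount to roughly 2% of the layer in the example above, which is why such adapters can be fit from small embodiment- or modality-specific datasets without disturbing the shared backbone.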
5. Empirical Results and Generalization Benchmarks
Cross-embodied foundation models have demonstrated strong empirical performance:
- Outperforming specialized models: Architectures such as MiMo-Embodied and BLM attain state-of-the-art or superior results across 17+ embodied AI and 12+ autonomous driving/public digital benchmarks, with mean gains of 3–20 percentage points depending on task (Hao et al., 20 Nov 2025, Tan et al., 28 Oct 2025).
- Zero-shot and few-shot transfer: Both navigation (NavFoM) and manipulation (UniAct, Affordance Blending, GenRL) models outperform baselines in zero-shot cross-embodiment settings—e.g., transferring from human to robot hands (He et al., 3 Nov 2025), or new mobile robots after only lightweight head fine-tuning (Zheng et al., 17 Jan 2025).
- Adaptation with minimal data: LoRA adapters or universal codebook fine-tuning require as few as 115 labeled samples for superior performance in geoscience imaging (Guo et al., 22 Aug 2024) or rapid deployment to new robots (Zheng et al., 17 Jan 2025).
- Unified modeling: MiMo-Embodied and NavFoM show that a single backbone can handle planning, perception, affordance, and language reasoning without per-task or per-embodiment heads (Hao et al., 20 Nov 2025, Zhang et al., 15 Sep 2025).
Table: Quantitative Cross-Embodiment Results (selected; SR = success rate)
| Model | Task Domain(s) | Cross-Embodiment Generalization | SOTA Comparison |
|---|---|---|---|
| UniAct-0.5B | Multi-robot manipulation | 65% LIBERO sim SR | +17pp over OpenVLA (7B) |
| MiMo-Embodied | Embodied AI + Driving | +10–20pp embodied AI | SOTA on 29 benchmarks |
| BLM | Digital & physical robot tasks | +6%/+3% over prior | Superior in digital+physical |
| NavFoM | Navigation, multi-robot/drone | Zero-shot, 4–12% SR | SOTA navigation tasks |
| AffordanceBN | Object–action–effect transfer | Direct, 90% success | - |
6. Practical Considerations and Limitations
While cross-embodied foundation models advance generalization and data efficiency, specific constraints remain:
- Perceptual bottlenecks: Particle-based or affordance representations require dense, high-fidelity perception; single-view or low-resolution settings can degrade model input and prediction.
- Forward model assumptions: Control transfer depends on known forward kinematics or on-the-fly adaptation modules, especially for novel robot morphologies (He et al., 3 Nov 2025).
- Action abstraction granularity: Universal action codebooks must balance coverage of embodied behaviors with manageable codebook size. Excessive abstraction risks omitting control-relevant nuances, while overly fine discretization impedes transfer (Zheng et al., 17 Jan 2025).
- Planning horizon: World-model-based controllers often require short planning horizons to mitigate error compounding; extending these with uncertainty quantification or hierarchical planning is ongoing work (He et al., 3 Nov 2025).
- Scaling and domain gaps: Full cross-domain generalization (e.g., from manipulation to navigation or driving) is facilitated by multi-stage or chain-of-thought RL pipelines and requires extremely broad and well-scripted pre-training corpora (Hao et al., 20 Nov 2025, Tan et al., 28 Oct 2025).
7. Prospects and Future Directions
Several research directions are highlighted:
- End-to-end embodied perception: Direct learning of particle/object-based state representations from raw images, potentially with in-the-loop RL or contrastive objectives (He et al., 3 Nov 2025).
- Action-conditional model uncertainty: Integration of uncertainty estimation into world models to enable longer-horizon, safer planning (He et al., 3 Nov 2025).
- Hierarchical and modular adapters: Combining universal action spaces or affordance latents with per-embodiment adapters for non-anthropomorphic or weakly structured morphologies (Zheng et al., 17 Jan 2025, Aktas et al., 24 Apr 2024).
- Language grounding and planning: Formalizing mappings from affordance or action codebooks to natural-language descriptors, and composing sequential skills for multi-step embodied planning (Aktas et al., 24 Apr 2024, Tan et al., 28 Oct 2025).
- Scaling up datasets and tasks: Building larger, more diverse cross-embodiment datasets to push the limits of zero-shot generalization, especially in complex manipulation and multi-agent settings (He et al., 3 Nov 2025, Hao et al., 20 Nov 2025).
The convergence of vision-language backbones, universal action abstractions, and world modeling architectures in cross-embodied foundation models marks a decisive step toward lifelong, generalist agent architectures capable of robust, data-efficient skill acquisition, transfer, and reasoning across heterogeneous embodiments and real-world domains.