Multi-Embodiment Pretraining in Robotics
- Multi-Embodiment Pretraining is a paradigm that trains models using data from diverse embodiments, enabling policies to generalize across varied robots and sensor configurations.
- It employs unified action-observation spaces, morphological descriptor conditioning, and modular network designs to facilitate rapid transfer and effective zero-shot deployment.
- Empirical results demonstrate significant gains in few-shot adaptation, cross-robot robustness, and performance improvements over single-domain training methods.
Multi-Embodiment Pretraining refers to a broad class of methods by which models—typically in robotics, reinforcement learning, or representation learning—are trained using data that spans multiple embodiments, i.e., variations in agent morphology, sensor configuration, actuation topology, or even species (e.g., humans and robots). This paradigm enables a single policy, model, or representation to generalize across diverse platforms, supporting rapid adaptation, transfer, and zero-shot deployment to new robots or tasks. Multi-Embodiment Pretraining is variously realized through concatenation of diverse pretrained embeddings, unified action/observation architectures, morphological descriptor conditioning, equivariant or modular network design, and large-scale data curation. The following sections synthesize contemporary approaches, objectives, methodological trends, and the empirical impacts of multi-embodiment pretraining across the relevant subfields.
1. Formal Problem Setting and Motivations
Multi-Embodiment Pretraining arises from the need to transcend the limitations of models and controllers confined to a single morphology, sensor suite, or platform. In reinforcement learning and robotic control, the traditional paradigm trains a separate policy π_e for each embodiment e. Multi-Embodiment Pretraining instead seeks a universal policy π(a | o, c_e), where the descriptor c_e encodes the embodiment's parameters (joint structure, sizes, masses, calibration), or where actions/observations are mapped to a shared, normalized, or modular space. In language and vision, analogous approaches concatenate multiple pretrained embeddings to create a more diverse, expressive representation than is possible with any single embedding alone, thereby improving performance across a range of tasks and domains.
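Schematically (notation ours), the shift is from one optimization per embodiment to a single policy conditioned on an embodiment descriptor and trained on pooled data:

```latex
% Per-embodiment training: a separate policy for each embodiment e
\pi_e^{*} \;=\; \arg\max_{\pi_e}\;
  \mathbb{E}_{\tau \sim \pi_e}\Big[\textstyle\sum_t \gamma^t\, r(s_t, a_t)\Big],
\qquad e = 1, \dots, E

% Multi-embodiment pretraining: one policy conditioned on a descriptor c_e,
% optimized over data pooled from all embodiments
\pi^{*} \;=\; \arg\max_{\pi}\; \sum_{e=1}^{E}\;
  \mathbb{E}_{\tau \sim \pi(\,\cdot \mid \cdot\,,\, c_e)}\Big[\textstyle\sum_t \gamma^t\, r(s_t, a_t)\Big]
```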
Motivations include:
- Robust transfer to novel robots, especially in few-shot settings (Lin et al., 30 Nov 2025, Cho et al., 2024, Yu et al., 2022).
- Efficient use of heterogeneous prior demonstrations, including human videos/teleoperation (Luo et al., 19 Jan 2026, Niu et al., 19 Jun 2025, Lee et al., 26 Nov 2025).
- Emergent skills and performance boosts on data-poor platforms by leveraging knowledge from data-rich ones (Collaboration et al., 2023, Yang et al., 13 Jun 2025, Tan et al., 28 Oct 2025).
- Scalable frameworks for generalist, “foundation” robotics models (Luo et al., 19 Jan 2026, Collaboration et al., 2023, Tan et al., 28 Oct 2025).
2. Approaches: Representation Unification and Embodiment Conditioning
a) Action and Observation Space Alignment
A central methodological axis is the alignment (or unification) of action and observation spaces across embodiments. This is achieved through:
- Zero-padding and masking of actions/observations to a common maximum dimensionality with explicit masking for valid elements (Yang et al., 13 Jun 2025, Lin et al., 30 Nov 2025).
- Joint-level tokenization, representing each state and action as sequences of per-joint tokens, supporting arbitrary morphologies and actuation configurations (Cho et al., 2024).
- Sparse “slot”-based unified action spaces, mapping each embodiment’s controls into a fixed-dimensional vector with semantic slots (end-effector pose, gripper width, base velocity), using fixed, sparse linear maps (Luo et al., 19 Jan 2026).
- Modular architectures with per-modality tokenizers and detokenizers, enabling explicit masking and information routing depending on available sensor/effector modalities (Niu et al., 19 Jun 2025).
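The padding-and-masking scheme in the first bullet can be sketched in a few lines (a minimal NumPy illustration; the 14-dim shared space and the masked loss are our choices, not a specific paper's implementation):

```python
import numpy as np

MAX_ACTION_DIM = 14  # hypothetical shared space, e.g. dual 7-DoF arms

def pad_action(action: np.ndarray, max_dim: int = MAX_ACTION_DIM):
    """Zero-pad an embodiment's action into the shared dimensionality.

    Returns the padded action and a boolean mask marking valid entries,
    so losses are computed only over real action dimensions.
    """
    d = action.shape[-1]
    padded = np.zeros(max_dim, dtype=action.dtype)
    padded[:d] = action
    mask = np.zeros(max_dim, dtype=bool)
    mask[:d] = True
    return padded, mask

# A 7-DoF arm action embedded into the shared 14-dim space:
a7 = np.linspace(-1.0, 1.0, 7)
padded, mask = pad_action(a7)

# Masked behavior-cloning loss: padded dimensions contribute nothing.
pred = np.zeros(MAX_ACTION_DIM)
loss = np.sum(((pred - padded) ** 2) * mask) / mask.sum()
```

The mask is what keeps gradients from flowing into padding slots; without it, embodiments with small action spaces would be penalized on dimensions they do not control.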
b) Morphological Descriptor Conditioning
Several approaches condition models on explicit, learned, or structured morphological descriptors, such as link lengths, masses, PD gains, or kinematic parameters, provided as additional inputs at each timestep (Bohlinger et al., 2 Sep 2025, Yu et al., 2022, Lin et al., 30 Nov 2025). Attention-based encoding of joint descriptions and dynamic properties further supports embodiment-aware reasoning (Bohlinger et al., 2 Sep 2025).
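A minimal sketch of descriptor conditioning, using a hypothetical flat encoding of morphology parameters (real systems may instead use learned or attention-based encoders over per-joint descriptions):

```python
import numpy as np

def embodiment_descriptor(link_lengths, masses, pd_gains):
    """Flatten morphology parameters into a fixed descriptor vector.

    Illustrative encoding only; the cited works differ in which
    parameters are included and how they are embedded.
    """
    return np.concatenate([
        np.asarray(link_lengths, dtype=np.float64),
        np.asarray(masses, dtype=np.float64),
        np.asarray(pd_gains, dtype=np.float64),
    ])

def conditioned_input(obs, descriptor):
    """Policy input at each timestep: observation plus descriptor."""
    return np.concatenate([obs, descriptor])

desc = embodiment_descriptor(link_lengths=[0.3, 0.25],
                             masses=[1.2, 0.8],
                             pd_gains=[50.0, 8.0])
obs = np.zeros(10)  # placeholder proprioceptive observation
x = conditioned_input(obs, desc)
```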
c) Equivariant and Geometry-Aware Policy Design
Formulations that embed embodiment configuration symmetry into the structure of the model have emerged, especially in vision-language-action (VLA) domains. Embodiment-equivariant policies are constructed such that the agent’s output transforms commensurately with redefinitions of base, camera, or end-effector frames, enforced analytically in action decoders and architecture (Chen et al., 18 Sep 2025). Geometry-aware attention and positional encoding methods augment spatial reasoning capabilities across embodiments.
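The equivariance property can be illustrated with a toy 2D Cartesian policy (ours, not the cited paper's construction): because the policy depends only on the relative goal vector, re-expressing its inputs in a rotated base frame rotates the output action by the same rotation.

```python
import numpy as np

def rot2d(theta):
    """2D rotation matrix for a base-frame redefinition."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def policy(eef_pos, goal_pos):
    """Toy Cartesian policy: step the end effector toward the goal.

    Equivariant by construction, since it uses only the relative vector.
    """
    return 0.5 * (goal_pos - eef_pos)

eef = np.array([0.2, 0.1])
goal = np.array([0.6, 0.4])
R = rot2d(np.pi / 3)

a = policy(eef, goal)
a_rot = policy(R @ eef, R @ goal)
# Equivariance: acting in the rotated frame equals rotating the action.
equivariant = np.allclose(a_rot, R @ a)
```

Equivariant VLA decoders enforce this kind of commutation analytically for the full SE(3) frame redefinitions of base, camera, and end effector, rather than relying on the model to learn it from data.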
3. Pretraining Pipelines, Objectives, and Datasets
a) Dataset Design and Sampling
A critical driver of cross-embodiment generalization is the construction of diverse, large-scale multi-embodiment datasets:
- Collecting millions of transitions across 4–50 unique robots in simulation and reality (Yang et al., 13 Jun 2025, Bohlinger et al., 2 Sep 2025, Collaboration et al., 2023).
- Cross-sampling and synthetic pairing for extension to multi-arm settings without additional collection (Li et al., 3 Nov 2025).
- Inclusion of human demonstrations and egocentric video in a common coordinate frame, allowing for human-to-robot transfer (Luo et al., 19 Jan 2026, Niu et al., 19 Jun 2025, Lee et al., 26 Nov 2025).
b) Pretraining Objectives
Pretraining is typically accomplished via:
- Behavior cloning (maximum-likelihood on action/state sequences) in a “pool-all-embodiments” dataset (Yu et al., 2022, Collaboration et al., 2023, Lin et al., 30 Nov 2025).
- Intrinsic reward maximization under a controlled embodiment MDP, maximizing discriminability or information-theoretic diversity across morphologies, as in the PEAC approach (2405.14073).
- Unsupervised or reward-free objectives using discriminators to drive embodiment-aware exploration or skill discovery (2405.14073).
- Flow-matching, stochastic interpolant, or diffusion-policy losses for high-dimensional continuous control (Yang et al., 13 Jun 2025, Tan et al., 28 Oct 2025).
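As one concrete instance of the last bullet, a flow-matching objective regresses a velocity field along a straight-line interpolant between noise and expert actions. The sketch below is schematic (papers differ in interpolant, conditioning, and model class) and uses an untrained placeholder model:

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(v_theta, actions):
    """One evaluation of a flow-matching loss on a batch of expert actions.

    Linear interpolant between Gaussian noise x0 and expert actions x1;
    the model v_theta(x_t, t) regresses the constant target velocity x1 - x0.
    """
    x1 = actions                              # expert actions (pooled dataset)
    x0 = rng.standard_normal(x1.shape)        # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))    # interpolation times in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1             # point on the straight-line path
    target = x1 - x0                          # interpolant velocity
    return np.mean((v_theta(x_t, t) - target) ** 2)

# Placeholder (untrained) velocity model, purely for illustration:
v_theta = lambda x_t, t: np.zeros_like(x_t)
batch = rng.standard_normal((32, 7))          # 7-DoF action batch
loss = flow_matching_loss(v_theta, batch)
```

At deployment, actions are sampled by integrating the learned velocity field from noise, which is what makes these objectives attractive for high-dimensional continuous control.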
c) Meta-Learning and Fine-Tuning
Meta-learning frameworks (episodic few-shot regimes) support explicit task and embodiment adaptation by optimizing a shared backbone plus per-embodiment adaptation parameters (Cho et al., 2024). Few-shot fine-tuning or synthetic continued pretraining provide efficient adaptation to unseen robots or collaborative multi-robot settings (Li et al., 3 Nov 2025, Lin et al., 30 Nov 2025).
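A stripped-down version of the shared-backbone plus per-embodiment-adaptation idea: freeze pretrained features and fit only a small head on a handful of demonstrations (illustrative; Meta-Controller's actual matching-based adaptation is more elaborate):

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "pretrained" backbone (random features here, for illustration).
W_backbone = rng.standard_normal((64, 10))

def features(obs):
    """Shared representation reused across embodiments."""
    return np.tanh(obs @ W_backbone.T)

def adapt_head(demo_obs, demo_actions, reg=1e-3):
    """Fit only a per-embodiment linear head on a few demos (ridge solve).

    Stands in for the per-embodiment adaptation parameters optimized
    in episodic few-shot regimes.
    """
    Phi = features(demo_obs)                          # (n, 64)
    A = Phi.T @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ demo_actions)   # (64, action_dim)

# Five demonstrations from a previously unseen embodiment:
obs = rng.standard_normal((5, 10))
acts = rng.standard_normal((5, 7))
W_head = adapt_head(obs, acts)
pred = features(obs) @ W_head
```

Because only the head is re-fit, adaptation cost scales with the few-shot demo count rather than with pretraining scale.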
4. Network Architectures and Algorithmic Innovations
Multiple network abstractions have been proposed:
- Modular and attention-based policy networks (e.g., URMAv2 with joint-wise attention for input/output, modular transformers with per-modality tokenization) (Bohlinger et al., 2 Sep 2025, Niu et al., 19 Jun 2025).
- Mixture-of-Experts and Mixture-of-Flow architectures, routing between generalist and specialist policy components, controlled by input gating and manifold-preserving metrics for robust deployment (Luo et al., 19 Jan 2026).
- Diffusion transformers (DiT), flow decoders in learned geometric (trace) space, and hybrid generative–residual policies that combine generative priors with RL residuals (Yang et al., 13 Jun 2025, Lee et al., 26 Nov 2025, Tan et al., 28 Oct 2025).
- Equivariant decoders and geometry-aware attention mechanisms for action representation and policy output (Chen et al., 18 Sep 2025).
- Meta-Controller’s structure–motion state encoding, combining shared and per-embodiment adaptation weights for sample-efficient few-shot imitation (Cho et al., 2024).
5. Empirical Results and Comparative Impact
Extensive empirical evaluation establishes the effectiveness of multi-embodiment pretraining:
- The RT-X family (multi-robot transformers) demonstrates a 50% improvement over single-domain baselines on data-scarce robots and enables emergent cross-robot behaviors: tasks annotated for one platform are executed on another with no explicit adaptation (Collaboration et al., 2023).
- Being-H0.5 achieves near-saturation on simulation (LIBERO 98.9%) and strong real-world cross-platform performance (80–98% across spatial, long-horizon, and bimanual tasks), with pronounced gains in few-shot adaptation and robustness to sensory drift when using manifold-preserving gating (Luo et al., 19 Jan 2026).
- TraceGen, using trace-space video modeling, enables cross-embodiment and human-to-robot transfer: five human phone videos suffice to achieve a two-thirds success rate on a real robot, highlighting the utility of geometric abstraction and large-scale multi-embodiment pretraining (Lee et al., 26 Nov 2025).
- H-Zero and URMAv2 achieve rapid adaptation and strong zero-shot performance in locomotion, with cross-embodiment pretraining retaining up to 81% of performance on novel robots and enabling fine-tuning roughly 20× faster than training from scratch (Lin et al., 30 Nov 2025, Bohlinger et al., 2 Sep 2025).
- In Meta-Controller, the matching-based adaptation network reaches an average few-shot imitation score of 71.1 on completely unseen embodiment–task pairs, roughly double the best modular policy-learning or transformer baselines in continuous control (Cho et al., 2024).
- Human2LocoMan and ET-VLA show that human or synthetic pretraining can double or even quadruple success rates: pretraining with human data yields a +38.6% absolute gain overall and +82.7% on out-of-distribution quadruped manipulation (Niu et al., 19 Jun 2025), while synthetic continued pretraining with cross-sampled action tokens improves bimanual collaborative success by 53.2 points over the VLA-only baseline (Li et al., 3 Nov 2025).
6. Theoretical Analyses, Ablations, and Limitations
Several theoretical and empirical threads clarify the benefits and boundaries of multi-embodiment pretraining:
- Representational diversity, rather than increased capacity or mere vocabulary coverage, is the principal driver of performance in concatenated multi-embedding architectures (Lester et al., 2020).
- In reinforcement learning, worst-case adaptation theory under KL constraints identifies the optimal unsupervised pretraining objective as minimizing the cross-embodiment trajectory KL divergence. This yields an explicit intrinsic reward for cross-embodiment skill discovery (2405.14073).
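A schematic form of this objective (notation ours; the cited paper's worst-case formulation differs in detail): when the policy is shared across embodiments, the trajectory-level KL reduces to a sum of per-step dynamics log-ratios, which can be read off as an intrinsic reward and estimated with a learned discriminator.

```latex
% Shared policy \pi across embodiments e, e'; only the dynamics differ,
% so the policy terms cancel in the log-ratio.
D_{\mathrm{KL}}\big(p_\pi(\tau \mid e)\,\big\|\,p_\pi(\tau \mid e')\big)
\;=\; \mathbb{E}_{\tau \sim p_\pi(\cdot \mid e)}
  \Big[\sum_t \log \frac{p(s_{t+1} \mid s_t, a_t, e)}
                        {p(s_{t+1} \mid s_t, a_t, e')}\Big]

% Minimizing this KL suggests the per-step intrinsic reward
r^{\mathrm{int}}_t \;=\; -\log \frac{p(s_{t+1} \mid s_t, a_t, e)}
                                    {p(s_{t+1} \mid s_t, a_t, e')}
```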
- Ablations confirm that specific modular and equivariant architectural choices—attention over morphological descriptors, per-joint tokenization, geometry-aware embedding—are essential for positive transfer; monolithic or aggregated policies without explicit modularization frequently fail to benefit from cross-embodiment data (Niu et al., 19 Jun 2025, Cho et al., 2024, Chen et al., 18 Sep 2025).
- The effectiveness of multi-embodiment pretraining is contingent on dataset diversity and scale; performance degrades on outlier morphologies outside the convex hull of sampled data (Lin et al., 30 Nov 2025), and insufficient data on large-action-space robots can lead to underfitting in large mixture models (Collaboration et al., 2023).
- Explicit conditioning, masking, and retargeting are necessary to enable generalization to embodiments with disjoint actuation or sensing sets and to handle missing modalities (Luo et al., 19 Jan 2026, Niu et al., 19 Jun 2025).
7. Open Challenges and Prospects
- Extending to truly out-of-class embodiments (e.g., dexterous hands, mobile bases, multi-arm systems) remains an unsolved problem (Collaboration et al., 2023, Luo et al., 19 Jan 2026).
- Further compositionality (integration of tactile, audio, and other sensory inputs) and adaptive curriculum schedules may advance foundation-level generalization (Bohlinger et al., 2 Sep 2025, Luo et al., 19 Jan 2026).
- Integration of planning modules, improved temporal abstraction, and hierarchical skill discovery could facilitate longer-horizon, multi-agent, or collaborative generalization (Li et al., 3 Nov 2025, Chen et al., 18 Sep 2025).
- Quantitative metrics for the “embodiment gap” and automated diagnostics of transfer failure require further study (Collaboration et al., 2023, Lin et al., 30 Nov 2025).
In summary, multi-embodiment pretraining enables the construction of policies and representations that generalize across diverse robots, tasks, and sensor–actuator configurations. Success relies on careful action/observation alignment, explicit morphological conditioning, architectural modularity, and large-scale, heterogeneously-structured datasets. Substantial improvements in adaptation speed, sample efficiency, and cross-platform robustness are now attainable, driving the emergence of generalist policies in robotics and embodied AI.