Unified Representations for Cross-Embodiment Transfer
- The paper introduces a unified representation framework that abstracts kinematic and dynamic specifics, allowing policies to transfer across heterogeneous robots.
- It leverages contrastive learning, vector quantization, and phase-aware regularization to align actions, observations, and reward structures across diverse embodiments.
- Empirical results show significant improvements in sample efficiency and zero/few-shot transfer, demonstrating robust performance across multiple robotics benchmarks.
Unified representations for cross-embodiment transfer provide a principled approach to leveraging data and control policies across robots with heterogeneous morphologies, action spaces, and observation modalities. They address the core challenge of kinematic and dynamic diversity that historically limits transfer, scalability, and data efficiency in robot learning. Such representations abstract away embodiment-specific details to enable a single model, policy, or world model to operate effectively over a wide variety of agents, including humans, anthropomorphic hands, parallel-jaw grippers, manipulators, and legged robots.
1. Motivation and Conceptual Foundations
The need for unified representations emerges from the substantial embodiment gap that hinders transferability of policies, rewards, or skills between agents differing in morphology, kinematics, and sensing (He et al., 3 Nov 2025, Luo et al., 19 Jan 2026). Early approaches relying on shared-private architectures or direct mapping of joint torques between robots proved inadequate due to the combinatorial diversity of robots and the lack of scalable data collection. The central tenet is to decouple embodiment-agnostic task structure, environmental dynamics, and high-level semantic knowledge from the idiosyncrasies of individual embodiments.
Core principles underlying unified representations include:
- Embodiment-invariance: Abstracting state, action, and affordance information so that the same policy or model can interface with heterogeneous agents (He et al., 3 Nov 2025, Zhi et al., 14 Feb 2026, Yan et al., 21 Jan 2026).
- Semantic and functional alignment: Aligning skill, trajectory, reward, or outcome spaces to leverage shared functional similarity across different embodiments (Wu et al., 14 Jan 2026, Aktas et al., 2024).
- Data efficiency and scalability: Enabling few-shot or zero-shot transfer, rapid adaptation, or joint training across vast sets of robots with minimal per-embodiment fine-tuning (Zhi et al., 14 Feb 2026, Mu et al., 15 Mar 2026, Tan et al., 28 Oct 2025).
2. Model Architectures and Representation Learning
Many approaches have been proposed to instantiate unified representations. They differ in abstraction level, representation medium (latent codes, natural language, geometric structures), and the kind of model used for policy, prediction, or reward functions.
- Latent Space Construction: Approaches such as contrastively-aligned latent action spaces (Bauer et al., 17 Jun 2025, Yan et al., 21 Jan 2026) and vector-quantized codes for action motifs (Zhi et al., 14 Feb 2026) construct a shared embedding in which actions or skills that are semantically or functionally equivalent are mapped closely together, independent of embodiment.
- Canonical Structures: Several methods, such as UniMorphGrasp (Wu et al., 31 Jan 2026), map all hand poses into a canonical human-like, high-DoF configuration, where missing DoFs are masked, and structured graph encodings capture kinematics.
- Functional Geometry and Point-Cloud Abstractions: Geometric approaches, e.g., CEI (Wu et al., 14 Jan 2026) and particle-based world models (He et al., 3 Nov 2025), reduce diverse arms and hands to point clouds and their motions to displacement fields, establishing a continuous embodiment-agnostic interface.
- Language-Centric Interfaces: Language-Action Pre-training (LAP) (Zha et al., 11 Feb 2026) and large MLLM-based vision-language-action frameworks (Tan et al., 28 Oct 2025, Luo et al., 19 Jan 2026) align actions and policy outputs to the natural language modality, which supports conditioning and transfer across agents and tasks.
- Phase and Periodicity Structures: PHASOR (Kim et al., 1 Jun 2026) factorizes motion into a cyclic phase manifold (anchored using human dynamics) and a pose branch, producing an interpretable, periodic action space across humanoids.
A summary of key approaches:
| Approach | Representation Type | Embodiment Handling |
|---|---|---|
| MOTIF (Zhi et al., 14 Feb 2026) | VQ motifs (action chunks) | Progress/adv. alignment, motif VQ |
| UniMorphGrasp (Wu et al., 31 Jan 2026) | Canonical hand config, graph | Universal DoFs, hierarchical graph mask |
| CEI (Wu et al., 14 Jan 2026) | Functional geometry (points) | DCD alignment, trajectory optimization |
| Latent Diffusion (Bauer et al., 17 Jun 2025) | Contrastive latent spaces | Paired encoders, InfoNCE objective |
| LAP (Zha et al., 11 Feb 2026) | Language-formatted actions | Language modality alignment |
| PHASOR (Kim et al., 1 Jun 2026) | Phase manifold + pose branch | Human/robot phase anchoring |
| OPFA (Mu et al., 15 Mar 2026) | Point-cloud+transformer latent | Shared encoder, universal decoder |
| OnePolicy-Fits-All | 3D conv/transformers/action latent | Universal decoder, latent retargeting |
| BLM₁/Being-H | MLLM intent embedding | Perceiver bottleneck, diffusion head |
| UniT (Chen et al., 21 Apr 2026) | Discrete action/vision tokens | Tri-branch, codebook, cross-recon |
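The InfoNCE objective listed for contrastive latent spaces can be illustrated with a minimal NumPy sketch. This is not the cited authors' implementation: the function name `info_nce`, the temperature value, the symmetric form of the loss, and the toy "human" and "robot" embeddings are all assumptions made for illustration. Row i of one batch is treated as the positive for row i of the other, and every other row acts as a negative.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Symmetric InfoNCE over paired embeddings (row i of z_a pairs with row i of z_b)."""
    # L2-normalise so that the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature  # shape (N, N)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax over each row; the target is the diagonal.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Hypothetical data: two embodiments whose embeddings share the same underlying skill code.
rng = np.random.default_rng(0)
shared = rng.normal(size=(8, 16))
z_human = shared + 0.01 * rng.normal(size=shared.shape)
z_robot = shared + 0.01 * rng.normal(size=shared.shape)

aligned_loss = info_nce(z_human, z_robot)         # correct pairing
shuffled_loss = info_nce(z_human, z_robot[::-1])  # deliberately wrong pairing
print(aligned_loss, shuffled_loss)
```

The loss is low when functionally equivalent actions from the two embodiments land close together and high when the pairing is broken, which is the property the shared latent space relies on.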
3. Training and Cross-Embodiment Alignment Mechanisms
Unified representations are only effective if they are appropriately aligned across embodiments. The mechanisms for alignment are technically diverse:
- Contrastive Learning and Triplet Losses: Sampling anchor-positive-negative tuples across segments and robots (e.g., a human left arm and a robot right arm) induces alignment under task-specific or joint-tailored similarity metrics (Yan et al., 21 Jan 2026, Bauer et al., 17 Jun 2025).
- Information Bottlenecks and VQ-VAEs: Vector quantization (VQ-VAE) and codebook assignments (e.g. in MOTIF (Zhi et al., 14 Feb 2026) and UniT (Chen et al., 21 Apr 2026)) ensure discrete, noise-resilient compression of action and/or visual features, which are directly shared across robots.
- Adversarial and Domain Confusion Losses: Domain-adversarial training (e.g., MOTIF's gradient-reversed robot ID discriminator (Zhi et al., 14 Feb 2026)) pushes the embedding to carry too little information to predict the embodiment, filtering out device-specific signatures.
- Progress/Phase-aware Regularization: MOTIF's progress-aware alignment and PHASOR's phase structure synchronize motifs or actions by normalized time or phase, which is crucial for temporally consistent transfer (Kim et al., 1 Jun 2026, Zhi et al., 14 Feb 2026).
- Self-Supervised and Cycle Consistency: Trajectory-based and temporal cycle consistency strategies support trajectory alignment and reward function transfer even from unlabeled and mixed-quality demonstrations (Mattson et al., 2024, Xu et al., 2023).
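The codebook assignment step behind vector quantization can be sketched as a nearest-neighbour lookup. The example below is a simplified illustration under stated assumptions, not the MOTIF or UniT implementation: the codebook values, the `quantize` helper, and the toy arm and hand latents are hypothetical, and the training-time parts of a VQ-VAE (straight-through gradients, commitment loss, codebook updates) are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    """Assign each latent vector to its nearest codebook entry (squared L2 distance)."""
    # Broadcast (N, 1, D) against (1, K, D) to get an (N, K) distance matrix.
    dist2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dist2.argmin(axis=1)
    return indices, codebook[indices]

# Hypothetical shared codebook with three action "motifs" in a 2-D latent space.
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0]])

# Hypothetical encoder outputs for two different embodiments.
arm_latents = np.array([[0.9, 0.1], [0.1, 0.05]])
hand_latents = np.array([[1.1, -0.1], [-0.05, 0.9]])

arm_idx, arm_q = quantize(arm_latents, codebook)
hand_idx, hand_q = quantize(hand_latents, codebook)
print(arm_idx, hand_idx)
```

In this toy setup the first latent of each embodiment snaps to the same code even though the continuous values differ slightly, which illustrates how a shared discrete codebook can absorb small embodiment-specific variations.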
4. Policy Architectures and Zero/Few-Shot Transfer
Unified representations are typically used to condition diffusion-based or transformer-based policy heads, enabling robust cross-embodiment generalization:
- Conditional Diffusion Policies: Action generation operates in a shared latent space, denoising from noise while conditioning on visual, language, and task inputs as well as latent action cues, often with classifier-free or tree-guided morphological guidance (Bauer et al., 17 Jun 2025, Li et al., 24 May 2026, Zhi et al., 14 Feb 2026).
- Retargeting and Decoding: OPFA (Mu et al., 15 Mar 2026), UniMorphGrasp (Wu et al., 31 Jan 2026), and latent action diffusion policies decode shared latent actions to each robot via a universal decoder, potentially zero-padding or masking missing actions.
- High-level vs. Low-level Conditioning: Policy heads may receive condensed intent (from MLLM Perceiver bottlenecks), action motifs, latent skill codes, or even natural-language formatted action templates (LAP (Zha et al., 11 Feb 2026)), supporting both low-level control and high-level reasoning.
- Reward and Value Learning: In reward learning, shared latent spaces aligned with human feedback yield embodiment-invariant reward estimators for RL policies (Mattson et al., 2024).
- Zero- or Few-Shot Results: MOTIF reports a 43.7% real-world improvement over strong baselines in 5-shot transfer; LAP-3B achieves >50% zero-shot success on previously unseen robots; OPFA reports gains of over 50% in its best cross-embodiment co-training settings (Zhi et al., 14 Feb 2026, Zha et al., 11 Feb 2026, Mu et al., 15 Mar 2026).
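The masking and zero-padding mentioned under retargeting and decoding amounts to simple bookkeeping between a canonical action layout and each robot's own degrees of freedom. The sketch below shows only that bookkeeping. The cited systems use learned decoders, and the six-slot canonical layout, the mask, and the helper names `decode_for_robot` and `embed_from_robot` are hypothetical.

```python
import numpy as np

def decode_for_robot(canonical_action, dof_mask):
    """Keep only the canonical action entries that this robot actually has."""
    mask = np.asarray(dof_mask, dtype=bool)
    return np.asarray(canonical_action, dtype=float)[mask]

def embed_from_robot(robot_action, dof_mask):
    """Zero-pad a robot-specific action back into the canonical layout."""
    mask = np.asarray(dof_mask, dtype=bool)
    canonical = np.zeros(mask.shape[0])
    canonical[mask] = robot_action
    return canonical

# Hypothetical 6-slot canonical action; this gripper lacks slots 3 and 4.
gripper_mask = [1, 1, 1, 0, 0, 1]
canonical = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])

gripper_cmd = decode_for_robot(canonical, gripper_mask)
round_trip = embed_from_robot(gripper_cmd, gripper_mask)
print(gripper_cmd)
print(round_trip)
```

Keeping the mask explicit means a loss or a decoder can ignore the padded slots instead of treating the zeros as real commands.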
5. Applications and Empirical Performance
Empirical results across simulation and real-world robots confirm the effectiveness of unified representations for cross-embodiment transfer:
| Benchmark | Method | Reported Result (gain or success rate) | Setting/Scope |
|---|---|---|---|
| ManiSkill/ARX5-Piper | MOTIF | +43.7% (5-shot) | Real 5-shot transfer |
| RoboCasa/Isaac Gym | X-DiffVLA | +15.3%, +12.5% | Grippers/hands |
| LIBERO, DROID, Robocasa | LAP-3B | >50% zero-shot | >30 robot platforms |
| MultiDex/Shadow Hand | UniMorphGrasp | 82–98% | Removal/scaling |
| Robosuite, UR5/XHand | CEI | 82.4% transfer ratio | 16 emb. in sim/real |
| Pick-and-place, 2–3 hands | Latent Diff/OPFA | up to +13% (real) | Real multi-hand setup |
| LIBERO/Kitchen/Franka | UniSkill | 81–94% (robot), 36–87% (human prompt) | Real sim/human video |
Task domains include dexterous grasping, kitchen manipulation, deformable object shaping, spatial goal reaching, and imitation from human prompts. These methods consistently outperform baselines that lack unified, cross-embodiment representations and demonstrate scalability (e.g., OPFA: 11 different end-effectors, Being-H0.5: 30 robots/35K hours of data), sample efficiency (few-shot transfer), and compositional generalization (UniSkill, XSkill).
6. Limitations and Open Challenges
Despite their effectiveness, unified representations have several limitations and leave open research challenges:
- Scaling to Arbitrary Morphologies: Handling robots with fundamentally different topologies or unconventional action spaces may require more adaptive or modular latent spaces (Mu et al., 15 Mar 2026, Wu et al., 31 Jan 2026).
- Embodiment-Specific Artifacts: Without careful regularization (e.g., adversarial constraints or codebook sharing), latent spaces may still leak embodiment details; this is partially mitigated by geometric or adversarial alignment (Zhi et al., 14 Feb 2026, Bauer et al., 17 Jun 2025).
- Data Imbalance and Observation Gaps: Asymmetries in available modalities, such as missing camera views or sensor types, degrade cross-embodiment performance (Bauer et al., 17 Jun 2025, Tan et al., 28 Oct 2025).
- Realism and Physicality: Point-cloud or particle representations may be sensitive to calibration, sensor drift, or missing geometric cues (Huang et al., 12 Nov 2025).
- Skill Compositionality and Temporal Segmentation: Most approaches segment time at fixed windows or progress phases; adaptive or hierarchical skill representations remain a research frontier (Kim et al., 13 May 2025, Xu et al., 2023).
- Reward and Affordance Transfer: Aligning reward functions or affordance representations from mixed-quality or multi-agent data requires human feedback or preference annotations, as purely unsupervised approaches may collapse on difficult or mixed datasets (Mattson et al., 2024, Aktas et al., 2024).
7. Future Directions
The research landscape is rapidly expanding toward universal, scalable frameworks for cross-embodiment robotic generalization:
- Generalist World Models: Particle-based or visual-token world models suggest a path to sequential, generative simulation for planning and policy learning agnostic to embodiment (He et al., 3 Nov 2025, Chen et al., 21 Apr 2026).
- Language and Multimodal Grounding: Integration of action, intention, and semantic information through language or MLLM backbones (LAP, BLM₁, Being-H) will further facilitate transfer by leveraging web-scale data and open-world semantics (Zha et al., 11 Feb 2026, Tan et al., 28 Oct 2025, Luo et al., 19 Jan 2026).
- Phase and Temporal Structures: Improved periodic, phase-based, or hierarchical latent encodings may capture richer compositional and feedback-dependent motion (Kim et al., 1 Jun 2026).
- Affordances and Causality: Formalizing and encoding cross-agent affordance equivalence supports robust prediction, planning, and imitation beyond kinematic or morphological alignment (Aktas et al., 2024).
- Extensibility and Specialization: Future frameworks will need mechanisms for continual, modular addition of new robots or effectors, without catastrophic forgetting or deterioration of cross-embodiment priors (Yan et al., 21 Jan 2026, Mu et al., 15 Mar 2026, Luo et al., 19 Jan 2026).
These developments collectively point toward the emergence of generalist, architecture-agnostic foundation models for physical intelligence, capable of integrating, transferring, and controlling across the full spectrum of human and robotic embodiments.