- The paper introduces a diffusion framework that overcomes the embodiment gap by aligning diverse action representations in a shared latent space using contrastive and reconstruction losses.
- It employs separate encoders and decoders for different end-effectors, enabling successful skill transfer and improved pick-and-place performance across varied robotic morphologies.
- The framework demonstrates up to a 13% improvement in success rate, though challenges such as retargeting dependency and asymmetric observation spaces remain.
This paper, "Latent Action Diffusion for Cross-Embodiment Manipulation" (2506.14608), addresses the challenge of enabling robots with different end-effectors (like dexterous hands and parallel grippers) to learn from shared data and transfer skills. The core problem is the heterogeneity of action spaces across embodiments, which creates an "embodiment gap" that hinders traditional cross-embodiment learning approaches. The authors propose a novel framework using a learned latent action space and diffusion policies to overcome this.
The practical implementation involves a four-stage pipeline:
- Generating Aligned Action Pairs: The first step is to create paired action data across different end-effectors that are semantically aligned. This is achieved by leveraging existing retargeting functions, typically used to map human hand poses to robot hand poses for teleoperation. For example, a human hand pose θ_H can be retargeted to a robot hand pose via a function f_R, yielding pairs (θ_H, f_R(θ_H)). The paper uses a method based on aligning "keyvectors" (vectors between the palm and the fingertips, or between fingertips) across hands, optimizing robot joint angles to minimize the difference in keyvectors after scaling by finger length ratios. For a parallel gripper, a heuristic is used based on the minimum keyvector originating from the human thumb, normalized to a gripper width range [0, 1]. Action representations vary by embodiment: 189 dimensions for human hand joints (using a 6D rotation representation), 11 or 16 dimensions for dexterous robot hands (Faive, mimic) as joint angles, and 1 dimension for a parallel gripper (normalized width).
- Implementation Detail: This stage requires retargeting mechanisms for the specific embodiments in use, either pre-existing or newly developed. The keyvector approach can be implemented by modeling the forward kinematics of each hand/gripper and setting up an optimization problem (for dexterous hands) or a heuristic (for grippers) over the resulting keypoints; a minimal sketch appears after this list.
- Learning a Cross-Embodiment Latent Action Space: Modality-specific encoders are trained to map the diverse explicit action representations from different end-effectors into a shared latent space. These encoders are implemented as Multi-Layer Perceptrons (MLPs). The learning objective is a pairwise InfoNCE loss [20] applied to batches of aligned action pairs. This contrastive loss pulls the latent representations of corresponding actions closer together while pushing non-corresponding representations apart, ensuring semantic alignment in the latent space.
- Implementation Detail: Implement a separate MLP encoder for each embodiment. The input dimension of each encoder matches the explicit action dimension of its embodiment (e.g., 189 for the human hand, 11/16 for the dexterous hands, 1 for the gripper), and the output dimension is the chosen latent dimension (e.g., 128 or 16 in the experiments). The InfoNCE loss requires batches of aligned pairs. The temperature parameter τ in the InfoNCE loss (Equation 4) controls the sharpness of the softmax distribution and is crucial for finding good representations; the paper suggests annealing it during training. A contrastive-training sketch appears after this list.
- Training Modality-Specific Decoders + Fine-Tuning Encoders: After the latent space is initially structured by the contrastive loss, modality-specific decoders (also MLPs) are trained to reconstruct the original explicit actions from their latent representations. Simultaneously, the encoders are fine-tuned with a lower learning rate. The total loss for this stage combines a reconstruction loss (L2 loss between decoded and ground-truth actions) and the contrastive loss from the previous stage (weighted by a hyperparameter λ). This two-step process helps ensure that the latent space not only aligns semantically but also retains sufficient information for accurate reconstruction of the original actions for each embodiment.
- Implementation Detail: Implement a separate MLP decoder for each embodiment. The input dimension of each decoder is the latent dimension, and the output dimension matches the explicit action dimension. The combined loss is L_total = L_recon + λ·L_contrastive (Equation 6). Training optimizes both encoder and decoder parameters, but fine-tuning the encoders with a smaller learning rate stabilizes training and prevents the encoders' alignment from collapsing during reconstruction training. The choice of λ balances reconstruction fidelity against latent space alignment; see the combined-loss sketch after this list.
- Cross-Embodiment Latent Space Imitation Learning: With the learned encoders and decoders defining the latent space, a single diffusion policy is trained to predict latent actions from observations. This policy is embodiment-agnostic as it operates solely within the shared latent space. The policy takes observations (like external RGB camera images and potentially robot arm pose) and predicts a denoised version of a noisy latent action, following the principles of Diffusion Policy [21]. The policy architecture can be Transformer-based [22] or U-Net-based [21]. During inference on a specific robot embodiment, the policy predicts a latent action, which is then passed through the corresponding modality-specific decoder to obtain the explicit action commands for that robot.
- Implementation Detail: Choose a diffusion policy architecture (Transformer or U-Net). The policy network takes observation tokens (e.g., image embeddings from a ResNet or Vision Transformer [25, 27], arm pose embeddings) and noisy latent action tokens, along with a diffusion timestep token, as input; the output is a predicted denoised latent action. The policy is conditioned on observations via mechanisms such as FiLM layers or token concatenation within a Transformer. Training data from multiple embodiments are combined by projecting their explicit actions into the latent space using the learned encoders, and batches can be formed by sampling sub-batches from the different datasets proportional to their assigned weights. At inference, after the policy outputs a latent action, the decoder corresponding to the robot being controlled converts it back into that robot's joint angles or gripper width; see the inference sketch at the end of the examples below.
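The sketches below illustrate each stage in PyTorch. First, stage 1: keyvector-based pair generation. The forward-kinematics routine `fk_robot`, the keypoint layout, the per-finger scale factors, and the gripper width range `(w_min, w_max)` are hypothetical stand-ins, so treat this as a sketch of the idea rather than the paper's exact retargeting code.

```python
import torch

def keyvectors(palm: torch.Tensor, fingertips: torch.Tensor) -> torch.Tensor:
    # Palm-to-fingertip vectors; the paper also uses fingertip-to-fingertip ones.
    return fingertips - palm.unsqueeze(0)                 # (F, 3)

def retarget_to_hand(palm_h, tips_h, fk_robot, n_joints, scale, steps=200, lr=1e-2):
    """Optimize robot joint angles so the robot's keyvectors match the human's,
    scaled by finger-length ratios. `fk_robot(q) -> (palm, fingertips)` is an
    assumed differentiable forward-kinematics function."""
    kv_h = keyvectors(palm_h, tips_h)
    q = torch.zeros(n_joints, requires_grad=True)
    opt = torch.optim.Adam([q], lr=lr)
    for _ in range(steps):
        palm_r, tips_r = fk_robot(q)
        loss = ((scale * kv_h - keyvectors(palm_r, tips_r)) ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()
    return q.detach()                                     # robot action paired with the human pose

def retarget_to_gripper(tips_h, thumb_idx=0, w_min=0.0, w_max=0.09):
    """Gripper heuristic: the minimum keyvector length originating at the thumb,
    normalized to [0, 1]; (w_min, w_max) is an assumed calibration range."""
    d = (tips_h - tips_h[thumb_idx]).norm(dim=-1)         # thumb-to-fingertip distances
    d_min = d[torch.arange(len(tips_h)) != thumb_idx].min()
    return ((d_min - w_min) / (w_max - w_min)).clamp(0.0, 1.0)
```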
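Stage 2: contrastive alignment of the per-embodiment encoders. This uses a standard symmetric InfoNCE formulation with cosine similarities; the latent normalization, network widths, and annealing schedule for τ are assumptions, and the random tensors stand in for real retargeted pairs from stage 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_out, d_hidden=256):
    return nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_hidden), nn.ReLU(),
                         nn.Linear(d_hidden, d_out))

def info_nce(z_a, z_b, tau):
    """Symmetric InfoNCE: matching rows of z_a/z_b are positives, all other
    rows in the batch serve as negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / tau                            # (B, B) similarity matrix
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

latent_dim = 128                                          # 128 or 16 in the experiments
enc_human, enc_hand = mlp(189, latent_dim), mlp(16, latent_dim)
opt = torch.optim.Adam([*enc_human.parameters(), *enc_hand.parameters()], lr=3e-4)

for step in range(10_000):
    theta_h, theta_r = torch.randn(256, 189), torch.randn(256, 16)  # stand-in pairs
    tau = max(0.5 * 0.9995 ** step, 0.02)                 # assumed annealing schedule
    loss = info_nce(enc_human(theta_h), enc_hand(theta_r), tau)
    opt.zero_grad(); loss.backward(); opt.step()
```

With more than two embodiments, the same pairwise loss can simply be summed over all embodiment pairs in the batch.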
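Stage 3, extending the previous sketch: decoders are trained under L_total = L_recon + λ·L_contrastive while the encoders are fine-tuned in a lower-learning-rate parameter group. The values of λ and τ, and the `pair_loader` data, are placeholders.

```python
dec_human, dec_hand = mlp(latent_dim, 189), mlp(latent_dim, 16)
opt = torch.optim.Adam([
    {"params": [*dec_human.parameters(), *dec_hand.parameters()], "lr": 3e-4},
    {"params": [*enc_human.parameters(), *enc_hand.parameters()], "lr": 3e-5},  # fine-tune slowly
])
lam, tau = 0.1, 0.05                                      # assumed λ and final temperature
pair_loader = [(torch.randn(256, 189), torch.randn(256, 16)) for _ in range(100)]  # stand-in

for theta_h, theta_r in pair_loader:                      # aligned pairs from stage 1
    z_h, z_r = enc_human(theta_h), enc_hand(theta_r)
    recon = F.mse_loss(dec_human(z_h), theta_h) + F.mse_loss(dec_hand(z_r), theta_r)
    loss = recon + lam * info_nce(z_h, z_r, tau)          # L_total = L_recon + λ·L_contrastive
    opt.zero_grad(); loss.backward(); opt.step()

# Sanity check before policy training: cross-reconstruction error, i.e. how well a
# human action decodes into the robot's action space through the shared latent.
cross_err = F.mse_loss(dec_hand(enc_human(theta_h)), theta_r)
```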
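Stage 4 at inference time: the embodiment-agnostic policy denoises a latent action chunk from Gaussian noise, and only the final decoding step is embodiment-specific. This DDPM-style sampler assumes a policy trained to predict the added noise ε, as in the original Diffusion Policy recipe; the policy signature, noise schedule, horizon, and step count are illustrative.

```python
import torch

@torch.no_grad()
def act(policy, obs_tokens, decoder, latent_dim=128, horizon=8, n_steps=16):
    """Sample a latent action chunk by iterative denoising, then map it to
    explicit commands with the decoder of the robot being controlled."""
    betas = torch.linspace(1e-4, 0.02, n_steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(1, horizon, latent_dim)               # start from pure noise
    for t in reversed(range(n_steps)):
        eps = policy(z, obs_tokens, torch.tensor([t]))    # predicted noise ε
        z = (z - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)  # DDPM posterior noise
    return decoder(z)                                     # joint angles or gripper width

# The same policy drives every embodiment; only the decoder is swapped, e.g.:
#   hand_actions = act(policy, obs_tokens, dec_hand)      # dexterous hand
#   grip_actions = act(policy, obs_tokens, dec_gripper)   # parallel gripper
```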
The paper validates this framework on pick-and-place tasks using a Franka parallel gripper and two dexterous hands (Faive and mimic). Experiments show that training a single policy in the latent space on a mix of data from different embodiments improves performance (up to a 13% increase in success rate) and robustness compared to single-embodiment policies, demonstrating successful skill transfer despite significant morphological differences.
Practical Considerations and Limitations:
- Retargeting Dependency: The quality of the learned latent space alignment depends heavily on the quality of the initial retargeting functions used to generate paired data. Developing robust retargeting for highly diverse embodiments can be challenging.
- Latent Space Properties: The latent space learned with contrastive losses does not inherently guarantee properties like smoothness, which could make it harder for the downstream policy to model. VAE-based approaches could offer regularization but were not used here. Empirically evaluating reconstruction and cross-reconstruction errors is crucial before policy training.
- Asymmetric Observations: The framework struggles with significant asymmetries in observations between embodiments (e.g., one robot having an extra camera view like a wrist camera, while another doesn't). This suggests that observation spaces also need careful consideration and potentially alignment or masking strategies for effective cross-embodiment learning.
- Asymmetric Dataset Sizes: Combining datasets with vastly different sizes remains a challenge. Large, diverse datasets like BridgeData V2 [23] or DexYCB [24] could not be effectively combined with smaller datasets in this framework due to visual differences and scale asymmetry.
- Computational Requirements: Training separate encoders and decoders for the action space and then training a large diffusion model policy requires substantial computational resources, especially with high-dimensional observation spaces (e.g., images) and diffusion steps.
- Hyperparameter Tuning: The training process involves several critical hyperparameters for both the contrastive action model (batch size, learning rates, weight decay, temperature schedule, λ, latent dimension, hidden dimensions) and the diffusion policy (batch size, horizon, noise schedule, learning rate schedule, optimizer, training steps). Tuning these can be time-consuming.
- Scalability: While demonstrated on three end-effectors, scaling to a much larger and more diverse set of robot morphologies would require developing retargeting or alignment mechanisms for all pairs or from a common source (like human data), and potentially larger model capacities.
Despite these limitations, the paper presents a practical and effective approach for tackling the action space embodiment gap, enabling a single policy to control multiple robots with diverse end-effectors and facilitating skill transfer through a learned, semantically aligned latent representation. The methodology provides a clear path for unifying data from heterogeneous robot systems for more efficient learning.