Diffusion Policy Co-training
- The paper introduces X-Diffusion, a diffusion framework that selectively integrates human and robot data to overcome cross-embodiment mismatches in training.
- The method employs a diffusion process combined with a classifier-based gating mechanism to filter out nonviable human trajectories for robot policy learning.
- Experimental results demonstrate significant gains, with task success rates reaching up to 100%, versus 40-50% for robot-only or naïve co-training methods.
Diffusion policy co-training refers to methods that leverage human and robot demonstration data jointly when training diffusion policy networks for robotics. While large-scale human demonstration videos are easily collected, direct use of human demonstrations for robot policy learning is fundamentally constrained by cross-embodiment mismatches in kinematics and dynamics. Naïve co-training—unifying human and robot datasets under a conventional behavior cloning loss—often degrades policy performance by encouraging physically infeasible or dynamically inappropriate actions in robotic actuators. The X-Diffusion framework exemplifies a principled approach for co-training diffusion-based policies on cross-embodiment demonstration data by exploiting the inherent structure of the diffusion process to selectively leverage human guidance only where it is compatible with robot capabilities (Pace et al., 6 Nov 2025).
1. Problem Setting and Motivation
The objective is to learn a diffusion policy $\pi_\theta(A \mid s)$, which predicts an $H$-step action sequence $A$ from the current state $s$. Two partially overlapping datasets are available:
- A small robot dataset $\mathcal{D}_R$ comprising expert tele-operated demonstrations on the robot hardware.
- A large human dataset $\mathcal{D}_H$ obtained by retargeting human hand motions into the robot's end-effector space.
Naïve co-training minimizes a behavior cloning objective pooled over both datasets,

$$L_{\text{naive}}(\theta) = \mathbb{E}_{(A,s)\sim \mathcal{D}_R \cup \mathcal{D}_H}\left[\ell\bigl(\pi_\theta; A, s\bigr)\right],$$
but is empirically found to perform suboptimally due to mismatches in data support; human demonstrations can induce trajectories that are impossible or suboptimal for the robot agent (Pace et al., 6 Nov 2025). Selective incorporation of human data conditioned on its alignment with robot capabilities is thus central to effective co-training.
2. Diffusion Processes for Sequence Modeling
Diffusion policies model sequence generation via a forward-noising and reverse-denoising process, extending denoising diffusion probabilistic models (DDPMs) to action trajectories. For each trajectory, the forward process generates noised actions

$$A_k = \sqrt{\bar{\alpha}_k}\,A_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,$$

where
- $\beta_1, \dots, \beta_K$ specifies a fixed noise schedule,
- $\alpha_k = 1 - \beta_k$, $\bar{\alpha}_k = \prod_{i=1}^{k} \alpha_i$,
- $\epsilon \sim \mathcal{N}(0, I)$, $k \in \{1, \dots, K\}$.
The reverse process is parameterized by a U-Net $\epsilon_\theta(k, A_k, s)$, predicting either the denoised target $A_0$ or the added noise $\epsilon$. Training minimizes the denoising loss

$$L_{\text{DDPM}}(\theta) = \mathbb{E}_{(A_0,s),\,k,\,\epsilon}\left[\bigl\|\epsilon - \epsilon_\theta(k, A_k, s)\bigr\|^2\right]$$

over all steps $k \in \{1, \dots, K\}$. This process enables learning policies capable of reconstructing precise sequential actions from highly perturbed initializations (Pace et al., 6 Nov 2025).
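The mechanics above reduce to a few lines of PyTorch. Below is a minimal sketch; the `eps_model(k, Ak, s)` interface and the linear-schedule endpoints are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def make_alpha_bars(K: int, beta_min: float = 1e-4, beta_max: float = 0.02) -> torch.Tensor:
    """Cumulative noise products alpha_bar_k for a linear beta schedule
    (endpoint values are common DDPM defaults, assumed here)."""
    betas = torch.linspace(beta_min, beta_max, K)
    return torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(eps_model, A0: torch.Tensor, s: torch.Tensor,
              alpha_bars: torch.Tensor) -> torch.Tensor:
    """Noise a clean action sequence A0 (shape [B, H, dim]) at a random
    step k and regress the added noise, conditioned on state s."""
    B = A0.shape[0]
    k = torch.randint(0, alpha_bars.shape[0], (B,), device=A0.device)
    ab = alpha_bars.to(A0.device)[k].view(B, *([1] * (A0.dim() - 1)))
    eps = torch.randn_like(A0)
    Ak = ab.sqrt() * A0 + (1.0 - ab).sqrt() * eps   # forward noising
    return F.mse_loss(eps_model(k, Ak, s), eps)     # epsilon-prediction objective
```

The same noising routine is reused by the classifier and the gated policy loss below.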
3. Cross-Embodiment Indistinguishability via Noising and Classification
To address the cross-embodiment gap, X-Diffusion proposes first training a binary classifier $c_\phi(k, A_k, s)$ that predicts whether a noised action sequence originated from a robot (label $1$) or a human (label $0$). The classifier is optimized using binary cross-entropy:

\begin{align*}
L_{\text{cls}}(\phi) = \; &\mathbb{E}_{(k, A, s)\sim\mathcal{D}_R}\left[-\log c_\phi(k, A_k, s)\right] \\
+\; &\mathbb{E}_{(k, A, s)\sim\mathcal{D}_H}\left[-\log\bigl(1 - c_\phi(k, A_k, s)\bigr)\right]
\end{align*}

with balanced sampling from $\mathcal{D}_R$ and $\mathcal{D}_H$. Increasing the noise level $k$ suppresses low-level execution details; thus, for sufficiently high $k$, even human-origin demonstrations become indistinguishable from robot data.
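A minimal sketch of this classifier objective, assuming `c_model` returns raw logits (so the cross-entropy is computed in its numerically stable logit form) and reusing the noising step from the earlier snippet:

```python
import torch
import torch.nn.functional as F

def classifier_loss(c_model, A_r, s_r, A_h, s_h, alpha_bars):
    """BCE for the origin classifier on a balanced batch: noised robot
    sequences get label 1, noised human sequences label 0."""
    def noised(A0):
        B = A0.shape[0]
        k = torch.randint(0, alpha_bars.shape[0], (B,), device=A0.device)
        ab = alpha_bars.to(A0.device)[k].view(B, *([1] * (A0.dim() - 1)))
        return k, ab.sqrt() * A0 + (1.0 - ab).sqrt() * torch.randn_like(A0)

    k_r, Ak_r = noised(A_r)
    k_h, Ak_h = noised(A_h)
    logit_r = c_model(k_r, Ak_r, s_r)   # robot-origin: target 1
    logit_h = c_model(k_h, Ak_h, s_h)   # human-origin: target 0
    return (F.binary_cross_entropy_with_logits(logit_r, torch.ones_like(logit_r))
            + F.binary_cross_entropy_with_logits(logit_h, torch.zeros_like(logit_h)))
```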
The minimum indistinguishability step for a human trajectory $A$ is defined as

$$k^*(A) = \min\bigl\{k \in \{1, \dots, K\} : c_\phi(k, A_k, s) \geq 0.5\bigr\}.$$

This encodes the earliest point in the forward diffusion process at which the source of the trajectory cannot be reliably determined.
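Under this definition, $k^*(A)$ can be estimated by sweeping the noise levels, as in the sketch below; the single noise draw per level and the 0.5 threshold are simplifying assumptions.

```python
import torch

@torch.no_grad()
def min_indistinguishable_step(c_model, A0, s, alpha_bars, thresh: float = 0.5):
    """k*(A): smallest noise level at which the classifier assigns the
    noised human sequence probability >= thresh of being robot-origin.
    Expects a batch dimension of 1; averaging several noise draws per
    level would give a more stable estimate."""
    alpha_bars = alpha_bars.to(A0.device)
    for k in range(alpha_bars.shape[0]):
        ab = alpha_bars[k]
        Ak = ab.sqrt() * A0 + (1.0 - ab).sqrt() * torch.randn_like(A0)
        k_t = torch.tensor([k], device=A0.device)
        if torch.sigmoid(c_model(k_t, Ak, s)).item() >= thresh:
            return k
    return alpha_bars.shape[0] - 1  # fully noised samples are always indistinguishable
```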
4. Selective Supervision and Loss Composition
Supervision is assigned by a gating function

$$m(k, A) = \begin{cases} 1 & \text{if } A \in \mathcal{D}_R \ \text{or}\ k \geq k^*(A), \\ 0 & \text{otherwise.} \end{cases}$$

The overall policy training loss is

$$L_{\text{pol}}(\theta) = \mathbb{E}_{(A_0,s),\,k,\,\epsilon}\left[m(k, A)\,\bigl\|\epsilon - \epsilon_\theta(k, A_k, s)\bigr\|^2\right],$$

or, jointly for classifier and policy:

$$L(\theta, \phi) = L_{\text{pol}}(\theta) + L_{\text{cls}}(\phi).$$
This approach ensures that only those human-trajectory fragments which have been sufficiently "blurred" by the diffusion process to appear compatible with robot actuation are actively used in policy learning, preserving high-level intent while filtering out low-level mismatches.
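A sketch of this gated objective in the same style as the earlier snippets; the per-sample `k_star` tensor (precomputed via `min_indistinguishable_step` and ignored for robot samples) and the masked-mean reduction are illustrative choices.

```python
import torch

def gated_policy_loss(eps_model, A0, s, is_robot, k_star, alpha_bars):
    """Denoising loss with the gate m(k, A): robot samples supervise at
    every noise level, human samples only once k >= k*(A)."""
    B = A0.shape[0]
    k = torch.randint(0, alpha_bars.shape[0], (B,), device=A0.device)
    ab = alpha_bars.to(A0.device)[k].view(B, *([1] * (A0.dim() - 1)))
    eps = torch.randn_like(A0)
    Ak = ab.sqrt() * A0 + (1.0 - ab).sqrt() * eps
    m = (is_robot.to(k.device) | (k >= k_star.to(k.device))).float()  # gate m(k, A)
    per_sample = ((eps_model(k, Ak, s) - eps) ** 2).flatten(1).mean(dim=1)
    return (m * per_sample).sum() / m.sum().clamp(min=1.0)            # masked mean
```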
5. Training Procedure and Implementation
The X-Diffusion algorithm proceeds as follows (a minimal end-to-end sketch appears after the list):
- For each trajectory, precompute or sample noised sequences $A_k$ for $k = 1, \dots, K$.
- Train the classifier $c_\phi$ until validation accuracy plateaus, using balanced minibatches from $\mathcal{D}_R$ and $\mathcal{D}_H$ at uniformly sampled noise levels.
- For every human trajectory, compute $k^*(A)$ by evaluating $c_\phi(k, A_k, s)$ over all $k$.
- For policy network training, randomly sample $k$ and $\epsilon$, gate supervision with $m(k, A)$, and update $\epsilon_\theta$ via gradient descent.
- At inference, generate action sequences via standard reverse diffusion steps.
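Stitching the earlier snippets together gives the following end-to-end sketch; the loaders, optimizers, and the per-batch recomputation of $k^*(A)$ are illustrative assumptions, not the paper's implementation.

```python
import torch

def train_x_diffusion(eps_model, c_model, robot_loader, human_loader,
                      alpha_bars, epochs: int = 30):
    """Two-stage training: fit the origin classifier, then run gated
    policy updates over the union of robot and human data."""
    opt_c = torch.optim.Adam(c_model.parameters(), lr=1e-4)   # lr values are placeholders
    opt_p = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

    # Stage 1: origin classifier on balanced robot/human minibatches.
    for (A_r, s_r), (A_h, s_h) in zip(robot_loader, human_loader):
        opt_c.zero_grad()
        classifier_loss(c_model, A_r, s_r, A_h, s_h, alpha_bars).backward()
        opt_c.step()

    # Stage 2: gated policy updates. k*(A) is recomputed per human batch
    # here for brevity; in practice precompute and cache it per trajectory.
    for _ in range(epochs):
        for (A_r, s_r), (A_h, s_h) in zip(robot_loader, human_loader):
            k_star_h = torch.tensor(
                [min_indistinguishable_step(c_model, a[None], x[None], alpha_bars)
                 for a, x in zip(A_h, s_h)])
            A0, s = torch.cat([A_r, A_h]), torch.cat([s_r, s_h])
            is_robot = torch.cat([torch.ones(len(A_r), dtype=torch.bool),
                                  torch.zeros(len(A_h), dtype=torch.bool)])
            k_star = torch.cat([torch.zeros(len(A_r), dtype=torch.long), k_star_h])
            opt_p.zero_grad()
            gated_policy_loss(eps_model, A0, s, is_robot, k_star, alpha_bars).backward()
            torch.nn.utils.clip_grad_norm_(eps_model.parameters(), 5.0)  # clipping at 5, per the paper
            opt_p.step()
```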
X-Diffusion utilizes a U-Net backbone over RGB object masks (96×96 resolution), hand/effector keypoints, and proprioceptive state, with a ResNet-50 encoder. Human-to-robot retargeting uses HaMeR's keypoint-based mapping, triangulation, and Kabsch alignment. The forward diffusion uses a linear $\beta_k$ schedule over $K$ training steps, with 20 inference steps. Networks are trained for 30 epochs with batch size 128 and gradient-norm clipping at 5. The classifier shares the backbone and employs an MLP head with BCE loss (Pace et al., 6 Nov 2025).
6. Experimental Results and Comparative Performance
X-Diffusion was evaluated across five real-world manipulation tasks: Serve Egg, Push Plate, Close Drawer, Mug on Rack, and Bottle Upright. Baseline methods include:
- Diffusion Policy (robot-only)
- Point Policy (naïve co-training in keypoint space)
- Motion Tracks (naïve co-training in image space)
- DemoDiffusion (one-shot human→robot switch during diffusion)
X-Diffusion consistently outperformed all baselines, with an average task success rate 16% higher than the strongest baseline. For instance, on Push Plate and Serve Egg, policies trained only on robot data and those naïvely co-trained with human data each achieved a 40% success rate, while X-Diffusion achieved 90%. Similarly, on Mug on Rack, X-Diffusion reached 100%, compared to 50% for robot-only training. An oracle baseline that manually filtered infeasible human trajectories improved naïve co-training but remained inferior to classifier-based selective co-training, demonstrating that the X-Diffusion classifier recovered feasible guidance even from otherwise misleading human data (Pace et al., 6 Nov 2025).
7. Significance and Broader Context
By exploiting the "blurring" effect of the diffusion forward process, X-Diffusion identifies at which noise levels human actions become robot-compatible and gates supervision accordingly. This maximizes leverage of large-scale, diverse human demonstration data for fine-grained robot policy learning while sidestepping the risk of learning physically unrealistic actions. A plausible implication is that similar classifier-gated, noise-dependent curriculum formulations can facilitate transfer across other major domain gaps, given an appropriate indistinguishability criterion. The approach demonstrates that careful integration of cross-embodiment data under the diffusion modeling paradigm yields robust performance gains over both expert-only and naïve co-training strategies (Pace et al., 6 Nov 2025).