Diffusion Policy Co-training
- The paper introduces X-Diffusion, a diffusion framework that selectively integrates human and robot data to overcome cross-embodiment mismatches in training.
- The method employs a diffusion process combined with a classifier-based gating mechanism to filter out nonviable human trajectories for robot policy learning.
- Experimental results demonstrate significant gains, with task success rates reaching up to 100%, versus 40-50% for robot-only or naïve co-training methods.
Diffusion policy co-training refers to methods that leverage human and robot demonstration data jointly when training diffusion policy networks for robotics. While large-scale human demonstration videos are easily collected, direct use of human demonstrations for robot policy learning is fundamentally constrained by cross-embodiment mismatches in kinematics and dynamics. Naïve co-training—unifying human and robot datasets under a conventional behavior cloning loss—often degrades policy performance by encouraging physically infeasible or dynamically inappropriate actions in robotic actuators. The X-Diffusion framework exemplifies a principled approach for co-training diffusion-based policies on cross-embodiment demonstration data by exploiting the inherent structure of the diffusion process to selectively leverage human guidance only where it is compatible with robot capabilities (Pace et al., 6 Nov 2025).
1. Problem Setting and Motivation
The objective is to learn a diffusion policy $\pi_\theta(A \mid s)$, which predicts an $H$-step action sequence $A$ from the current state $s$. Two partially overlapping datasets are available:
- A small robot dataset $\mathcal{D}_R$ comprising expert tele-operated demonstrations on the robot hardware.
- A large human dataset $\mathcal{D}_H$ obtained by retargeting human hand motions into the robot's end-effector space.
Naïve co-training minimizes a behavior cloning objective pooled over both datasets,

$$L_{\text{naive}}(\theta) = \mathbb{E}_{(A,s)\sim \mathcal{D}_R \cup \mathcal{D}_H}\left[\ell\bigl(\pi_\theta; A, s\bigr)\right],$$
but is empirically found to perform suboptimally due to mismatches in data support; human demonstrations can induce trajectories that are impossible or suboptimal for the robot agent (Pace et al., 6 Nov 2025). Selective incorporation of human data conditioned on its alignment with robot capabilities is thus central to effective co-training.
2. Diffusion Processes for Sequence Modeling
Diffusion policies model sequence generation via a forward-noising and reverse-denoising process, extending denoising diffusion probabilistic models (DDPMs) to action trajectories. For each trajectory, the forward process generates noised actions

$$A_k = \sqrt{\bar{\alpha}_k}\,A_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon,$$

where
- $\beta_1, \dots, \beta_K$ specifies a fixed noise schedule,
- $\alpha_k = 1 - \beta_k$, $\bar{\alpha}_k = \prod_{i=1}^{k} \alpha_i$,
- $\epsilon \sim \mathcal{N}(0, I)$, $k \in \{1, \dots, K\}$.
The reverse process is parameterized by a U-Net $\epsilon_\theta(k, A_k, s)$, predicting either the denoised target $A_0$ or the added noise $\epsilon$. Training minimizes the denoising loss

$$L_{\text{DDPM}}(\theta) = \mathbb{E}_{(A_0,s),\,k,\,\epsilon}\left[\bigl\|\epsilon - \epsilon_\theta(k, A_k, s)\bigr\|^2\right]$$

over all steps $k \in \{1, \dots, K\}$. This process enables learning policies capable of reconstructing precise sequential actions from highly perturbed initializations (Pace et al., 6 Nov 2025).
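The mechanics above reduce to a few lines of PyTorch. Below is a minimal sketch; the `eps_model(k, Ak, s)` interface and the linear-schedule endpoints are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def make_alpha_bars(K: int, beta_min: float = 1e-4, beta_max: float = 0.02) -> torch.Tensor:
    """Cumulative noise products alpha_bar_k for a linear beta schedule
    (endpoint values are common DDPM defaults, assumed here)."""
    betas = torch.linspace(beta_min, beta_max, K)
    return torch.cumprod(1.0 - betas, dim=0)

def ddpm_loss(eps_model, A0: torch.Tensor, s: torch.Tensor,
              alpha_bars: torch.Tensor) -> torch.Tensor:
    """Noise a clean action sequence A0 (shape [B, H, dim]) at a random
    step k and regress the added noise, conditioned on state s."""
    B = A0.shape[0]
    k = torch.randint(0, alpha_bars.shape[0], (B,), device=A0.device)
    ab = alpha_bars.to(A0.device)[k].view(B, *([1] * (A0.dim() - 1)))
    eps = torch.randn_like(A0)
    Ak = ab.sqrt() * A0 + (1.0 - ab).sqrt() * eps   # forward noising
    return F.mse_loss(eps_model(k, Ak, s), eps)     # epsilon-prediction objective
```

The same noising routine is reused by the classifier and the gated policy loss below.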
3. Cross-Embodiment Indistinguishability via Noising and Classification
To address the cross-embodiment gap, X-Diffusion proposes first training a binary classifier $c_\phi(k, A_k, s)$ that predicts whether a noised action sequence originated from a robot (label $1$) or a human (label $0$). The classifier is optimized using binary cross-entropy:

\begin{align*}
L_{\text{cls}}(\phi) = \; &\mathbb{E}_{(k, A, s)\sim\mathcal{D}_R}\left[-\log c_\phi(k, A_k, s)\right] \\
+\; &\mathbb{E}_{(k, A, s)\sim\mathcal{D}_H}\left[-\log\bigl(1 - c_\phi(k, A_k, s)\bigr)\right]
\end{align*}

with balanced sampling from $\mathcal{D}_R$ and $\mathcal{D}_H$. Increasing the noise level $k$ suppresses low-level execution details; thus, for sufficiently high $k$, even human-origin demonstrations become indistinguishable from robot data.
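A minimal sketch of this classifier objective, assuming `c_model` returns raw logits (so the cross-entropy is computed in its numerically stable logit form) and reusing the noising step from the earlier snippet:

```python
import torch
import torch.nn.functional as F

def classifier_loss(c_model, A_r, s_r, A_h, s_h, alpha_bars):
    """BCE for the origin classifier on a balanced batch: noised robot
    sequences get label 1, noised human sequences label 0."""
    def noised(A0):
        B = A0.shape[0]
        k = torch.randint(0, alpha_bars.shape[0], (B,), device=A0.device)
        ab = alpha_bars.to(A0.device)[k].view(B, *([1] * (A0.dim() - 1)))
        return k, ab.sqrt() * A0 + (1.0 - ab).sqrt() * torch.randn_like(A0)

    k_r, Ak_r = noised(A_r)
    k_h, Ak_h = noised(A_h)
    logit_r = c_model(k_r, Ak_r, s_r)   # robot-origin: target 1
    logit_h = c_model(k_h, Ak_h, s_h)   # human-origin: target 0
    return (F.binary_cross_entropy_with_logits(logit_r, torch.ones_like(logit_r))
            + F.binary_cross_entropy_with_logits(logit_h, torch.zeros_like(logit_h)))
```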
The minimum indistinguishability step for a human trajectory $A$ is defined as

$$k^*(A) = \min\bigl\{k \in \{1, \dots, K\} : c_\phi(k, A_k, s) \geq 0.5\bigr\}.$$

This encodes the earliest point in the forward diffusion process at which the source of the trajectory cannot be reliably determined.
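Under this definition, $k^*(A)$ can be estimated by sweeping the noise levels, as in the sketch below; the single noise draw per level and the 0.5 threshold are simplifying assumptions.

```python
import torch

@torch.no_grad()
def min_indistinguishable_step(c_model, A0, s, alpha_bars, thresh: float = 0.5):
    """k*(A): smallest noise level at which the classifier assigns the
    noised human sequence probability >= thresh of being robot-origin.
    Expects a batch dimension of 1; averaging several noise draws per
    level would give a more stable estimate."""
    alpha_bars = alpha_bars.to(A0.device)
    for k in range(alpha_bars.shape[0]):
        ab = alpha_bars[k]
        Ak = ab.sqrt() * A0 + (1.0 - ab).sqrt() * torch.randn_like(A0)
        k_t = torch.tensor([k], device=A0.device)
        if torch.sigmoid(c_model(k_t, Ak, s)).item() >= thresh:
            return k
    return alpha_bars.shape[0] - 1  # fully noised samples are always indistinguishable
```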
4. Selective Supervision and Loss Composition
Supervision is assigned by a gating function

$$m(k, A) = \begin{cases} 1 & \text{if } A \in \mathcal{D}_R \ \text{or}\ k \geq k^*(A), \\ 0 & \text{otherwise.} \end{cases}$$

The overall policy training loss is

$$L_{\text{pol}}(\theta) = \mathbb{E}_{(A_0,s),\,k,\,\epsilon}\left[m(k, A)\,\bigl\|\epsilon - \epsilon_\theta(k, A_k, s)\bigr\|^2\right],$$

or, jointly for classifier and policy:

$$L(\theta, \phi) = L_{\text{pol}}(\theta) + L_{\text{cls}}(\phi).$$
This approach ensures that only those human-trajectory fragments which have been sufficiently "blurred" by the diffusion process to appear compatible with robot actuation are actively used in policy learning, preserving high-level intent while filtering out low-level mismatches.
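A sketch of this gated objective in the same style as the earlier snippets; the per-sample `k_star` tensor (precomputed via `min_indistinguishable_step` and ignored for robot samples) and the masked-mean reduction are illustrative choices.

```python
import torch

def gated_policy_loss(eps_model, A0, s, is_robot, k_star, alpha_bars):
    """Denoising loss with the gate m(k, A): robot samples supervise at
    every noise level, human samples only once k >= k*(A)."""
    B = A0.shape[0]
    k = torch.randint(0, alpha_bars.shape[0], (B,), device=A0.device)
    ab = alpha_bars.to(A0.device)[k].view(B, *([1] * (A0.dim() - 1)))
    eps = torch.randn_like(A0)
    Ak = ab.sqrt() * A0 + (1.0 - ab).sqrt() * eps
    m = (is_robot.to(k.device) | (k >= k_star.to(k.device))).float()  # gate m(k, A)
    per_sample = ((eps_model(k, Ak, s) - eps) ** 2).flatten(1).mean(dim=1)
    return (m * per_sample).sum() / m.sum().clamp(min=1.0)            # masked mean
```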
5. Training Procedure and Implementation
The X-Diffusion algorithm proceeds as follows (a minimal end-to-end sketch appears after the list):
- For each trajectory, precompute or sample noised sequences $A_k$ for $k = 1, \dots, K$.
- Train the classifier $c_\phi$ until validation accuracy plateaus, using balanced minibatches from $\mathcal{D}_R$ and $\mathcal{D}_H$ at uniformly sampled noise levels.
- For every human trajectory, compute $k^*(A)$ by evaluating $c_\phi(k, A_k, s)$ over all $k$.
- For policy network training, randomly sample $k$ and $\epsilon$, gate supervision with $m(k, A)$, and update $\epsilon_\theta$ via gradient descent.
- At inference, generate action sequences via standard reverse diffusion steps.
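Stitching the earlier snippets together gives the following end-to-end sketch; the loaders, optimizers, and the per-batch recomputation of $k^*(A)$ are illustrative assumptions, not the paper's implementation.

```python
import torch

def train_x_diffusion(eps_model, c_model, robot_loader, human_loader,
                      alpha_bars, epochs: int = 30):
    """Two-stage training: fit the origin classifier, then run gated
    policy updates over the union of robot and human data."""
    opt_c = torch.optim.Adam(c_model.parameters(), lr=1e-4)   # lr values are placeholders
    opt_p = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

    # Stage 1: origin classifier on balanced robot/human minibatches.
    for (A_r, s_r), (A_h, s_h) in zip(robot_loader, human_loader):
        opt_c.zero_grad()
        classifier_loss(c_model, A_r, s_r, A_h, s_h, alpha_bars).backward()
        opt_c.step()

    # Stage 2: gated policy updates. k*(A) is recomputed per human batch
    # here for brevity; in practice precompute and cache it per trajectory.
    for _ in range(epochs):
        for (A_r, s_r), (A_h, s_h) in zip(robot_loader, human_loader):
            k_star_h = torch.tensor(
                [min_indistinguishable_step(c_model, a[None], x[None], alpha_bars)
                 for a, x in zip(A_h, s_h)])
            A0, s = torch.cat([A_r, A_h]), torch.cat([s_r, s_h])
            is_robot = torch.cat([torch.ones(len(A_r), dtype=torch.bool),
                                  torch.zeros(len(A_h), dtype=torch.bool)])
            k_star = torch.cat([torch.zeros(len(A_r), dtype=torch.long), k_star_h])
            opt_p.zero_grad()
            gated_policy_loss(eps_model, A0, s, is_robot, k_star, alpha_bars).backward()
            torch.nn.utils.clip_grad_norm_(eps_model.parameters(), 5.0)  # clipping at 5, per the paper
            opt_p.step()
```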
X-Diffusion utilizes a U-Net backbone over RGB object masks (96×96 resolution), hand/effector keypoints, and proprioceptive state, with a ResNet-50 encoder. Human-to-robot retargeting uses HaMeR's keypoint-based mapping, triangulation, and Kabsch alignment. The forward diffusion uses a linear $\beta_k$ schedule over $K$ training steps, with 20 inference steps. Networks are trained for 30 epochs with batch size 128 and gradient-norm clipping at 5. The classifier shares the backbone and employs an MLP head with BCE loss (Pace et al., 6 Nov 2025).
6. Experimental Results and Comparative Performance
X-Diffusion was evaluated across five real-world manipulation tasks: Serve Egg, Push Plate, Close Drawer, Mug on Rack, and Bottle Upright. Baseline methods include:
- Diffusion Policy (robot-only)
- Point Policy (naïve co-training in keypoint space)
- Motion Tracks (naïve co-training in image space)
- DemoDiffusion (one-shot human→robot switch during diffusion)
X-Diffusion consistently outperformed all baselines, with an average task success rate 16% higher than the strongest baseline. For instance, on Push Plate and Serve Egg, policies trained only on robot data and those naïvely co-trained with human data each achieved a 40% success rate, while X-Diffusion achieved 90%. Similarly, on Mug on Rack, X-Diffusion reached 100%, compared to 50% for robot-only training. An oracle baseline that manually filtered infeasible human trajectories improved naïve co-training but remained inferior to classifier-based selective co-training, demonstrating that the X-Diffusion classifier recovered feasible guidance even from otherwise misleading human data (Pace et al., 6 Nov 2025).
7. Significance and Broader Context
By exploiting the "blurring" effect of the diffusion forward process, X-Diffusion identifies at which noise levels human actions become robot-compatible and gates supervision accordingly. This maximizes leverage of large-scale, diverse human demonstration data for fine-grained robot policy learning while sidestepping the risk of learning physically unrealistic actions. A plausible implication is that similar classifier-gated, noise-dependent curriculum formulations can facilitate transfer across other major domain gaps, given an appropriate indistinguishability criterion. The approach demonstrates that careful integration of cross-embodiment data under the diffusion modeling paradigm yields robust performance gains over both expert-only and naïve co-training strategies (Pace et al., 6 Nov 2025).