Diffusion Policy Co-training

Updated 27 December 2025
  • The paper introduces X-Diffusion, a diffusion framework that selectively integrates human and robot data to overcome cross-embodiment mismatches in training.
  • The method employs a diffusion process combined with a classifier-based gating mechanism to filter out nonviable human trajectories for robot policy learning.
  • Experimental results demonstrate significant gains, with task success rates reaching up to 100%, compared to 40-50% for robot-only or naïvely co-trained baselines.

Diffusion policy co-training refers to methods that leverage human and robot demonstration data jointly when training diffusion policy networks for robotics. While large-scale human demonstration videos are easily collected, direct use of human demonstrations for robot policy learning is fundamentally constrained by cross-embodiment mismatches in kinematics and dynamics. Naïve co-training—unifying human and robot datasets under a conventional behavior cloning loss—often degrades policy performance by encouraging physically infeasible or dynamically inappropriate actions in robotic actuators. The X-Diffusion framework exemplifies a principled approach for co-training diffusion-based policies on cross-embodiment demonstration data by exploiting the inherent structure of the diffusion process to selectively leverage human guidance only where it is compatible with robot capabilities (Pace et al., 6 Nov 2025).

1. Problem Setting and Motivation

The objective is to learn a diffusion policy $\pi_\theta(A_t \mid s_t)$, which predicts an $S$-step action sequence $A_t = (a_t, \ldots, a_{t+S-1})$ from the current state $s_t$. Two partially overlapping datasets are available:

  • A small robot dataset $\mathcal{D}_R = \{(s_t, A_t)\}$ comprising expert tele-operated demonstrations on the robot hardware.
  • A large human dataset $\mathcal{D}_H = \{(s_t, A_t)\}$ obtained by retargeting human hand motions into the robot's end-effector space.

Naïve co-training minimizes

$$L_{\text{co}}(\theta) = \mathbb{E}_{(s,A)\sim\mathcal{D}_R \cup \mathcal{D}_H}\left[\ell(\pi_\theta(s), A)\right]$$

but is empirically found to perform suboptimally due to mismatches in data support; human demonstrations can induce trajectories that are impossible or suboptimal for the robot agent (Pace et al., 6 Nov 2025). Selective incorporation of human data conditioned on its alignment with robot capabilities is thus central to effective co-training.
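For reference, a minimal sketch of this naïve co-training baseline is shown below. It is PyTorch-style pseudocode under stated assumptions: the dataset objects, the policy call signature, and the loss_fn helper are hypothetical placeholders, not part of the paper.

    import torch
    from torch.utils.data import ConcatDataset, DataLoader

    def naive_cotraining_epoch(policy, robot_data, human_data, optimizer, loss_fn):
        # Pool both embodiments into one behavior-cloning dataset (D_R union D_H).
        loader = DataLoader(ConcatDataset([robot_data, human_data]),
                            batch_size=128, shuffle=True)
        for state, actions in loader:
            loss = loss_fn(policy(state), actions)   # l(pi_theta(s), A)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()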

2. Diffusion Processes for Sequence Modeling

Diffusion policies model sequence generation via a forward-noising and reverse-denoising process, extending denoising diffusion probabilistic models (DDPMs) to action trajectories. For each trajectory, the forward process generates noised actions $A_t^k$, where

  • $\beta_k \in (0,1)$ specifies a fixed noise schedule,
  • $\alpha_k = 1 - \beta_k$, $\bar{\alpha}_k = \prod_{i=1}^{k} \alpha_i$,
  • $A_t^k = \sqrt{\bar{\alpha}_k}\, A_t^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
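The forward process above can be sketched in a few lines of PyTorch; the schedule values follow the implementation details given in Section 5, and the helper name noise_actions is illustrative rather than taken from the paper.

    import torch

    K = 100                                      # number of forward diffusion steps
    betas = torch.linspace(1e-4, 0.02, K)        # beta_k, fixed linear noise schedule
    alphas = 1.0 - betas                         # alpha_k = 1 - beta_k
    alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_k = prod_{i<=k} alpha_i

    def noise_actions(A0, k):
        """Sample A^k from a clean action sequence A^0 at noise level k (1-indexed)."""
        eps = torch.randn_like(A0)               # epsilon ~ N(0, I)
        a_bar = alpha_bars[k - 1]
        Ak = torch.sqrt(a_bar) * A0 + torch.sqrt(1.0 - a_bar) * eps
        return Ak, eps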

The reverse process is parameterized by a U-Net $f_\theta$, predicting either the denoised target or the added noise. Training minimizes the denoising loss

$$\ell_{\text{den}}(\theta) = \mathbb{E}\left[\|A_t^{k-1} - f_\theta(A_t^k, k, s_t)\|^2\right]$$

over all steps $k = 1 \dots K$. This process enables learning policies capable of reconstructing precise sequential actions from highly perturbed initializations (Pace et al., 6 Nov 2025).
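A sketch of the corresponding per-sample training loss, reusing the schedule and noise_actions convention above, is given below. Here f_theta stands in for the conditional U-Net, and forming the $A^{k-1}$ target by reusing the same noise draw is a simplifying assumption of this sketch, not a detail confirmed by the paper.

    def denoising_loss(f_theta, A0, s, k):
        """Denoising loss at noise level k: f_theta predicts A^{k-1} from (A^k, k, s)."""
        eps = torch.randn_like(A0)
        Ak = torch.sqrt(alpha_bars[k - 1]) * A0 + torch.sqrt(1.0 - alpha_bars[k - 1]) * eps
        # Target A^{k-1}: re-noise A^0 to level k-1 with the same epsilon (A^0 when k == 1).
        if k == 1:
            Ak_prev = A0
        else:
            Ak_prev = torch.sqrt(alpha_bars[k - 2]) * A0 + torch.sqrt(1.0 - alpha_bars[k - 2]) * eps
        return ((Ak_prev - f_theta(Ak, k, s)) ** 2).mean()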

3. Cross-Embodiment Indistinguishability via Noising and Classification

To address the cross-embodiment gap, X-Diffusion proposes first training a binary classifier $c_\phi(k, A_t^k, s_t) = P(y = 1 \mid k, A_t^k, s_t)$ that predicts whether a noised action sequence originated from a robot (label $y = 1$) or a human (label $y = 0$). The classifier is optimized using binary cross-entropy:

$$L_{\text{cls}}(\phi) = \mathbb{E}_{(k,A,s)\sim\mathcal{D}_R}\left[-\log c_\phi(k, A^k, s)\right] + \mathbb{E}_{(k,A,s)\sim\mathcal{D}_H}\left[-\log\left(1 - c_\phi(k, A^k, s)\right)\right]$$

with balanced sampling from $\mathcal{D}_R$ and $\mathcal{D}_H$. Increasing the noise level $k$ causes low-level execution details to be suppressed; thus, for sufficiently high $k$, even human-origin demonstrations become indistinguishable from robot data.
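A single classifier update under this objective might look as follows; the sketch assumes the classifier module returns pre-sigmoid logits, reuses noise_actions and K from Section 2, and all names are illustrative.

    import torch
    import torch.nn.functional as F

    def classifier_step(classifier, robot_batch, human_batch, optimizer):
        """One balanced BCE update: robot-origin sequences labeled 1, human-origin 0."""
        (A_r, s_r), (A_h, s_h) = robot_batch, human_batch
        k = torch.randint(1, K + 1, (1,)).item()      # uniformly sampled noise level
        Ak_r, _ = noise_actions(A_r, k)
        Ak_h, _ = noise_actions(A_h, k)
        logits_r = classifier(k, Ak_r, s_r)           # c_phi(k, A^k, s), pre-sigmoid
        logits_h = classifier(k, Ak_h, s_h)
        loss = F.binary_cross_entropy_with_logits(logits_r, torch.ones_like(logits_r)) \
             + F.binary_cross_entropy_with_logits(logits_h, torch.zeros_like(logits_h))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()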

The minimum indistinguishability step for a human trajectory $A$ is defined as

$$k^*(A) = \min\left\{\, k : c_\phi(k, A^k, s) \geq 0.5 \,\right\}$$

This encodes the earliest point in the forward diffusion process where the source of the trajectory cannot be reliably determined.
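In code, $k^*(A)$ can be obtained by sweeping the trained classifier over all noise levels, as in the sketch below; returning K when no level crosses the 0.5 threshold is an assumption for an edge case the text does not specify.

    def min_indistinguishable_step(classifier, A0, s):
        """Smallest k at which a noised human sequence is classified as robot-like."""
        for k in range(1, K + 1):
            Ak, _ = noise_actions(A0, k)
            p_robot = torch.sigmoid(classifier(k, Ak, s))
            if p_robot.item() >= 0.5:
                return k
        return K    # assumption: fall back to the largest noise level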

4. Selective Supervision and Loss Composition

Supervision is assigned by a gating function

$$\lambda(k, A) = \begin{cases} 1 & \text{if } (A \in \mathcal{D}_R) \text{ or } (A \in \mathcal{D}_H \text{ and } k \geq k^*(A)) \\ 0 & \text{otherwise} \end{cases}$$

The overall policy training loss is

$$L_{\text{XDP}}(\theta) = \mathbb{E}_{(k,A,s)\sim\mathcal{D}_R}\left[\|A^{k-1} - f_\theta(A^k, k, s)\|^2\right] + \mathbb{E}_{(k,A,s)\sim\mathcal{D}_H}\left[\mathbf{1}_{k \geq k^*(A)}\,\|A^{k-1} - f_\theta(A^k, k, s)\|^2\right]$$

or, jointly for classifier and policy:

$$L(\theta, \phi) = \mathbb{E}\left[\lambda(k,A)\,\|A^{k-1} - f_\theta(A^k, k, s)\|^2 + (1 - \lambda(k,A))\,L_{\text{cls}}(\phi; k, A^k, s)\right]$$

This approach ensures that only those human-trajectory fragments which have been sufficiently "blurred" by the diffusion process to appear compatible with robot actuation are actively used in policy learning, preserving high-level intent while filtering out low-level mismatches.
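A per-sample sketch of this gated supervision, reusing denoising_loss from Section 2 and a precomputed k_star from Section 3, is shown below; the classifier term of the joint loss is omitted for brevity, and in practice gated-out samples would simply be skipped rather than contribute a constant zero.

    def xdp_loss(f_theta, A0, s, is_robot, k_star=None):
        """Gated X-Diffusion policy loss for one (A, s) sample: lambda(k, A) in code form."""
        k = torch.randint(1, K + 1, (1,)).item()        # sample a noise level
        gate = is_robot or (k_star is not None and k >= k_star)
        if not gate:
            return torch.zeros(())                      # human sample below k*(A): no policy supervision
        return denoising_loss(f_theta, A0, s, k)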

5. Training Procedure and Implementation

The X-Diffusion algorithm proceeds as follows:

  1. For each trajectory, precompute or sample noised sequences $A^k$ for $k = 1 \dots K$.
  2. Train the classifier $c_\phi$ until validation accuracy plateaus, using balanced minibatches from $\mathcal{D}_R$ and $\mathcal{D}_H$ at uniformly sampled noise levels.
  3. For every human trajectory, compute $k^*(A)$ by evaluating $c_\phi(k, A^k, s)$ over all $k$.
  4. For policy network $f_\theta$ training, randomly sample $(A, s)$ and $k$, gate supervision with $\lambda(k, A)$, and update $\theta$ via gradient descent.
  5. At inference, generate action sequences via standard reverse diffusion steps (see the sketch below).
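A minimal sketch of the inference step is given below, assuming (as in the loss above) that f_theta directly predicts the previous-step sequence $A^{k-1}$; the paper's use of 20 inference steps rather than all $K = 100$ would additionally require a strided schedule, which this sketch omits.

    @torch.no_grad()
    def sample_actions(f_theta, s, action_shape):
        """Generate an action sequence by iterating the learned reverse (denoising) steps."""
        Ak = torch.randn(action_shape)        # start from pure noise, A^K ~ N(0, I)
        for k in range(K, 0, -1):             # k = K, K-1, ..., 1
            Ak = f_theta(Ak, k, s)            # predict A^{k-1} from (A^k, k, s)
        return Ak                             # A^0: the action sequence to execute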

X-Diffusion utilizes a U-Net backbone over RGB object masks ($96 \times 96$ resolution), hand/effector keypoints, and proprioceptive state, with a ResNet-50 encoder. Human-to-robot retargeting uses HaMeR's keypoint-based mapping, triangulation, and Kabsch alignment. The forward diffusion uses $K = 100$ training steps (with $\beta_k$ linearly scheduled in $[10^{-4}, 0.02]$) and 20 inference steps. Networks are trained for 30 epochs at a learning rate of $10^{-4}$, batch size 128, and gradient clipping of 5. The classifier shares the backbone and employs an MLP head with BCE loss (Pace et al., 6 Nov 2025).
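For concreteness, the reported hyperparameters can be gathered into one configuration object (a sketch; the field names are illustrative, while the values are those stated above).

    from dataclasses import dataclass

    @dataclass
    class XDiffusionConfig:
        # Diffusion schedule
        train_steps: int = 100         # K, forward diffusion steps
        inference_steps: int = 20      # reverse steps used at deployment
        beta_start: float = 1e-4       # linear beta schedule lower bound
        beta_end: float = 0.02         # linear beta schedule upper bound
        # Optimization
        epochs: int = 30
        learning_rate: float = 1e-4
        batch_size: int = 128
        grad_clip: float = 5.0
        # Observations
        image_size: int = 96           # RGB object-mask resolution (96 x 96)
        encoder: str = "resnet50"      # visual backbone shared with the classifier head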

6. Experimental Results and Comparative Performance

X-Diffusion was evaluated across five real-world manipulation tasks: Serve Egg, Push Plate, Close Drawer, Mug on Rack, and Bottle Upright. Baseline methods include:

  • Diffusion Policy (robot-only)
  • Point Policy (naïve co-training in keypoint space)
  • Motion Tracks (naïve co-training in image space)
  • DemoDiffusion (one-shot human\rightarrowrobot switch during diffusion)

X-Diffusion consistently outperformed all baselines. Average task success rate increased by 16% over the strongest baseline. For instance, on Push Plate and Serve Egg, both policies trained only on robot data and those naïvely co-trained with human data achieved 40% success rate, while X-Diffusion achieved 90%. Similarly, on Mug on Rack, X-Diffusion reached 100%, compared to 50% for robot-only. An oracle baseline that manually filtered infeasible human trajectories improved naïve co-training but remained inferior to classifier-based selective co-training, demonstrating that the X-Diffusion classifier recovered feasible guidance even from otherwise misleading human data (Pace et al., 6 Nov 2025).

7. Significance and Broader Context

By exploiting the "blurring" effect of the diffusion forward process, X-Diffusion identifies at which noise levels human actions become robot-compatible and gates supervision accordingly. This maximizes leverage of large-scale, diverse human demonstration data for fine-grained robot policy learning while sidestepping the risk of learning physically unrealistic actions. A plausible implication is that similar classifier-gated, noise-dependent curriculum formulations can facilitate transfer across other major domain gaps, given an appropriate indistinguishability criterion. The approach demonstrates that careful integration of cross-embodiment data under the diffusion modeling paradigm yields robust performance gains over both expert-only and naïve co-training strategies (Pace et al., 6 Nov 2025).
