
SD2AIL: Synthetic Demonstrations for AIL

Updated 28 December 2025
  • SD2AIL is an adversarial imitation learning framework that employs diffusion models to synthesize expert-like trajectories for robust reward inference and policy optimization.
  • The framework integrates pseudo-expert generation with prioritized expert demonstration replay, effectively augmenting scarce expert datasets and improving sample efficiency.
  • Empirical results on MuJoCo tasks demonstrate that SD2AIL outperforms baselines, achieving higher stability and performance even in challenging low-data regimes.

SD2AIL (“Synthetic Demonstrations to Adversarial Imitation Learning”) is an adversarial imitation learning (AIL) framework that leverages diffusion models to generate synthetic, expert-like demonstrations for reward inference and policy optimization. SD2AIL addresses the challenge of limited expert trajectory data by augmenting small expert datasets with high-quality synthetic samples (pseudo-experts) generated via a conditional denoising diffusion probabilistic model, thus improving AIL performance and stability even in low-data regimes. This methodology is integrated into a discriminator’s learning process and further facilitated by a prioritized expert demonstration replay (PEDR) strategy, enabling scalable and robust imitation learning from sparse demonstrations (Li et al., 21 Dec 2025).

1. Background and Motivation

AIL achieves policy learning by training a discriminator, $D$, to distinguish between expert and agent-generated (policy) trajectories, while the generator policy $\pi_\theta$ seeks to fool the discriminator, as in Generative Adversarial Imitation Learning (GAIL). AIL methods typically require many high-quality expert trajectories for reliable reward inference and stable agent training. However, expert data acquisition is often costly in practical settings.

Previous works have introduced diffusion models in AIL for denoising representation learning or loss refinement (notably DiffAIL and DRAIL) but have not utilized the generative capacity of diffusion models to synthesize new expert-like trajectories for direct augmentation of the expert dataset. SD2AIL introduces diffusion-based data synthesis as a core primitive to address expert data scarcity, enabling more effective and sample-efficient adversarial imitation learning.

2. Model Structure and Training Objectives

The SD2AIL algorithm comprises three central modules: (1) a diffusion-enhanced discriminator $D_\phi$, (2) an agent policy $\pi_\theta$ learned using Soft Actor-Critic (SAC), and (3) replay buffers for real expert ($\mathcal{R}_e$) and pseudo-expert ($\mathcal{R}_{pe}$) samples.

2.1 Notation

  • $\mathcal{S}, \mathcal{A}$: state and action spaces
  • $\pi_e$: real expert policy / dataset
  • $\pi_{pe}$: pseudo-expert policy (diffusion-generated, filtered)
  • $\pi_\theta$: agent policy with parameters $\theta$
  • $D_\phi(s, a, \epsilon)$: discriminator output for input $(s, a)$ with parameters $\phi$
  • $T$: total diffusion steps
  • $\beta^t, \alpha^t, \bar\alpha^t$: diffusion variances and cumulative products
  • $\epsilon_\phi(x, t)$: neural network predicting diffusion noise
  • $\tau$: confidence threshold for pseudo-expert filtering
  • $k$: mini-batch size of real expert samples
  • $|\pi_{pe}| : |\pi_e| = 7{:}1$: pseudo-to-real sample ratio

2.2 Diffusion Model Loss

The forward diffusion adds noise at each step:
$$q(x^t \mid x^{t-1}) = \mathcal{N}\!\left(x^t;\ \sqrt{\alpha^t}\, x^{t-1},\ \beta^t I\right)$$

The reverse process is parameterized as:
$$p_\phi(x^{t-1} \mid x^t) = \mathcal{N}\!\left(x^{t-1};\ \mu_\phi(x^t, t),\ \sigma_t^2 I\right)$$
with
$$\mu_\phi(x^t, t) = \frac{1}{\sqrt{\alpha^t}} \left(x^t - \frac{\beta^t}{\sqrt{1-\bar\alpha^t}}\, \epsilon_\phi(x^t, t)\right), \qquad \sigma_t^2 = \beta^t$$

The loss for diffusion training is:
$$L_{\mathrm{diff}}(\phi) = \mathbb{E}_{t,\, x^0,\, \epsilon \sim \mathcal{N}(0, I)} \left\| \epsilon - \epsilon_\phi\!\left(\sqrt{\bar\alpha^t}\, x^0 + \sqrt{1 - \bar\alpha^t}\, \epsilon,\ t\right) \right\|^2$$
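A minimal PyTorch-style sketch of this loss, assuming a noise-prediction network `eps_model(x_t, t)` and a precomputed tensor `alpha_bar` of cumulative products $\bar\alpha^t$ (names are illustrative, not taken from the released code):

```python
import torch

def diffusion_loss(eps_model, x0, alpha_bar):
    """L_diff: predict the noise added to clean (s, a) vectors x0 at a random step t.

    eps_model : callable eps_phi(x_t, t) -> predicted noise (illustrative name)
    x0        : batch of clean state-action vectors, shape (B, d)
    alpha_bar : tensor of cumulative products bar{alpha}^t, shape (T,)
    """
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)        # uniform diffusion step per sample
    a_bar = alpha_bar[t].unsqueeze(-1)                      # (B, 1) for broadcasting
    eps = torch.randn_like(x0)                              # epsilon ~ N(0, I)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps    # forward diffusion in closed form
    eps_pred = eps_model(x_t, t)
    return ((eps - eps_pred) ** 2).sum(dim=-1).mean()       # squared-error noise-matching loss
```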

2.3 Diffusion-Enhanced Discriminator

The discriminator integrates the diffusion loss as a confidence score:
$$D_\phi(s_i, a_i) = \frac{1}{T}\sum_{t=1}^{T} \exp\!\left(-L_\phi(s_i^0, a_i^0, t)\right)$$
The surrogate reward for reinforcement learning is:
$$R_\phi(s, a) = -\log\left(1 - D_\phi(s, a)\right)$$
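A sketch of how this confidence score and surrogate reward could be computed, assuming a hypothetical helper `per_step_loss(x0, t)` that returns the per-sample diffusion loss $L_\phi(s^0, a^0, t)$ for a batch:

```python
import torch

def discriminator_confidence(per_step_loss, x0, T):
    """D_phi(s, a): average over T diffusion steps of exp(-L_phi(s^0, a^0, t))."""
    losses = torch.stack([per_step_loss(x0, t) for t in range(1, T + 1)], dim=0)  # (T, B)
    return torch.exp(-losses).mean(dim=0)                                         # (B,)

def surrogate_reward(d, eps=1e-8):
    """R_phi(s, a) = -log(1 - D_phi(s, a)); the clamp is for numerical stability."""
    return -torch.log((1.0 - d).clamp_min(eps))
```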

2.4 Adversarial Objective

The discriminator is trained to output high confidence on both real and pseudo-expert samples and low confidence on agent policy data:
$$\min_{\pi_\theta} \max_{D_\phi}\ \mathbb{E}_{(s,a) \sim \pi_e}\!\left[\log D_\phi(s,a)\right] + \mathbb{E}_{(s,a) \sim \pi_{pe}}\!\left[\log D_\phi(s,a)\right] + \mathbb{E}_{(s,a) \sim \pi_\theta}\!\left[\log\left(1 - D_\phi(s,a)\right)\right]$$
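In practice the discriminator side of this saddle-point objective is typically optimized by minimizing its negation over mini-batches; a hedged sketch, treating $D_\phi$ as a probability in $(0,1)$ and using illustrative variable names:

```python
import torch

def discriminator_objective(d_real, d_pseudo, d_agent, eps=1e-8):
    """Maximize log D on real/pseudo experts and log(1 - D) on agent samples.

    d_real, d_pseudo, d_agent: confidence scores D_phi(s, a) in (0, 1) for each source.
    Returns the negated objective so a standard optimizer can minimize it.
    """
    expert_term = (torch.log(d_real.clamp_min(eps)).mean()
                   + torch.log(d_pseudo.clamp_min(eps)).mean())
    agent_term = torch.log((1.0 - d_agent).clamp_min(eps)).mean()
    return -(expert_term + agent_term)
```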

3. Synthetic Demonstration Generation and Filtering

3.1 Reverse Diffusion Sampling

After each discriminator update, pseudo-expert samples are generated by a backward diffusion chain:
$$x^{t-1} = \mu_\phi(x^t, t) + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
yielding trajectory samples $(s, a)^0$.
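A sketch of this ancestral sampling loop under the equations above, assuming `eps_model` and schedule tensors `alpha`, `alpha_bar`, `beta` (illustrative names; suppressing noise at the final step is a common convention, not a detail specified by the paper):

```python
import torch

@torch.no_grad()
def sample_pseudo_experts(eps_model, n, dim, alpha, alpha_bar, beta):
    """Run the reverse chain x^T -> ... -> x^0 to synthesize candidate (s, a)^0 samples.

    alpha, alpha_bar, beta: 1-D tensors of length T holding the diffusion schedule.
    """
    T = beta.shape[0]
    x = torch.randn(n, dim)                                   # x^T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((n,), t, dtype=torch.long)
        eps_pred = eps_model(x, t_batch)
        mu = (x - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_pred) / alpha[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the last step
        x = mu + beta[t].sqrt() * noise                        # sigma_t = sqrt(beta^t)
    return x                                                   # candidate (s, a)^0 pairs
```

The returned candidates are then passed through the confidence filter described next.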

3.2 Dynamic Confidence-Based Filtering

Samples are admitted to the pseudo-expert buffer only if their discriminator confidence exceeds a dynamic threshold:
$$\tau = \operatorname{mean}\left\{D_\phi(s_i^{\pi_e}, a_i^{\pi_e})\right\}$$
This enforces high quality, especially as agent learning progresses.
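A minimal sketch of the dynamic filter, assuming a callable `disc` that returns $D_\phi(s, a)$ for a batch (illustrative names):

```python
import torch

@torch.no_grad()
def filter_pseudo_experts(disc, candidates, real_expert_batch):
    """Admit only candidates whose confidence exceeds the mean confidence on real experts."""
    tau = disc(real_expert_batch).mean()        # dynamic threshold from the real expert batch
    conf = disc(candidates)
    return candidates[conf > tau]               # kept samples go to the pseudo-expert buffer
```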

4. Prioritized Expert Demonstration Replay (PEDR)

PEDR enhances sample efficiency and diversity by prioritizing expert samples (real and pseudo) according to their information content, as measured by discriminator uncertainty; a code sketch follows the equations below.

  • For each (pseudo-)expert sample $i$, define the error $\delta_i = 1 - D_\phi(s_i, a_i)$ and priority $p_i = |\delta_i|$.
  • Sampling probability:

$$P(i) = \frac{p_i^\zeta}{\sum_k p_k^\zeta}, \qquad \zeta > 0$$

  • Importance weighting:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\eta, \qquad \eta \text{ annealed from } 0.4 \text{ to } 1.0$$

  • Discriminator loss with PEDR:

$$\mathcal{L}(\phi) = \sum_{i \in \pi_e \cup \pi_{pe}} w_i\, \mathcal{L}_{\mathrm{BCE}}\!\left(-\log D_\phi(s_i, a_i)\right) + \mathbb{E}_{(s,a)\sim\pi_\theta}\!\left[\mathcal{L}_{\mathrm{BCE}}\!\left(-\log\left(1-D_\phi(s,a)\right)\right)\right]$$
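A sketch of the prioritized sampling and importance weighting over a buffer of (pseudo-)expert samples; the small priority offset and max-normalization of the weights are common prioritized-replay conventions rather than details given in the paper:

```python
import torch

def pedr_sample(confidences, batch_size, zeta=0.7, eta=0.4):
    """Sample expert/pseudo-expert indices with priority p_i = |1 - D_phi(s_i, a_i)|.

    confidences: D_phi values for all buffered (pseudo-)expert samples, shape (N,)
    Returns (indices, importance weights) for the discriminator update.
    """
    N = confidences.shape[0]
    priorities = (1.0 - confidences).abs() + 1e-6              # p_i = |delta_i|; offset avoids zeros
    probs = priorities.pow(zeta)
    probs = probs / probs.sum()                                # P(i) = p_i^zeta / sum_k p_k^zeta
    idx = torch.multinomial(probs, batch_size, replacement=True)
    weights = (1.0 / (N * probs[idx])).pow(eta)                # w_i = (1 / (N * P(i)))^eta
    weights = weights / weights.max()                          # common normalization for stability
    return idx, weights
```

The returned weights multiply the per-sample BCE terms in $\mathcal{L}(\phi)$ above.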

5. Algorithm Details and Implementation

  • Diffusion steps: $T = 10$ with a linear $\beta^t$ schedule
  • Mini-batch: pseudo-to-real expert ratio $7{:}1$ ($k = 64$)
  • Replay buffers: per-trajectory for real experts; pooled for pseudo-experts
  • Networks (a policy sketch follows this list):
    • Policy $\pi_\theta$: MLP (256 units × 2 layers, ReLU, Gaussian action heads)
    • Discriminator $D_\phi$: UNet-style encoder + MLP binary classifier
    • Diffusion noise predictor $\epsilon_\phi$: shares the UNet backbone
  • Optimizer: Adam, learning rate $1 \times 10^{-4}$
  • PEDR parameters: $\zeta = 0.7$, $\eta$ annealed $0.4 \to 1.0$
  • Hardware: 3× NVIDIA RTX A6000 GPUs
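A sketch of the listed policy configuration (256 units × 2 layers, ReLU, Gaussian action heads); the UNet-style discriminator and noise predictor are omitted for brevity, and all names are illustrative:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """SAC-style actor: 2-layer, 256-unit MLP with mean and log-std heads."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.trunk(state)
        mu = self.mu(h)
        log_std = self.log_std(h).clamp(-20.0, 2.0)   # standard SAC clamp range (assumption)
        return mu, log_std
```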

Pseudo-code (condensed; a PyTorch-style sketch follows the list):

  1. Collect agent transitions via $\pi_\theta$
  2. Sample $k$ real + $7k$ pseudo-expert samples via PEDR
  3. Compute diffusion and discriminator losses
  4. Update $\phi$ via gradients
  5. Update PEDR priorities, sample new pseudo-experts, filter by confidence, add to buffer
  6. Compute rewards, update policy via SAC
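Assuming helper objects matching the modules above (all names are illustrative, not taken from the released implementation), one iteration of this loop might be organized as:

```python
def sd2ail_update(agent, disc_opt, disc_loss_fn, sample_pedr, generate_and_filter,
                  agent_buffer, expert_buffers, k=64, ratio=7):
    """One SD2AIL iteration, mirroring steps 1-6 above (illustrative helpers)."""
    # 1. Collect agent transitions with the current policy pi_theta
    agent_batch = agent_buffer.sample()

    # 2. Sample k real + ratio*k pseudo-expert samples via PEDR
    real, pseudo, weights = sample_pedr(expert_buffers, k=k, ratio=ratio)

    # 3-4. Compute the diffusion + discriminator loss and update phi
    loss = disc_loss_fn(real, pseudo, agent_batch, weights)
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

    # 5. Refresh priorities, then synthesize and filter new pseudo-experts
    expert_buffers.update_priorities()
    expert_buffers.add_pseudo(generate_and_filter())

    # 6. Relabel rewards with R_phi and take a SAC policy/critic step
    agent.update(agent_batch)
```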

6. Empirical Results and Analysis

Experiments were conducted on four MuJoCo continuous control tasks (Ant, Hopper, Walker2d, HalfCheetah) with expert datasets of 40 trajectories × 1000 steps, considering low-data settings with 1, 4, or 16 expert trajectories. SD2AIL outperformed baselines such as BC, GAIL, DiffAIL, DRAIL, and SMILING, particularly in 1-trajectory regimes:

| Task | Expert | DiffAIL | DRAIL | SMILING | SD2AIL |
| --- | --- | --- | --- | --- | --- |
| Ant | 4228 | 4901 | 5032 | 4785 | 5345 |
| Hopper | 3402 | 3275 | 3189 | 3301 | 3441 |
| Walker2d | 5620 | 5250 | 5345 | 5180 | 5743 |
| HalfCheetah | 4663 | 5600 | 5720 | 5501 | 5885 |

Ablations showed optimal performance at $T=10$ diffusion steps, with the Fréchet distance between pseudo-expert and real expert features reduced to 85.4 over training (compared to 304.7 for a random policy). The surrogate reward's correlation with the true reward reached 93.0%, 90.1%, 92.3%, and 85.2% for SD2AIL across the four tasks, exceeding DiffAIL. Component ablations on Walker2d established that both pseudo-expert generation and PEDR contribute to final performance (peak return: pseudo-expert only 4557, PEDR only 4907, combined SD2AIL 5743).

7. Discussion, Limitations, and Reproducibility

The principal insight is that diffusion models generate high-diversity, high-fidelity expert-like trajectories, thus addressing the limited support of small real expert datasets and giving the discriminator a more accurate reward boundary. PEDR further prioritizes difficult or uncertain samples, enhancing data efficiency. Limitations include increased wall-clock time due to diffusion sampling; the method remains amenable to acceleration with more efficient diffusion samplers. Empirical results are based on simulated environments; extension to real-world robotics and broader datasets remains open.

SD2AIL is fully reproducible: source code (PyTorch ≥1.10) is provided at https://github.com/positron-lpc/SD2AIL, compatible with MuJoCo and D4RL expert datasets. Standard training is invoked as python train_sd2ail.py --env Hopper --num_traj 1 --T 10 (Li et al., 21 Dec 2025).
