
SD2AIL: Synthetic Demonstrations for AIL

Updated 28 December 2025
  • SD2AIL is an adversarial imitation learning framework that employs diffusion models to synthesize expert-like trajectories for robust reward inference and policy optimization.
  • The framework integrates pseudo-expert generation with prioritized expert demonstration replay, effectively augmenting scarce expert datasets and improving sample efficiency.
  • Empirical results on MuJoCo tasks demonstrate that SD2AIL outperforms baselines, achieving higher stability and performance even in challenging low-data regimes.

SD2AIL (“Synthetic Demonstrations to Adversarial Imitation Learning”) is an adversarial imitation learning (AIL) framework that leverages diffusion models to generate synthetic, expert-like demonstrations for reward inference and policy optimization. SD2AIL addresses the challenge of limited expert trajectory data by augmenting small expert datasets with high-quality synthetic samples (pseudo-experts) generated via a conditional denoising diffusion probabilistic model, thus improving AIL performance and stability even in low-data regimes. This methodology is integrated into a discriminator’s learning process and further facilitated by a prioritized expert demonstration replay (PEDR) strategy, enabling scalable and robust imitation learning from sparse demonstrations (Li et al., 21 Dec 2025).

1. Background and Motivation

AIL achieves policy learning by training a discriminator, $D$, to distinguish between expert and agent-generated (policy) trajectories, while the generator policy $\pi_\theta$ seeks to fool the discriminator, as in Generative Adversarial Imitation Learning (GAIL). AIL methods typically require many high-quality expert trajectories for reliable reward inference and stable agent training. However, expert data acquisition is often costly in practical settings.

Previous works have introduced diffusion models in AIL for denoising representation learning or loss refinement (notably DiffAIL and DRAIL) but have not utilized the generative capacity of diffusion models to synthesize new expert-like trajectories for direct augmentation of the expert dataset. SD2AIL introduces diffusion-based data synthesis as a core primitive to address expert data scarcity, enabling more effective and sample-efficient adversarial imitation learning.

2. Model Structure and Training Objectives

The SD2AIL algorithm comprises three central modules: (1) a diffusion-enhanced discriminator $D_\phi$, (2) an agent policy $\pi_\theta$ learned using Soft Actor-Critic (SAC), and (3) replay buffers for real expert ($\mathcal{R}_e$) and pseudo-expert ($\mathcal{R}_{pe}$) samples.

2.1 Notation

  • $\mathcal{S}, \mathcal{A}$: state and action spaces
  • $\pi_e$: real expert policy / dataset
  • $\pi_{pe}$: pseudo-expert policy (diffusion-generated, filtered)
  • $\pi_\theta$: agent policy with parameters $\theta$
  • $D_\phi(s, a, \epsilon)$: discriminator output for input $(s, a)$ with parameters $\phi$
  • $T$: total diffusion steps
  • $\beta^t, \alpha^t, \bar\alpha^t$: diffusion variances and cumulative products
  • $\epsilon_\phi(x, t)$: neural network predicting diffusion noise
  • $\tau$: confidence threshold for pseudo-expert filtering
  • $k$: mini-batch size of real expert samples
  • $|\pi_{pe}| : |\pi_e| = 7{:}1$: pseudo-to-real sample ratio

2.2 Diffusion Model Loss

The forward diffusion adds noise at each step:
$$q(x^t \mid x^{t-1}) = \mathcal{N}\!\left(x^t;\ \sqrt{\alpha^t}\, x^{t-1},\ \beta^t I\right)$$

The reverse process is parameterized as:
$$p_\phi(x^{t-1} \mid x^t) = \mathcal{N}\!\left(x^{t-1};\ \mu_\phi(x^t, t),\ \sigma_t^2 I\right)$$
with
$$\mu_\phi(x^t, t) = \frac{1}{\sqrt{\alpha^t}} \left(x^t - \frac{\beta^t}{\sqrt{1-\bar\alpha^t}}\, \epsilon_\phi(x^t, t)\right), \qquad \sigma_t^2 = \beta^t$$

The loss for diffusion training is:
$$L_{\mathrm{diff}}(\phi) = \mathbb{E}_{t,\, x^0,\, \epsilon \sim \mathcal{N}(0, I)} \left\| \epsilon - \epsilon_\phi\!\left(\sqrt{\bar\alpha^t}\, x^0 + \sqrt{1 - \bar\alpha^t}\, \epsilon,\ t\right) \right\|^2$$
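A minimal PyTorch-style sketch of this loss, assuming a noise-prediction network `eps_model(x_t, t)` and a precomputed tensor `alpha_bar` of cumulative products $\bar\alpha^t$ (names are illustrative, not taken from the released code):

```python
import torch

def diffusion_loss(eps_model, x0, alpha_bar):
    """L_diff: predict the noise added to clean (s, a) vectors x0 at a random step t.

    eps_model : callable eps_phi(x_t, t) -> predicted noise (illustrative name)
    x0        : batch of clean state-action vectors, shape (B, d)
    alpha_bar : tensor of cumulative products bar{alpha}^t, shape (T,)
    """
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)        # uniform diffusion step per sample
    a_bar = alpha_bar[t].unsqueeze(-1)                      # (B, 1) for broadcasting
    eps = torch.randn_like(x0)                              # epsilon ~ N(0, I)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps    # forward diffusion in closed form
    eps_pred = eps_model(x_t, t)
    return ((eps - eps_pred) ** 2).sum(dim=-1).mean()       # squared-error noise-matching loss
```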

2.3 Diffusion-Enhanced Discriminator

The discriminator integrates the diffusion loss as a confidence score:
$$D_\phi(s_i, a_i) = \frac{1}{T}\sum_{t=1}^{T} \exp\!\left(-L_\phi(s_i^0, a_i^0, t)\right)$$
The surrogate reward for reinforcement learning is:
$$R_\phi(s, a) = -\log\left(1 - D_\phi(s, a)\right)$$
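A sketch of how this confidence score and surrogate reward could be computed, assuming a hypothetical helper `per_step_loss(x0, t)` that returns the per-sample diffusion loss $L_\phi(s^0, a^0, t)$ for a batch:

```python
import torch

def discriminator_confidence(per_step_loss, x0, T):
    """D_phi(s, a): average over T diffusion steps of exp(-L_phi(s^0, a^0, t))."""
    losses = torch.stack([per_step_loss(x0, t) for t in range(1, T + 1)], dim=0)  # (T, B)
    return torch.exp(-losses).mean(dim=0)                                         # (B,)

def surrogate_reward(d, eps=1e-8):
    """R_phi(s, a) = -log(1 - D_phi(s, a)); the clamp is for numerical stability."""
    return -torch.log((1.0 - d).clamp_min(eps))
```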

2.4 Adversarial Objective

The discriminator is trained to output high confidence on both real and pseudo-expert samples and low confidence on agent policy data:
$$\min_{\pi_\theta} \max_{D_\phi}\ \mathbb{E}_{(s,a) \sim \pi_e}\!\left[\log D_\phi(s,a)\right] + \mathbb{E}_{(s,a) \sim \pi_{pe}}\!\left[\log D_\phi(s,a)\right] + \mathbb{E}_{(s,a) \sim \pi_\theta}\!\left[\log\left(1 - D_\phi(s,a)\right)\right]$$
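In practice the discriminator side of this saddle-point objective is typically optimized by minimizing its negation over mini-batches; a hedged sketch, treating $D_\phi$ as a probability in $(0,1)$ and using illustrative variable names:

```python
import torch

def discriminator_objective(d_real, d_pseudo, d_agent, eps=1e-8):
    """Maximize log D on real/pseudo experts and log(1 - D) on agent samples.

    d_real, d_pseudo, d_agent: confidence scores D_phi(s, a) in (0, 1) for each source.
    Returns the negated objective so a standard optimizer can minimize it.
    """
    expert_term = (torch.log(d_real.clamp_min(eps)).mean()
                   + torch.log(d_pseudo.clamp_min(eps)).mean())
    agent_term = torch.log((1.0 - d_agent).clamp_min(eps)).mean()
    return -(expert_term + agent_term)
```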

3. Synthetic Demonstration Generation and Filtering

3.1 Reverse Diffusion Sampling

After each discriminator update, pseudo-expert samples are generated by a backward diffusion chain:
$$x^{t-1} = \mu_\phi(x^t, t) + \sigma_t\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
yielding trajectory samples $(s, a)^0$.
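A sketch of this ancestral sampling loop under the equations above, assuming `eps_model` and schedule tensors `alpha`, `alpha_bar`, `beta` (illustrative names; suppressing noise at the final step is a common convention, not a detail specified by the paper):

```python
import torch

@torch.no_grad()
def sample_pseudo_experts(eps_model, n, dim, alpha, alpha_bar, beta):
    """Run the reverse chain x^T -> ... -> x^0 to synthesize candidate (s, a)^0 samples.

    alpha, alpha_bar, beta: 1-D tensors of length T holding the diffusion schedule.
    """
    T = beta.shape[0]
    x = torch.randn(n, dim)                                   # x^T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((n,), t, dtype=torch.long)
        eps_pred = eps_model(x, t_batch)
        mu = (x - beta[t] / (1.0 - alpha_bar[t]).sqrt() * eps_pred) / alpha[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the last step
        x = mu + beta[t].sqrt() * noise                        # sigma_t = sqrt(beta^t)
    return x                                                   # candidate (s, a)^0 pairs
```

The returned candidates are then passed through the confidence filter described next.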

3.2 Dynamic Confidence-Based Filtering

Samples are admitted to the pseudo-expert buffer only if their discriminator confidence exceeds a dynamic threshold:
$$\tau = \operatorname{mean}\left\{D_\phi(s_i^{\pi_e}, a_i^{\pi_e})\right\}$$
This enforces high quality, especially as agent learning progresses.
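A minimal sketch of the dynamic filter, assuming a callable `disc` that returns $D_\phi(s, a)$ for a batch (illustrative names):

```python
import torch

@torch.no_grad()
def filter_pseudo_experts(disc, candidates, real_expert_batch):
    """Admit only candidates whose confidence exceeds the mean confidence on real experts."""
    tau = disc(real_expert_batch).mean()        # dynamic threshold from the real expert batch
    conf = disc(candidates)
    return candidates[conf > tau]               # kept samples go to the pseudo-expert buffer
```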

4. Prioritized Expert Demonstration Replay (PEDR)

PEDR enhances sample efficiency and diversity by prioritizing expert samples (real and pseudo) according to their information content, as measured by discriminator uncertainty; a code sketch follows the equations below.

  • For each (pseudo-)expert sample $i$, define the error $\delta_i = 1 - D_\phi(s_i, a_i)$ and priority $p_i = |\delta_i|$.
  • Sampling probability:

$$P(i) = \frac{p_i^\zeta}{\sum_k p_k^\zeta}, \qquad \zeta > 0$$

  • Importance weighting:

$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\eta, \qquad \eta \text{ annealed from } 0.4 \text{ to } 1.0$$

  • Discriminator loss with PEDR:

$$\mathcal{L}(\phi) = \sum_{i \in \pi_e \cup \pi_{pe}} w_i\, \mathcal{L}_{\mathrm{BCE}}\!\left(-\log D_\phi(s_i, a_i)\right) + \mathbb{E}_{(s,a)\sim\pi_\theta}\!\left[\mathcal{L}_{\mathrm{BCE}}\!\left(-\log\left(1-D_\phi(s,a)\right)\right)\right]$$
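A sketch of the prioritized sampling and importance weighting over a buffer of (pseudo-)expert samples; the small priority offset and max-normalization of the weights are common prioritized-replay conventions rather than details given in the paper:

```python
import torch

def pedr_sample(confidences, batch_size, zeta=0.7, eta=0.4):
    """Sample expert/pseudo-expert indices with priority p_i = |1 - D_phi(s_i, a_i)|.

    confidences: D_phi values for all buffered (pseudo-)expert samples, shape (N,)
    Returns (indices, importance weights) for the discriminator update.
    """
    N = confidences.shape[0]
    priorities = (1.0 - confidences).abs() + 1e-6              # p_i = |delta_i|; offset avoids zeros
    probs = priorities.pow(zeta)
    probs = probs / probs.sum()                                # P(i) = p_i^zeta / sum_k p_k^zeta
    idx = torch.multinomial(probs, batch_size, replacement=True)
    weights = (1.0 / (N * probs[idx])).pow(eta)                # w_i = (1 / (N * P(i)))^eta
    weights = weights / weights.max()                          # common normalization for stability
    return idx, weights
```

The returned weights multiply the per-sample BCE terms in $\mathcal{L}(\phi)$ above.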

5. Algorithm Details and Implementation

  • Diffusion steps: $T = 10$ with a linear $\beta^t$ schedule
  • Mini-batch: pseudo-to-real expert ratio $7{:}1$ ($k = 64$)
  • Replay buffers: per-trajectory for real experts; pooled for pseudo-experts
  • Networks (a policy sketch follows this list):
    • Policy $\pi_\theta$: MLP (256 units × 2 layers, ReLU, Gaussian action heads)
    • Discriminator $D_\phi$: UNet-style encoder + MLP binary classifier
    • Diffusion noise predictor $\epsilon_\phi$: shares the UNet backbone
  • Optimizer: Adam, learning rate $1 \times 10^{-4}$
  • PEDR parameters: $\zeta = 0.7$, $\eta$ annealed $0.4 \to 1.0$
  • Hardware: 3× NVIDIA RTX A6000 GPUs
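A sketch of the listed policy configuration (256 units × 2 layers, ReLU, Gaussian action heads); the UNet-style discriminator and noise predictor are omitted for brevity, and all names are illustrative:

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """SAC-style actor: 2-layer, 256-unit MLP with mean and log-std heads."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state):
        h = self.trunk(state)
        mu = self.mu(h)
        log_std = self.log_std(h).clamp(-20.0, 2.0)   # standard SAC clamp range (assumption)
        return mu, log_std
```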

Pseudo-code (condensed; a PyTorch-style sketch follows the list):

  1. Collect agent transitions via $\pi_\theta$
  2. Sample $k$ real + $7k$ pseudo-expert samples via PEDR
  3. Compute diffusion and discriminator losses
  4. Update $\phi$ via gradients
  5. Update PEDR priorities, sample new pseudo-experts, filter by confidence, add to buffer
  6. Compute rewards, update policy via SAC
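Assuming helper objects matching the modules above (all names are illustrative, not taken from the released implementation), one iteration of this loop might be organized as:

```python
def sd2ail_update(agent, disc_opt, disc_loss_fn, sample_pedr, generate_and_filter,
                  agent_buffer, expert_buffers, k=64, ratio=7):
    """One SD2AIL iteration, mirroring steps 1-6 above (illustrative helpers)."""
    # 1. Collect agent transitions with the current policy pi_theta
    agent_batch = agent_buffer.sample()

    # 2. Sample k real + ratio*k pseudo-expert samples via PEDR
    real, pseudo, weights = sample_pedr(expert_buffers, k=k, ratio=ratio)

    # 3-4. Compute the diffusion + discriminator loss and update phi
    loss = disc_loss_fn(real, pseudo, agent_batch, weights)
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

    # 5. Refresh priorities, then synthesize and filter new pseudo-experts
    expert_buffers.update_priorities()
    expert_buffers.add_pseudo(generate_and_filter())

    # 6. Relabel rewards with R_phi and take a SAC policy/critic step
    agent.update(agent_batch)
```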

6. Empirical Results and Analysis

Experiments were conducted on four MuJoCo continuous control tasks (Ant, Hopper, Walker2d, HalfCheetah) with expert datasets of 40 trajectories × 1000 steps, considering low-data settings with 1, 4, or 16 expert trajectories. SD2AIL outperformed baselines such as BC, GAIL, DiffAIL, DRAIL, and SMILING, particularly in 1-trajectory regimes:

| Task | Expert | DiffAIL | DRAIL | SMILING | SD2AIL |
| --- | --- | --- | --- | --- | --- |
| Ant | 4228 | 4901 | 5032 | 4785 | 5345 |
| Hopper | 3402 | 3275 | 3189 | 3301 | 3441 |
| Walker2d | 5620 | 5250 | 5345 | 5180 | 5743 |
| HalfCheetah | 4663 | 5600 | 5720 | 5501 | 5885 |

Ablations showed optimal performance at $T=10$ diffusion steps, with the Fréchet distance between pseudo-expert and real expert features reduced to 85.4 over training (compared to 304.7 for a random policy). The surrogate reward's correlation with the true reward reached 93.0%, 90.1%, 92.3%, and 85.2% for SD2AIL across the four tasks, exceeding DiffAIL. Component ablations on Walker2d established that both pseudo-expert generation and PEDR contribute to final performance (peak return: pseudo-expert only 4557, PEDR only 4907, combined SD2AIL 5743).

7. Discussion, Limitations, and Reproducibility

The principal insight is that diffusion models generate high-diversity, high-fidelity expert-like trajectories, thus addressing the limited support of small real expert datasets and giving the discriminator a more accurate reward boundary. PEDR further prioritizes difficult or uncertain samples, enhancing data efficiency. Limitations include increased wall-clock time due to diffusion sampling; the method remains amenable to acceleration with more efficient diffusion samplers. Empirical results are based on simulated environments; extension to real-world robotics and broader datasets remains open.

SD2AIL is fully reproducible: source code (PyTorch ≥1.10) is provided at https://github.com/positron-lpc/SD2AIL, compatible with MuJoCo and D4RL expert datasets. Standard training is invoked as python train_sd2ail.py --env Hopper --num_traj 1 --T 10 (Li et al., 21 Dec 2025).
