SD2AIL: Synthetic Demonstrations for AIL
- SD2AIL is an adversarial imitation learning framework that employs diffusion models to synthesize expert-like trajectories for robust reward inference and policy optimization.
- The framework integrates pseudo-expert generation with prioritized expert demonstration replay, effectively augmenting scarce expert datasets and improving sample efficiency.
- Empirical results on MuJoCo tasks demonstrate that SD2AIL outperforms baselines, achieving higher stability and performance even in challenging low-data regimes.
SD2AIL (“Synthetic Demonstrations to Adversarial Imitation Learning”) is an adversarial imitation learning (AIL) framework that leverages diffusion models to generate synthetic, expert-like demonstrations for reward inference and policy optimization. SD2AIL addresses the challenge of limited expert trajectory data by augmenting small expert datasets with high-quality synthetic samples (pseudo-experts) generated via a conditional denoising diffusion probabilistic model, thus improving AIL performance and stability even in low-data regimes. This methodology is integrated into a discriminator’s learning process and further facilitated by a prioritized expert demonstration replay (PEDR) strategy, enabling scalable and robust imitation learning from sparse demonstrations (Li et al., 21 Dec 2025).
1. Background and Motivation
AIL achieves policy learning by training a discriminator $D_\phi$ to distinguish expert trajectories from agent-generated (policy) trajectories, while the generator policy seeks to fool the discriminator, as in Generative Adversarial Imitation Learning (GAIL). AIL methods typically require many high-quality expert trajectories for reliable reward inference and stable agent training. However, expert data acquisition is often costly in practical settings.
Previous works have introduced diffusion models in AIL for denoising representation learning or loss refinement (notably DiffAIL and DRAIL) but have not utilized the generative capacity of diffusion models to synthesize new expert-like trajectories for direct augmentation of the expert dataset. SD2AIL introduces diffusion-based data synthesis as a core primitive to address expert data scarcity, enabling more effective and sample-efficient adversarial imitation learning.
2. Model Structure and Training Objectives
The SD2AIL algorithm comprises three central modules: (1) a diffusion-enhanced discriminator $D_\phi$, (2) an agent policy $\pi_\theta$ learned with Soft Actor-Critic (SAC), and (3) replay buffers for real-expert and pseudo-expert samples.
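A minimal structural sketch of how these three modules could be wired together, assuming illustrative names and buffer capacities that are not taken from the paper:

```python
# Illustrative skeleton of the three SD2AIL modules (not the authors' code).
from dataclasses import dataclass, field
from collections import deque
from typing import Deque, Tuple

import torch
import torch.nn as nn


@dataclass
class SD2AILComponents:
    discriminator: nn.Module                  # diffusion-enhanced discriminator D_phi
    policy: nn.Module                         # SAC actor pi_theta
    expert_buffer: Deque[Tuple[torch.Tensor, torch.Tensor]] = field(
        default_factory=lambda: deque(maxlen=40_000))   # real expert (s, a) pairs
    pseudo_buffer: Deque[Tuple[torch.Tensor, torch.Tensor]] = field(
        default_factory=lambda: deque(maxlen=200_000))  # filtered synthetic (s, a) pairs
```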
2.1 Notation
- $\mathcal{S}$, $\mathcal{A}$: state and action spaces
- $\pi_E$, $\mathcal{D}_E$: real expert policy / dataset
- $\tilde{\pi}_E$: pseudo-expert policy (diffusion-generated, filtered)
- $\pi_\theta$: agent policy with parameters $\theta$
- $D_\phi(s,a)$: discriminator output for input $(s,a)$, with parameters $\phi$
- $T$: total diffusion steps
- $\beta_t$, $\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{i \le t} \alpha_i$: diffusion variances and cumulative products
- $\epsilon_\phi$: neural network predicting diffusion noise
- $\tau$: confidence threshold for pseudo-expert filtering
- $k$: mini-batch size of real expert samples
- $\rho$: pseudo:real sample ratio
2.2 Diffusion Model Loss
The forward diffusion process adds Gaussian noise to an expert state–action pair $x_0 = (s, a)$ at each step:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right).$$
The reverse process is parameterized as:
$$p_\phi(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\phi(x_t, t),\ \sigma_t^2 \mathbf{I}\right).$$
The loss for diffusion training is the standard noise-prediction objective:
$$\mathcal{L}_{\text{diff}}(s,a) = \mathbb{E}_{t,\ \epsilon \sim \mathcal{N}(0,\mathbf{I})}\!\left[\big\|\epsilon - \epsilon_\phi\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\right].$$
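A minimal sketch of this noise-prediction loss over concatenated $(s, a)$ vectors, assuming a linear $\beta_t$ schedule and an arbitrary noise network `eps_net(x_t, t)`; the schedule endpoints are illustrative, not the paper's values:

```python
# Standard DDPM noise-prediction loss on x_0 = concat(s, a).
import torch
import torch.nn.functional as F

T = 10                                           # total diffusion steps
betas = torch.linspace(1e-4, 0.2, T)             # illustrative linear variance schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # cumulative products \bar{alpha}_t


def diffusion_loss(eps_net, s, a):
    """Per-batch DDPM loss: predict the noise injected into x_0 = (s, a)."""
    x0 = torch.cat([s, a], dim=-1)               # (B, s_dim + a_dim)
    t = torch.randint(0, T, (x0.shape[0],))      # random diffusion step per sample
    eps = torch.randn_like(x0)                   # target noise
    ab = alpha_bars[t].unsqueeze(-1)             # (B, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    eps_pred = eps_net(x_t, t)                   # network predicts the injected noise
    return F.mse_loss(eps_pred, eps)
```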
2.3 Diffusion-Enhanced Discriminator
The discriminator integrates the diffusion loss as a confidence score,
$$D_\phi(s,a) = \exp\!\left(-\mathcal{L}_{\text{diff}}(s,a)\right) \in (0,1],$$
so that pairs the model denoises accurately (expert-like pairs) receive confidence close to 1. The surrogate reward for reinforcement learning is
$$r(s,a) = -\log\!\left(1 - D_\phi(s,a)\right).$$
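A short sketch of this mapping, assuming the $\exp(-\mathcal{L}_{\text{diff}})$ confidence and $-\log(1 - D)$ reward forms given above (reconstructed; the paper's exact expressions may differ):

```python
# Turn a per-sample diffusion loss into a confidence score and surrogate reward.
import torch


def per_sample_diff_loss(eps_net, x0, t, eps, alpha_bars):
    """Un-reduced diffusion loss per sample, shape (B,)."""
    ab = alpha_bars[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return ((eps_net(x_t, t) - eps) ** 2).mean(dim=-1)


def confidence_and_reward(diff_loss, eps=1e-6):
    d = torch.exp(-diff_loss)                             # D_phi(s, a) in (0, 1]
    reward = -torch.log(torch.clamp(1.0 - d, min=eps))    # surrogate RL reward
    return d, reward
```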
2.4 Adversarial Objective
The discriminator is trained to output high confidence on both real and pseudo experts and low confidence on agent policy data:
$$\max_\phi\ \mathbb{E}_{(s,a)\sim \pi_E \cup \tilde{\pi}_E}\!\left[\log D_\phi(s,a)\right] + \mathbb{E}_{(s,a)\sim \pi_\theta}\!\left[\log\!\left(1 - D_\phi(s,a)\right)\right].$$
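A minimal sketch of the corresponding discriminator loss, assuming `disc(s, a)` returns a confidence in $(0, 1]$ and that real and pseudo experts share the "expert" label:

```python
# GAN-style discriminator loss with real + pseudo experts vs. agent data.
import torch


def discriminator_loss(disc, expert_sa, pseudo_sa, agent_sa, eps=1e-6):
    d_exp = disc(*expert_sa)                     # real expert (s, a) batch
    d_pse = disc(*pseudo_sa)                     # synthetic expert (s, a) batch
    d_agt = disc(*agent_sa)                      # agent policy (s, a) batch
    loss_expert = -torch.log(d_exp + eps).mean() - torch.log(d_pse + eps).mean()
    loss_agent = -torch.log(1.0 - d_agt + eps).mean()
    return loss_expert + loss_agent              # minimizing this maximizes the objective
```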
3. Synthetic Demonstration Generation and Filtering
3.1 Reverse Diffusion Sampling
After each discriminator update, pseudo-expert samples are generated by running the reverse diffusion chain from Gaussian noise $x_T \sim \mathcal{N}(0,\mathbf{I})$:
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\phi(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0,\mathbf{I}),$$
yielding synthetic trajectory samples $(\hat{s}, \hat{a}) = x_0$.
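A sketch of ancestral DDPM sampling to synthesize pseudo-expert $(s, a)$ pairs, reusing the $\beta_t / \alpha_t$ schedule defined earlier; the choice $\sigma_t = \sqrt{\beta_t}$ and the dimensions are illustrative:

```python
# Reverse (ancestral) diffusion sampling of synthetic (s, a) candidates.
import torch


@torch.no_grad()
def sample_pseudo_experts(eps_net, n, x_dim, betas, alphas, alpha_bars):
    x = torch.randn(n, x_dim)                                # x_T ~ N(0, I)
    T = len(betas)
    for t in reversed(range(T)):
        t_vec = torch.full((n,), t, dtype=torch.long)
        coef = (1.0 - alphas[t]) / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_net(x, t_vec)) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                   # sigma_t = sqrt(beta_t)
    return x                                                  # rows are (s, a) candidates
```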
3.2 Dynamic Confidence-Based Filtering
Samples are admitted to the pseudo-expert buffer only if their discriminator confidence exceeds a dynamic threshold, $D_\phi(\hat{s}, \hat{a}) \geq \tau$, where $\tau$ is adjusted over the course of training. This enforces high sample quality, especially as agent learning progresses.
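A small sketch of this filter, assuming candidates are rows of concatenated $(s, a)$ vectors; the threshold value and its schedule are illustrative, since the paper only states that $\tau$ is dynamic:

```python
# Confidence-based filtering of synthetic samples before buffer admission.
import torch


def filter_and_store(confidence_fn, candidates, pseudo_buffer, tau, s_dim):
    """Admit synthetic (s, a) rows whose confidence is at least tau."""
    conf = confidence_fn(candidates)                        # (N,) scores in (0, 1]
    accepted = candidates[conf >= tau]
    for row in accepted:
        pseudo_buffer.append((row[:s_dim], row[s_dim:]))    # split back into (s, a)
    return accepted.shape[0]                                # number admitted this round
```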
4. Prioritized Expert Demonstration Replay (PEDR)
PEDR enhances sample efficiency and diversity by prioritizing expert samples (real and pseudo) by their information content as measured by discriminator uncertainty.
- For each (pseudo-)expert sample $i$, define the error $\delta_i = 1 - D_\phi(s_i, a_i)$ and priority $p_i = |\delta_i| + \epsilon$ for a small constant $\epsilon > 0$.
- Sampling probability: $P(i) = p_i^{\alpha} / \sum_j p_j^{\alpha}$
- Importance weighting: $w_i = \big(N \cdot P(i)\big)^{-\beta}$, normalized by $\max_j w_j$
- Discriminator loss with PEDR: $\mathcal{L}_D = \sum_i w_i\,\ell_i(\phi)$, where $\ell_i$ is the per-sample discriminator loss (a minimal buffer sketch follows this list)
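A minimal sketch of the PEDR bookkeeping, assuming a standard proportional prioritized-replay formulation with $\alpha$ and $\beta$ exponents; the exact priority definition used in the paper may differ:

```python
# Proportional prioritized replay over (pseudo-)expert samples.
import numpy as np


class PEDRBuffer:
    def __init__(self, alpha=0.6, eps=1e-3):
        self.alpha, self.eps = alpha, eps
        self.samples, self.priorities = [], []

    def add(self, sa, priority=1.0):
        self.samples.append(sa)
        self.priorities.append(priority)

    def sample(self, batch_size, beta=0.4):
        p = np.asarray(self.priorities) ** self.alpha
        probs = p / p.sum()                           # P(i) = p_i^alpha / sum_j p_j^alpha
        idx = np.random.choice(len(self.samples), batch_size, p=probs)
        weights = (len(self.samples) * probs[idx]) ** (-beta)
        weights /= weights.max()                      # normalized importance weights
        return idx, [self.samples[i] for i in idx], weights

    def update_priorities(self, idx, errors):
        for i, e in zip(idx, errors):
            self.priorities[i] = abs(float(e)) + self.eps
```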
5. Algorithm Details and Implementation
- Diffusion steps: $T = 10$ with a linear $\beta_t$ scheduler
- Mini-batch: pseudo:real expert ratio $7:1$ ($\rho = 7$)
- Replay buffers: per-trajectory for experts; pooled for pseudo-experts
- Networks:
- Policy $\pi_\theta$: MLP (256 units × 2 layers, ReLU, Gaussian action heads)
- Discriminator $D_\phi$: UNet-style encoder + MLP binary classifier
- Diffusion noise predictor $\epsilon_\phi$: shares the UNet backbone with the discriminator (a minimal policy-network sketch follows this list)
- Optimizer: Adam, learning rate
- PEDR parameters: priority exponent $\alpha$ and importance-sampling exponent $\beta$, with $\beta$ annealed during training
- Hardware: 3× NVIDIA RTX A6000 GPUs
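A minimal network sketch matching the stated policy size (256-unit, 2-layer MLP with a Gaussian action head); the UNet-style discriminator encoder is omitted here for brevity:

```python
# Illustrative SAC-style Gaussian policy matching the stated MLP dimensions.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(s_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, a_dim)           # mean action head
        self.log_std = nn.Linear(hidden, a_dim)      # log-std action head

    def forward(self, s):
        h = self.body(s)
        return self.mu(h), self.log_std(h).clamp(-20, 2)
```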
Pseudo-code (condensed; a runnable sketch follows the list):
- Collect agent transitions via $\pi_\theta$
- Sample $k$ real + $7k$ pseudo-expert samples via PEDR
- Compute diffusion and discriminator losses
- Update $D_\phi$ (and the shared noise predictor $\epsilon_\phi$) via gradients of these losses
- Update PEDR priorities, sample new pseudo-experts, filter by confidence, add to buffer
- Compute rewards, update policy via SAC
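A condensed training-iteration sketch mirroring this pseudo-code; all helper callables (environment rollout, SAC update, samplers, losses) are assumed to exist with these illustrative signatures and are not the authors' script:

```python
# One SD2AIL iteration, orchestrating hypothetical helpers from the sketches above.
import torch


def sd2ail_iteration(policy, disc_opt, diffusion_loss, discriminator_loss,
                     collect_transitions, pedr_sample, sample_pseudo,
                     filter_and_store, sac_update, k=32, ratio=7, tau=0.8):
    # 1. Collect agent transitions with the current policy.
    agent_sa = collect_transitions(policy)

    # 2. Draw k real and ratio*k pseudo-expert samples via PEDR.
    idx, expert_sa, weights = pedr_sample(k + ratio * k)

    # 3. Combined diffusion + adversarial discriminator loss, weighted by PEDR.
    loss = diffusion_loss(expert_sa) + discriminator_loss(
        expert_sa, agent_sa, torch.as_tensor(weights))
    disc_opt.zero_grad()
    loss.backward()
    disc_opt.step()

    # 4. Synthesize new pseudo-experts and admit only confident ones.
    candidates = sample_pseudo()
    filter_and_store(candidates, tau)

    # 5. Relabel rewards with the discriminator and update the SAC agent.
    sac_update(agent_sa)
```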
6. Empirical Results and Analysis
Experiments were conducted on four MuJoCo continuous control tasks (Ant, Hopper, Walker2d, HalfCheetah) with expert datasets of 40 trajectories × 1000 steps, considering low-data settings with 1, 4, or 16 expert trajectories. SD2AIL outperformed baselines such as BC, GAIL, DiffAIL, DRAIL, and SMILING, particularly in 1-trajectory regimes; the table below reports average episodic returns:
| Task | Expert | DiffAIL | DRAIL | SMILING | SD2AIL |
|---|---|---|---|---|---|
| Ant | 4228 | 4901 | 5032 | 4785 | 5345 |
| Hopper | 3402 | 3275 | 3189 | 3301 | 3441 |
| Walker2d | 5620 | 5250 | 5345 | 5180 | 5743 |
| HalfCheetah | 4663 | 5600 | 5720 | 5501 | 5885 |
Ablations showed optimal performance at $T = 10$ diffusion steps, with the Fréchet Distance between pseudo- and real-expert features reduced to 85.4 (compared to 304.7 for a random policy) over training. Surrogate reward correlation with the true reward reached 93.0%, 90.1%, 92.3%, and 85.2% for SD2AIL across the four tasks (exceeding DiffAIL). Component ablations on Walker2d established that both pseudo-expert generation and PEDR contribute to final performance (peak return: pseudo-expert only 4557, PEDR only 4907, combined SD2AIL 5743).
7. Discussion, Limitations, and Reproducibility
The principal insight is that diffusion models generate high-diversity, high-fidelity expert-like trajectories, thus addressing the limited support of small real expert datasets and giving the discriminator a more accurate reward boundary. PEDR further prioritizes difficult or uncertain samples, enhancing data efficiency. Limitations include increased wall-clock time due to diffusion sampling; the method remains amenable to acceleration with more efficient diffusion samplers. Empirical results are based on simulated environments; extension to real-world robotics and broader datasets remains open.
SD2AIL is fully reproducible: source code (PyTorch ≥1.10) is provided at https://github.com/positron-lpc/SD2AIL, compatible with MuJoCo and D4RL expert datasets. Standard training is invoked as `python train_sd2ail.py --env Hopper --num_traj 1 --T 10` (Li et al., 21 Dec 2025).