
Prioritized Expert Demonstration Replay (PEDR)

Updated 28 December 2025
  • PEDR is a data sampling strategy that prioritizes transitions with high discriminator error or training loss to improve policy updates.
  • It integrates importance-sampling corrections and maintains separate buffers for expert, pseudo-expert, and agent-generated data to mitigate sampling bias.
  • Empirical results show that PEDR accelerates convergence and enhances safety and efficiency in both adversarial imitation learning and RL-from-demonstrations.

Prioritized Expert Demonstration Replay (PEDR) is a class of data sampling strategies designed for reinforcement and imitation learning from expert demonstrations. PEDR selectively favors transitions with high learning utility—typically those that are most misclassified or associated with high loss—when constructing minibatches for policy or value function updates. Modern variants integrate importance-sampling (IS) corrections to mitigate bias and often maintain distinct buffers for different data types (real expert, pseudo-expert, agent-generated). PEDR has been instrumental in improving the data efficiency, stability, and final policy performance in both adversarial imitation learning and RL-from-demonstrations contexts (Li et al., 21 Dec 2025, Liu et al., 2021).

1. Core Mechanisms and Mathematical Formulation

PEDR assigns each transition $i$ in a prioritized buffer a scalar priority $p_i$, derived to reflect recent learning progress. In SD2AIL (Li et al., 21 Dec 2025), which combines adversarial imitation learning (AIL) and synthetic data generation, the priority is taken as the magnitude of the discriminator "error" signal:

$$\delta_i = 1 - D_\phi(s_i, a_i, \epsilon), \qquad p_i = |\delta_i| + \varepsilon,$$

where $D_\phi(s, a, \epsilon) \in (0,1)$ is the discriminator's confidence that transition $(s,a)$ is expert-like, and $\varepsilon > 0$ is a constant to avoid degenerate zero priorities.

Sampling probability is then proportional to the exponentiated priority:

$$P(i) = \frac{p_i^\zeta}{\sum_k p_k^\zeta},$$

where $\zeta \geq 0$ modulates the skewness of the prioritization.

To correct for the induced nonuniform sampling, IS weights are applied to each transition's loss contribution:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\eta,$$

where $N$ is the buffer size and $\eta \in [0,1]$ is annealed from 0 to 1 across training.
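
These three quantities are straightforward to compute in vectorized form. The following NumPy sketch (function names and default values are illustrative, not taken from either paper) implements the priority, sampling probability, and IS weight exactly as defined above:

#--- Sketch: priority, sampling probability, and IS weight (illustrative) ---
import numpy as np

def priorities(d_scores, eps=1e-6):
    # d_scores: discriminator confidences D_phi(s, a, ·) in (0, 1) for buffered transitions
    delta = 1.0 - d_scores           # discriminator "error" signal delta_i
    return np.abs(delta) + eps       # p_i = |delta_i| + eps

def sampling_probs(p, zeta=0.6):
    # P(i) proportional to p_i^zeta
    scaled = p ** zeta
    return scaled / scaled.sum()

def is_weights(P, eta):
    # w_i = (1 / (N * P(i)))^eta; many PER implementations additionally
    # normalize by max(w) for stability, which is omitted here.
    N = len(P)
    return (1.0 / (N * P)) ** eta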

In hybrid RLfD settings (Liu et al., 2021), agent-generated samples are assigned priority based on policy and value errors, while expert samples use imitation loss. Both buffers apply the same prioritization logic, with the constructed minibatch split between agent and expert buffers according to a dynamic schedule.
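
The exact error combination for agent-generated samples is implementation-specific; a generic sketch, assuming a weighted sum of the TD (critic) error and the policy loss for agent transitions and the imitation loss for expert transitions, could look like:

#--- Sketch: per-source priority signals (assumed weighting, not from the paper) ---
def agent_priority(td_error, policy_loss, lam=1.0, eps=1e-6):
    # agent-generated transition: mix critic and policy errors
    return abs(td_error) + lam * abs(policy_loss) + eps

def expert_priority(imitation_loss, eps=1e-6):
    # expert transition: prioritize by the imitation (e.g., behavior-cloning) loss
    return abs(imitation_loss) + eps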

2. Algorithm Implementation and Workflow

A typical PEDR loop consists of the following steps:

  1. Priority Computation: After each training step, update each transition’s priority based on discriminator error (AIL) or a combination of policy and critic losses (RLfD).
  2. Prioritized Sampling: For each replay buffer (expert, pseudo-expert, agent), sample minibatch elements with probability proportional to $p_i^\zeta$.
  3. IS Weight Calculation: For each drawn element, compute IS weights based on current sampling probabilities and the IS exponent parameter.
  4. Loss Weighting and Model Update: Multiply each loss term by the corresponding IS weight for backpropagation.
  5. Buffer Maintenance: After model updates, recompute and store updated priorities.

In SD2AIL, buffers are maintained separately for each real expert trajectory to prevent any single trajectory from dominating, and a shared buffer is used for synthetic (diffusion-generated) pseudo-expert data (Li et al., 21 Dec 2025). In RLfD settings such as (Liu et al., 2021), the agent and expert buffers are mixed adaptively based on agent progress relative to expert performance.

#--- PEDR: one discriminator-training step ---
Input: R_e        # list of per-trajectory real-expert buffers
       R_pe       # pseudo-expert buffer
       ζ, η       # priority / IS exponents
       k_e, k_pe  # batch sizes

# Prioritized sampling from each buffer
batch_e = []
for t in 1…N_traj:
    batch_e += SamplePrioritized(R_e[t], k_e / N_traj, ζ)

batch_pe = SamplePrioritized(R_pe, k_pe, ζ)

# Recompute priorities for the drawn transitions
for i in batch_e ∪ batch_pe:
    δ_i = 1 − D_φ(s_i, a_i, ε)
    p_i = |δ_i| + ε

# Sampling probabilities and IS weights
for i in batch_e ∪ batch_pe:
    P_i = p_i^ζ / Σ_k p_k^ζ
    w_i = ((1 / N_total) · (1 / P_i))^η

# Weighted discriminator loss and gradient update
L = Σ_i w_i · BCE(D_φ(s_i, a_i, ε), y_i)   # y_i: discriminator target label for sample i
φ ← φ − α_φ ∇_φ L
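
The SamplePrioritized routine above can be backed by a simple proportional sampler. The following minimal Python sketch (an illustration assuming in-memory lists and linear-time sampling, rather than the sum-tree structures typical of large-scale PER) shows one way to implement it:

#--- Sketch: minimal proportional prioritized buffer (illustrative) ---
import numpy as np

class PrioritizedBuffer:
    def __init__(self):
        self.transitions = []   # stored (s, a, ...) tuples
        self.priorities = []    # one priority p_i per stored transition

    def add(self, transition, priority):
        self.transitions.append(transition)
        self.priorities.append(priority)

    def sample(self, k, zeta=0.6):
        # draw k indices with probability proportional to p_i^zeta
        p = np.asarray(self.priorities) ** zeta
        probs = p / p.sum()
        idx = np.random.choice(len(probs), size=int(k), p=probs)
        return idx, [self.transitions[i] for i in idx], probs[idx]

    def update_priorities(self, idx, new_priorities):
        # store refreshed priorities after the model update (buffer maintenance step)
        for i, p in zip(idx, new_priorities):
            self.priorities[i] = p

Instantiating one such buffer per real expert trajectory, plus a shared one for pseudo-expert data, reproduces the R_e / R_pe layout described above.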

3. Integration into Adversarial Imitation and RL Frameworks

PEDR is tightly coupled to training loops in both AIL and RLfD architectures. In SD2AIL (Li et al., 21 Dec 2025), the main loop integrates PEDR within discriminator training as follows:

  • Agent generates on-policy rollouts.
  • PEDR samples real and synthetic expert transitions for the discriminator update.
  • Diffusion-generated pseudo-expert transitions are periodically inserted to augment the synthetic buffer.
  • Surrogate rewards derived from the discriminator are used to update the policy via soft actor-critic (SAC); one common form of such a reward is sketched after this list.
  • PEDR’s prioritization ensures that samples making the largest discriminator errors are revisited most frequently, refining the reward estimator.
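
Neither paper's exact surrogate-reward expression is reproduced in this summary; purely as an illustration, a common AIL choice when the discriminator outputs the probability that a transition is expert-like is $r(s,a) = -\log(1 - D_\phi(s,a))$:

#--- Sketch: a common AIL surrogate reward (assumed form, for illustration) ---
import numpy as np

def surrogate_reward(d_score, eps=1e-8):
    # d_score: discriminator confidence that (s, a) is expert-like, in (0, 1)
    # the reward grows as the transition looks more expert-like
    return -np.log(1.0 - d_score + eps)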

In RLfD for urban autonomous driving (Liu et al., 2021), PEDR manages the sampling from both agent and expert buffers, adaptively adjusting the proportion using performance-based criteria. The prioritized sampling focuses policy and value loss computation on salient transitions in both buffers with IS correction applied throughout.
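
The performance-based criterion is not spelled out in this summary; one plausible schedule, assuming the agent-buffer share grows from its initial value of 0.3 toward a cap as the agent's recent return approaches the expert's, might look like the following (both the direction and the functional form are assumptions):

#--- Sketch: adaptive agent/expert minibatch split (assumed schedule) ---
def agent_fraction(agent_return, expert_return, rho_init=0.3, rho_max=0.9):
    # fraction of the minibatch drawn from the agent buffer; the rest is expert data
    progress = max(0.0, min(1.0, agent_return / max(expert_return, 1e-8)))
    return rho_init + (rho_max - rho_init) * progress

def split_batch(batch_size, agent_return, expert_return):
    n_agent = int(round(agent_fraction(agent_return, expert_return) * batch_size))
    return n_agent, batch_size - n_agent   # (agent samples, expert samples)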

4. Practical Considerations and Hyperparameter Selection

PEDR exposes several hyperparameters critical to its efficacy. Representative settings are as follows:

| Parameter | Description | Typical Value |
|---|---|---|
| $\zeta$ / $\omega$ | Priority exponent (sampling skew) | 0.5–1.0 (Li et al., 21 Dec 2025); 0.6 (Liu et al., 2021) |
| $\eta$ / $\beta$ | IS exponent (bias correction) | Annealed 0 → 1 |
| $\varepsilon$ | Small priority constant | $10^{-6}$ |
| $k_{pe} : k_e$ | Synthetic-to-real expert batch ratio | 7:1, tunable (Li et al., 21 Dec 2025) |
| $N_{traj}$ | Number of real expert buffers | Matches the number of expert trajectories |
| $\rho$ | Agent-to-expert sampling ratio | Adaptive; init 0.3 (Liu et al., 2021) |

Hyperparameter annealing (especially for $\eta$/$\beta$) is implemented over early training to reduce initial variance and correct for sampling bias as the system converges.
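
A linear schedule is the simplest way to realize this; the sketch below (an assumed form, consistent with the 0 → 1 range in the table) interpolates the IS exponent over a fixed number of training steps:

#--- Sketch: linear annealing of the IS exponent η (assumed schedule) ---
def is_exponent(step, anneal_steps=100_000, eta_start=0.0, eta_end=1.0):
    # increase η linearly from eta_start to eta_end over anneal_steps, then hold
    frac = min(1.0, step / anneal_steps)
    return eta_start + (eta_end - eta_start) * frac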

5. Empirical Impact and Ablation Results

Ablation studies and empirical evaluations demonstrate the independent contribution of PEDR to policy performance, convergence speed, and robustness.

  • In SD2AIL’s Walker task (Li et al., 21 Dec 2025):
    • Uniform replay over real + pseudo data yields peak return ≈ 4557.
    • PEDR with only real expert data (no diffusion demos) yields ≈ 4907.
    • Full SD2AIL (PEDR + synthetic) achieves ≈ 5743 compared to a DiffAIL baseline at ≈ 4040.
    • PEDR accelerates convergence by 30–50K frames over uniform replay.
  • In urban driving with RLfD (Liu et al., 2021):
    • PEDR achieves a 90% success rate, an 8% collision rate, and an average return of 1205, outperforming DQfD and SAC baselines on safety and efficiency metrics.
    • PEDR reaches its performance threshold in 40K steps versus 80K for SAC.

This suggests that prioritized mixing of expert and agent data, especially with synthetic augmentation, yields substantial gains in sample efficiency, safety, and final performance over pure RL, pure IL, and simpler RLfD strategies.

6. Extensions and Research Context

PEDR extends Prioritized Experience Replay (PER) to the domain of expert demonstration learning, addressing the limitations posed by small expert datasets and the integration of synthetic demonstrations.

  • In (Li et al., 21 Dec 2025), PEDR is adapted for large pools of both real and synthetic demonstrations, with discriminator-based priority signals and IS correction for off-policy gradient updates embedded in a diffusion-enhanced AIL system.
  • In (Liu et al., 2021), PEDR is synchronized with SAC networks for continuous control, using distinct priority heuristics for self-generated versus expert transitions, and dynamically evolving sampling ratios based on agent-expert performance differentials.

A plausible implication is that such strategies may generalize to other domains that require balanced integration of heterogeneous replay sources, or to settings where judicious exploitation of high-value expert transitions accelerates learning.

PEDR remains an area of active research with implications for imitation learning, safe exploration, sample efficiency, and leveraging generative models for synthetic expert data.
