
Random Expert Distillation (RED)

Updated 30 April 2026
  • The paper demonstrates that RED replaces adversarial methods by constructing a surrogate reward from support estimation of expert state-action pairs using random network distillation.
  • RED computes a prediction error as an inverse density estimator, providing a fixed reward signal that stabilizes policy learning under various RL algorithms.
  • Empirical results show that RED achieves expert-level performance with lower variance and reduced computational overhead compared to traditional IRL and GAIL techniques.

Random Expert Distillation (RED) is a framework for imitation learning which replaces adversarial/discriminator-based objectives with support estimation of the expert policy’s state-action distribution. The central principle is to construct a reward function by quantifying the proximity of any policy’s (state, action) pairs to the support of the expert’s demonstrated behavior, estimated via random network distillation. This reward is then used with any standard reinforcement learning (RL) algorithm to recover an expert-mimicking policy. RED operates with only a finite set of expert trajectories and without access to the underlying reward signal, offering improved stability and lower computational overhead versus inverse reinforcement learning (IRL) and adversarial methods such as GAIL (Wang et al., 2019).

1. Problem Setting and Motivation

The RED framework is situated within an infinite-horizon discounted Markov decision process $(S, A, P, r, p_0, \gamma)$, where only a batch of expert trajectories $D_E = \{\tau_i\}_{i=1}^N$ is available and the reward function $r$ is unknown. The aim is to construct from $D_E$ a reward estimator $\hat{r}(s, a)$ such that maximizing discounted returns under this surrogate reward recovers policies matching the expert’s true performance.

Classical IRL approaches (e.g., MaxEnt IRL) require solving a bi-level optimization that alternates between cost parameter updates and RL inner loops, which is computationally expensive and indirect. Adversarial approaches (notably GAIL) pose imitation as policy-distribution matching via a generative adversarial network, but suffer from training instabilities such as vanishing/exploding gradients and discriminator overfitting. Instead, RED targets the expert’s support in $S \times A$, seeking high reward only for state-action pairs explained by the expert, thereby reframing imitation as support estimation followed by RL with a fixed, data-derived reward (Wang et al., 2019).

2. Theoretical Foundations: Support Estimation via Random Network Distillation

RED’s reward construction stems from a support-estimation operator implemented either as a kernel-PCA subspace projection or as random network distillation (RND).

  • Kernel-PCA formalism: Given a reproducing kernel Hilbert space mapping $\phi(x)$, the support of the expert’s policy induces a covariance $C_\pi$; the projector $P_\pi = C_\pi^+ C_\pi$ and the squared distance $\| (I - P_\pi)\phi(x) \|^2$ act as a zero-when-on-support indicator. Empirically, this is approximated via truncation of the kernel eigendecomposition, yielding a score function for novel points.
  • Random Network Distillation (RND): RED instantiates two networks:
    • A fixed, randomly initialized target $f_\theta$,
    • A predictor $f_{\hat{\theta}}$ (identical architecture), trained on $D_E$ to minimize the mean-squared error between the predictor and the target outputs.

For any $(s, a)$, the prediction error

$$\| f_{\hat{\theta}}(s, a) - f_\theta(s, a) \|_2^2$$

serves as an inverse density estimator: points frequently present in $D_E$ yield small errors; out-of-support points generate large errors (Wang et al., 2019). The final reward is given by:

$$\hat{r}(s, a) = \exp\left(-\sigma_1 \| f_{\hat{\theta}}(s, a) - f_\theta(s, a) \|_2^2\right)$$

where $\sigma_1$ is a tunable scale.
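The reward construction described here can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the one-hidden-layer tanh networks, the synthetic "expert" cluster, the plain gradient-descent loop, and every name (`init_mlp`, `fit_predictor`, `red_reward`, `sigma`) are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden, out_dim):
    """One-hidden-layer tanh MLP (a small stand-in for the networks used in practice)."""
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(in_dim), (in_dim, hidden)),
        "b1": rng.normal(0.0, 0.5, hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }

def forward(p, x):
    h = np.tanh(x @ p["W1"] + p["b1"])
    return h @ p["W2"] + p["b2"], h

def prediction_error(pred, target, x):
    """Squared L2 distance between predictor and target outputs, one value per row of x."""
    return np.sum((forward(pred, x)[0] - forward(target, x)[0]) ** 2, axis=1)

def fit_predictor(pred, target, x, steps=2000, lr=0.01):
    """Full-batch gradient descent on the MSE; the target network is never updated."""
    y, _ = forward(target, x)
    for _ in range(steps):
        out, h = forward(pred, x)
        g_out = 2.0 * (out - y) / len(x)               # gradient w.r.t. predictor outputs
        g_h = (g_out @ pred["W2"].T) * (1.0 - h ** 2)  # backprop through tanh
        pred["W2"] -= lr * h.T @ g_out
        pred["b2"] -= lr * g_out.sum(axis=0)
        pred["W1"] -= lr * x.T @ g_h
        pred["b1"] -= lr * g_h.sum(axis=0)
    return pred

def red_reward(pred, target, x, sigma=1.0):
    """exp(-sigma * prediction error); sigma is the tunable scale (value assumed here)."""
    return np.exp(-sigma * prediction_error(pred, target, x))

# Synthetic (state, action) rows: an "expert" cluster and a region the expert never visits.
expert_sa = rng.normal([1.0, -1.0, 0.5], 0.1, size=(256, 3))
off_sa = rng.normal([2.0, 1.5, -2.0], 0.1, size=(256, 3))

target = init_mlp(3, 32, 16)                                   # frozen random target
predictor = fit_predictor(init_mlp(3, 32, 16), target, expert_sa)

r_expert = red_reward(predictor, target, expert_sa)
r_off = red_reward(predictor, target, off_sa)
print("mean reward on expert data:", r_expert.mean())
print("mean reward off support:  ", r_off.mean())
```

Because the predictor only ever sees expert pairs, its error (and hence the reward) separates the two regions without any discriminator being trained.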

RED’s theoretical properties draw on established results for kernel-based support estimation: under separating kernels, the support estimator converges in Hausdorff distance as $N \to \infty$, and for RND, the regression converges toward the ideal projector as the function class becomes rich and optimally fitted.
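The kernel-PCA view of support estimation can be illustrated with an explicit feature map. The sketch below assumes the simplest case, a linear kernel where the features are the points themselves, and uses an uncentered empirical covariance; the helper names `fit_projector` and `off_support_score` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def fit_projector(features, tol=1e-8):
    """Projector onto the span of the non-negligible eigenvectors of the empirical covariance."""
    cov = features.T @ features / len(features)
    evals, evecs = np.linalg.eigh(cov)
    kept = evecs[:, evals > tol * evals.max()]   # truncate the eigendecomposition
    return kept @ kept.T

def off_support_score(projector, phi_x):
    """Squared norm of the component of phi_x outside the estimated support subspace."""
    residual = phi_x - projector @ phi_x
    return float(residual @ residual)

# "Expert" features all lie on the line y = x in the plane.
t = np.linspace(-1.0, 1.0, 50)
expert_features = np.stack([t, t], axis=1)

P = fit_projector(expert_features)
on_score = off_support_score(P, np.array([2.0, 2.0]))    # on the line: score is ~0
off_score = off_support_score(P, np.array([1.0, -1.0]))  # orthogonal to the line: score is 2
print(on_score, off_score)
```

The score vanishes on the estimated support and grows with distance from it, which is the behavior the RND prediction error is meant to approximate without an eigendecomposition.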

3. Algorithmic Formulation and Implementation

The RED procedure proceeds as follows:

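  1. Initialization:
    • Randomly initialize and freeze the target net parameters $\theta$.
    • Instantiate the predictor net $f_{\hat{\theta}}$.
  2. Predictor fitting:
    • Train $f_{\hat{\theta}}$ on $D_E$ to minimize the mean-squared error against the target outputs $f_\theta(s, a)$.
  3. Reward construction:
    • For any $(s, a)$, compute the prediction error $\| f_{\hat{\theta}}(s, a) - f_\theta(s, a) \|_2^2$ and define $\hat{r}(s, a) = \exp(-\sigma_1 \| f_{\hat{\theta}}(s, a) - f_\theta(s, a) \|_2^2)$.
    • Optionally, set a terminal penalty for far-off-support episode ends.
  4. Policy learning:
    • Apply any off-the-shelf RL algorithm (e.g., TRPO, DQN, SVG) to maximize cumulative reward under $\hat{r}$.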
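The four steps above can be traced end to end on a toy problem. The following is a sketch under strong simplifying assumptions, not the paper's setup: a five-state chain MDP invented for the example, linear target and predictor maps on one-hot (state, action) inputs instead of MLPs, and value iteration with known dynamics standing in for the RL algorithm. All names and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, out_dim = 5, 2, 8
gamma, sigma = 0.9, 1.0          # discount and reward scale (illustrative values)

def one_hot(s, a):
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

# Expert demonstrations: the expert always moves right (action 1).
expert = np.stack([one_hot(s, 1) for s in range(n_states)])

# Step 1: frozen random target and a trainable predictor (both linear here).
W_target = rng.normal(size=(out_dim, n_states * n_actions))
W_pred = np.zeros_like(W_target)

# Step 2: fit the predictor to the target on expert pairs only (gradient descent on the MSE).
for _ in range(200):
    residual = expert @ (W_pred - W_target).T              # shape (N, out_dim)
    W_pred -= (2.0 / len(expert)) * residual.T @ expert    # step size 1.0

# Step 3: the reward is computed once and then held fixed.
def red_reward(s, a):
    x = one_hot(s, a)
    err = np.sum((W_pred @ x - W_target @ x) ** 2)
    return np.exp(-sigma * err)

R = np.array([[red_reward(s, a) for a in range(n_actions)] for s in range(n_states)])

# Step 4: policy learning under the fixed reward (value iteration as a stand-in for RL).
def next_state(s, a):
    return min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    V = Q.max(axis=1)
    Q = np.array([[R[s, a] + gamma * V[next_state(s, a)] for a in range(n_actions)]
                  for s in range(n_states)])

policy = Q.argmax(axis=1)
print("reward table:\n", R.round(3))
print("greedy policy:", policy)   # moves right in every state, as the expert does
```

Note that the reward table is never revisited once step 3 finishes; only the policy changes during step 4.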

Typical architectures employ MLPs with 2–3 hidden layers of size 64–256 and ReLU activations, trained with the Adam optimizer. Once the predictor is fitted and $\hat{r}$ is fixed, no further modification occurs: reward extraction is single-pass, in contrast to the alternating updates in IRL or GAIL (Wang et al., 2019).

4. Empirical Evaluation and Benchmark Comparisons

RED demonstrates efficacy across discrete and continuous domains:

  • Toy discrete MDPs: RED converges rapidly to optimal episodic rewards from fewer expert demonstrations than GAIL and GMMIL, which display instability or overfitting.
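  • MuJoCo continuous control (Hopper, HalfCheetah, Walker2d, Reacher, Ant): Trained on expert TRPO trajectories, RED attains the highest mean return on Hopper and Reacher and remains competitive on Walker2d and Ant, though it trails the baselines (GAIL, GMMIL, AE) on HalfCheetah. Its variance is often markedly lower, e.g., on Hopper: RED 3626±4 vs. GAIL 3614±7, and on Walker2d: RED 4481±21 vs. GAIL 4878±2848; see the table below (final episodic returns, mean ± spread).

| Method | Hopper | HalfCheetah | Reacher | Walker2d | Ant |
|--------|--------|-------------|---------|----------|-----|
| GAIL | 3614±7 | 4516±549 | −32±40 | 4878±2848 | 3187±904 |
| GMMIL | 3309±26 | 3464±476 | −12±5 | 2967±702 | n/a |
| AE | 3478±3 | 3381±102 | −11±6 | 4098±118 | 3779±423 |
| RED | 3626±4 | 3072±85 | −10±5 | 4481±21 | 3553±349 |
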
  • Autonomous driving (single human demo): RED with a terminal penalty achieves a longer average episode length and a higher track-completion rate than GAIL, GMMIL, and BC.

Empirically, RED delivers competitive or superior policy performance, with training stability enhanced by the fixed reward formulation (Wang et al., 2019).

5. Complexity, Stability, and Scalability

RED’s computational advantage is derived from its single-pass reward extraction. Whereas kernel-PCA support estimation requires an eigendecomposition over the expert samples, RND-based RED scales with the number of predictor gradient steps times the per-step cost. Predictor training runs once, for a fixed number of epochs at a chosen batch size; subsequent RL does not require reward retraining, in contrast to IRL or adversarial paradigms.

The fixed nature of $\hat{r}$ eliminates adversarial oscillation and instability, but predictor overfitting is a risk: fully converged predictors may assign negligible loss everywhere, flattening the reward. Appropriate network regularization and early stopping are practical mitigations (Wang et al., 2019).

6. Limitations and Extensions

Limitations:

  • RED only supports direct imitation; it does not recover the expert’s ground-truth cost function.
  • For highly stochastic experts whose support approaches the full $S \times A$, the reward becomes uniform and uninformative. In this regime, behavior cloning (BC) with sufficient data is preferred.
  • Some tasks require BC-based initialization for adequate exploration.
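The second limitation, a reward that flattens when the expert's support covers the whole state-action space, can be seen in a small tabular sketch. It assumes a linear predictor on one-hot inputs and ten abstract (state, action) pairs; these choices and the names `red_rewards` and `spread` are hypothetical and only serve the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pairs, out_dim, sigma = 10, 8, 1.0   # illustrative sizes and reward scale

def red_rewards(expert_idx):
    """RED-style rewards for all pairs, given the indices of pairs the expert visited."""
    W_target = rng.normal(size=(out_dim, n_pairs))   # frozen random target
    W_pred = np.zeros_like(W_target)                 # trainable linear predictor
    X = np.eye(n_pairs)[expert_idx]                  # one-hot expert inputs
    for _ in range(300):
        W_pred -= (2.0 / len(X)) * (X @ (W_pred - W_target).T).T @ X
    err = np.sum((W_pred - W_target) ** 2, axis=0)   # prediction error per pair
    return np.exp(-sigma * err)

def spread(r):
    return float(r.max() - r.min())

narrow = red_rewards(np.arange(5))    # expert visits half of the pairs
full = red_rewards(np.arange(10))     # highly stochastic expert visits every pair

print("reward spread, narrow expert:", spread(narrow))
print("reward spread, full support: ", spread(full))
```

With full coverage every pair receives essentially the same reward, so the signal no longer distinguishes expert-like behavior from anything else.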

Proposed extensions:

  • Pairing RED with an adversarial discriminator to incentivize broader exploration.
  • Employing alternative support estimators, e.g., denoising autoencoders or normalizing flows.
  • Meta-learning the reward decay parameter dynamically.
  • Extending to hierarchical policies by multi-scale support estimation (Wang et al., 2019).

Relation to Coupled Distributional RED (CDRED): Subsequent work extends RED by coupling expert and behavioral density estimation using RND in the latent space of a world model. CDRED jointly learns two RND predictors in that latent space, balancing expert matching and exploration; it is reported to achieve further stability and performance improvements and to outperform adversarial model-based methods on Meta-World, DMControl, and ManiSkill2 (Li et al., 4 May 2025).

7. Comparative Perspective and Impact

RED represents a principled alternative to IRL and GAN-based imitation frameworks by decoupling reward construction from adversarial training and inner-loop RL. It facilitates stable, efficient learning with fixed rewards and is compatible with a wide spectrum of RL algorithms and continuous-control domains. Later generalizations (e.g., CDRED) highlight RED’s adaptability as the core of density-based, latent-space imitation algorithms, yielding expert-level performance across high-dimensional benchmarks and proving robust under deep exploration and visually complex settings (Li et al., 4 May 2025).
