RLDG: Distilled Generalist RL

Updated 27 February 2026

The paper introduces RLDG, which distills expert RL policies via imitation to enhance sample efficiency and zero-shot transfer.
RLDG employs a two-stage process: generating high-quality state-action trajectories from specialists and then distilling them into a generalist policy using supervised losses.
Empirical results demonstrate 20–40% success improvement in complex robotic tasks and faster convergence compared to pure generalist or RL approaches.

Reinforcement Learning Distilled Generalists (RLDG) are agents constructed by distilling the capabilities of task-specialised reinforcement learning (RL) policies, or ensembles thereof, into a single generalist policy. The paradigm combines the generalisation properties of large foundation models or domain-randomisation-trained specialists with the sample efficiency and optimality of expert RL controllers. RLDG approaches have found substantial success in complex continuous control, robotic manipulation, and zero-shot policy transfer, with demonstrable performance and data-efficiency gains over both pure generalist and pure RL approaches (Xu et al., 2024, Weltevrede et al., 22 May 2025, Jülg et al., 6 Mar 2025, Brosseit et al., 2021, Jia et al., 2022).

1. Foundations and Methodological Framework

The core RLDG methodology involves two principal stages: (1) generation of high-quality state-action trajectories via specialist RL agents and (2) supervised imitation (policy distillation) into a parameter-efficient or multi-task generalist policy. The RL specialists are typically trained to maximize the standard discounted return in an MDP,

$J(\pi_\phi) = \mathbb{E}_{s_0 \sim \rho_0,\, a_t \sim \pi_\phi,\, s_{t+1} \sim P} \left[ \sum_{t=0}^{T} \gamma^t R(s_t, a_t) \right]$

where states $s_t \in \mathcal{S}$ often incorporate RGB vision and proprioception, actions $a_t \in \mathcal{A}$ are, for example, a Cartesian delta plus a gripper command, $P$ is the environment transition kernel, $R$ the reward function, $\rho_0$ the initial-state distribution, $\gamma$ the discount factor, and $T$ the episode horizon.
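As a small illustration of the objective above, the following sketch computes the discounted return of a single finite trajectory, i.e. one Monte Carlo sample of the quantity inside the expectation. The function name and the example reward sequence are illustrative assumptions, not taken from any cited implementation.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma^t * r_t for one finite-horizon trajectory."""
    total = 0.0
    # Accumulate backwards so each step is a single multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical sparse-reward episode: success is signalled only at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 0.5**3 * 1.0 = 0.125
```

Averaging this quantity over many rollouts of $\pi_\phi$ gives an estimate of $J(\pi_\phi)$.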

After attaining specialist-level mastery (typically near-100% success), the RL expert policy is rolled out to collect a dataset $\mathcal{D}_{\rm RL}$ of demonstration trajectories. The generalist policy $\pi_\theta(a|s)$ is then fine-tuned with a supervised loss, most commonly the cross-entropy (negative log-likelihood) of the expert actions:

$\mathcal{L}_{\rm distill}(\theta) = -\mathbb{E}_{(s, a) \sim \mathcal{D}_{\rm RL}} \left[ \log \pi_\theta(a|s) \right]$

Variants for continuous-action spaces employ mean-squared error or diffusion-model losses.
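The two simplest distillation losses mentioned above can be sketched in NumPy as follows. This is a minimal batch-level illustration under assumed array shapes; the function names are hypothetical, and a real pipeline would compute these inside an autodiff framework so that $\theta$ can be updated.

```python
import numpy as np


def nll_distill_loss(logits: np.ndarray, actions: np.ndarray) -> float:
    """Batch estimate of -E[log pi_theta(a|s)] for discrete actions.

    logits:  (B, A) unnormalised student scores for each action.
    actions: (B,)   integer action labels taken by the RL expert.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(actions)), actions].mean())


def mse_distill_loss(pred_actions: np.ndarray, expert_actions: np.ndarray) -> float:
    """Mean-squared-error variant for continuous action vectors."""
    return float(np.mean((pred_actions - expert_actions) ** 2))


# A student with uniform logits over 4 actions pays log(4) per sample.
logits = np.zeros((2, 4))
actions = np.array([0, 3])
print(nll_distill_loss(logits, actions))
```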

RLDG is agnostic to the underlying RL algorithm (e.g., off-policy actor-critic, HIL-SERL, PPO, SAC), supporting both real-world and simulated tasks (Xu et al., 2024, Jülg et al., 6 Mar 2025). The distillation target may be (i) a foundation-model-based generalist (e.g., Octo, OpenVLA) for real-robot tasks, (ii) a compact student for efficient inference, or (iii) an ensemble of distilled policies to enhance generalisation (Weltevrede et al., 22 May 2025).

2. Theoretical Justification and Distillation Guarantees

RLDG leverages both theoretical and empirical evidence for generalisation and performance advantages. In settings where the MDP exhibits symmetry or context group structures, ensemble distillation can be theoretically justified via invariance bounds (Weltevrede et al., 22 May 2025). For an ensemble of $N$ distilled policies $\{\pi_{\theta_i}\}_{i=1}^N$ trained on maximally diverse data, the generalisation gap is controlled by both the “coverage error” $\kappa$ of the training contexts and the ensemble size:

$J^{\pi^*}-J^{\hat\pi_\infty} \le \frac{L_R}{(1-\gamma)(1-\gamma L_T(1+L_{\hat\pi_\infty}))} \left( \kappa \bar C_\Theta + \frac{1}{\sqrt{N}} \bar C_\Sigma(\epsilon) \right)$

where $L_R, L_T, L_{\hat\pi_\infty}$ are Lipschitz constants, $\bar C_\Theta$ and $\bar C_\Sigma$ are network- and data-dependent constants, and the probability of failure $\epsilon$ governs the Monte Carlo deviation.
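To make the roles of $\kappa$ and $N$ concrete, the sketch below evaluates the right-hand side of the bound for given constants. All numeric values are made-up placeholders, $\bar C_\Sigma(\epsilon)$ is treated as a fixed number for a chosen $\epsilon$, and the function name is hypothetical. The expression is only positive and finite when $\gamma L_T(1+L_{\hat\pi_\infty}) < 1$, so the sketch checks that condition explicitly.

```python
import math


def generalisation_bound(L_R: float, L_T: float, L_pi: float, gamma: float,
                         kappa: float, C_theta: float, C_sigma: float,
                         N: int) -> float:
    """Evaluate the right-hand side of the ensemble generalisation bound."""
    contraction = gamma * L_T * (1.0 + L_pi)
    if not (0.0 <= gamma < 1.0 and contraction < 1.0):
        raise ValueError("bound requires gamma < 1 and gamma*L_T*(1+L_pi) < 1")
    prefactor = L_R / ((1.0 - gamma) * (1.0 - contraction))
    # Coverage term (independent of N) plus ensemble term (shrinks as 1/sqrt(N)).
    return prefactor * (kappa * C_theta + C_sigma / math.sqrt(N))


# Placeholder constants chosen only to satisfy the validity condition.
consts = dict(L_R=1.0, L_T=0.5, L_pi=0.5, gamma=0.9, C_theta=1.0, C_sigma=1.0)
for n in (1, 4, 16):
    print(n, generalisation_bound(kappa=0.1, N=n, **consts))
```

With perfect coverage ($\kappa = 0$) quadrupling the ensemble size halves the bound; with $\kappa > 0$ the coverage term sets a floor that no ensemble size removes.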

Empirical ablations confirm that maximising dataset and domain diversity in the specialist RL phase and using sufficiently wide or ensemble students yield superior robustness and zero-shot transfer (Weltevrede et al., 22 May 2025, Xu et al., 2024). Table-based and context-augmented generalists further benefit from subgroup augmentation or domain-randomised policies (Brosseit et al., 2021, Zhao et al., 2020). Catastrophic forgetting and bias are mitigated by periodically interleaving pretraining data or employing data selection strategies.

3. Instantiations: Robotics and Complex Control

In robotic manipulation, RLDG approaches have demonstrated pronounced improvements over imitation-based or pure foundation-model baselines. For example, in real-world connector insertion, peg assembly, and pick-and-place tasks, RLDG policies distilled from RL datasets consistently yielded 20–40 percentage-point success improvements and strong zero-shot transfer relative to models trained on human teleoperator data (Xu et al., 2024). Actions in these settings are often encoded as a 6D Cartesian delta plus a binary gripper command, with states comprising visual and proprioceptive signals and control rates of 4–10 Hz. The foundation models distilled include 7B-parameter vision-language-action architectures, with fine-tuning accomplished via Low-Rank Adaptation (LoRA) or full-diffusion heads.

In continual RL and domain transfer, RLDG has been applied to sequential distillation of task-specific policies, employing state representation learning to avoid catastrophic forgetting and permit robust sim-to-real transfer (Traoré et al., 2019, Brosseit et al., 2021). DiDoR (Distilled Domain Randomization) concretely instantiates RLDG by first training $N$ expert policies under $N$ sampled physics parameters, then distilling their behaviour into a single student network that achieves zero-shot transfer to new domains or real robots (Brosseit et al., 2021).
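The expert-then-student pattern described for DiDoR can be illustrated with a deliberately tiny toy problem. The sketch below is not the DiDoR implementation: the scalar dynamics, the closed-form stand-in "experts" (which replace RL training), and the linear student are all assumptions chosen so the example runs in a few lines.

```python
import numpy as np


def distill_from_domain_experts(num_domains: int = 8,
                                samples_per_domain: int = 64,
                                seed: int = 0):
    """Toy pipeline: one expert per sampled physics parameter, one student.

    Hypothetical domain: x_next = x + b * u, where the gain b is the
    randomised physics parameter.
    """
    rng = np.random.default_rng(seed)
    gains = rng.uniform(0.5, 1.5, size=num_domains)  # N sampled domains
    states, actions = [], []
    for b in gains:
        x = rng.normal(size=samples_per_domain)
        # Stand-in expert (not RL-trained): drives x to zero in one step
        # in its own domain only.
        u = -x / b
        states.append(x)
        actions.append(u)
    X = np.concatenate(states)[:, None]
    y = np.concatenate(actions)
    # Distillation: fit a single linear student u = k * x to all expert data.
    k, *_ = np.linalg.lstsq(X, y, rcond=None)
    return gains, float(k[0])


gains, k = distill_from_domain_experts()
print("student gain:", k)
```

The student's gain is a data-weighted average of the per-domain expert gains $-1/b$, which is the sense in which a single distilled policy compromises across the randomised domains.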

4. Ensemble and Multi-Teacher Distillation

Ensemble-based RLDG extends the distillation paradigm by independently training multiple student policies (using the same or augmented distillation data) and aggregating their outputs, typically via arithmetic mean or voting (Weltevrede et al., 22 May 2025). Empirical studies show that increasing the ensemble size (to $N \approx 10$) materially improves test performance and generalisation in tasks such as rotationally symmetric reaching and grid navigation. Furthermore, including diverse “explore-go” rollout states in the distillation set (rather than only on-policy RL or teacher states) enhances robustness, with ensemble-distilled policies outperforming both the original RL teacher and single-policy behavioural cloning.
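The two aggregation rules named above, arithmetic mean and voting, can be sketched for discrete action distributions as follows. The function names and the example probabilities are illustrative assumptions.

```python
import numpy as np


def mean_aggregate(probs: np.ndarray) -> np.ndarray:
    """Average the action distributions of N ensemble members.

    probs: (N, A) array, each row a member's distribution over A actions.
    """
    return probs.mean(axis=0)


def vote_aggregate(probs: np.ndarray) -> int:
    """Majority vote over each member's greedy action (ties go to lowest index)."""
    votes = probs.argmax(axis=1)
    counts = np.bincount(votes, minlength=probs.shape[1])
    return int(counts.argmax())


# Two mildly confident members prefer action 0; one very confident member prefers 1.
probs = np.array([
    [0.60, 0.40, 0.00],
    [0.55, 0.45, 0.00],
    [0.00, 1.00, 0.00],
])
print(int(mean_aggregate(probs).argmax()))  # 1: averaging respects confidence
print(vote_aggregate(probs))                # 0: voting counts heads only
```

The example shows that the two rules can disagree: averaging weighs how confident each member is, while voting only counts greedy choices.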

Peer-to-peer and multi-domain variants (e.g., P2PDRL) maintain separate RL learners across domain-randomised environments, exchanging action distributions via KL-regularisation in a fully decentralised fashion (Zhao et al., 2020). The symmetric distillation loss aligns each learner’s policy distribution across peers, stabilising generalisation to previously unseen domains.
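One simple way to write a symmetric KL alignment term between two peers is sketched below for discrete action distributions. This is an illustrative form under assumed shapes; the exact regulariser and weighting used in P2PDRL may differ, and the function names are hypothetical.

```python
import numpy as np


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Mean KL(p || q) over a batch of discrete distributions of shape (B, A)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())


def symmetric_distill_loss(p_a: np.ndarray, p_b: np.ndarray) -> float:
    """Average of both KL directions, so neither peer acts as a fixed teacher."""
    return 0.5 * (kl_divergence(p_a, p_b) + kl_divergence(p_b, p_a))


# Action distributions of two peers on the same batch of states.
p_a = np.array([[0.7, 0.3], [0.2, 0.8]])
p_b = np.array([[0.5, 0.5], [0.2, 0.8]])
print(symmetric_distill_loss(p_a, p_b))
```

In training, this term would be added to each learner's own RL objective with a weighting coefficient, which is another hyperparameter not specified here.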

RLDG Variant	Distillation Target	Underlying RL	Key Benefit
Ensemble RLDG	Policy Ensemble	PPO/SAC/Experts	Generalisation, Robustness
Foundation Model RLDG	Large VLA	HIL-SERL, PPO	Sample efficiency, Zero-shot Transfer
P2PDRL	Peer Set	Parallel PPO	Domain-invariant policies
DiDoR	Single Student	PPO/domain-spec.	Sim-to-real Transfer

5. Empirical Evaluations and Performance Metrics

Empirical studies across real-robot and simulated manipulation benchmarks provide quantitative evidence for RLDG's advantages. In robotic fine manipulation, RL-distilled generalists (OpenVLA, Octo) reach up to 100% zero-shot success on connector insertion and assembly while requiring 6–10× less data than human demonstration-based generalists (Xu et al., 2024). Cycle times improve by 0.3–2.3 s per episode, inheriting the RL experts' implicit speed optimisations. Ablation studies reveal that the principal gains originate from RL-generated action labels; mixing human state distributions with RL actions closes most of the performance gap, indicating that RL provides better action targets than human supervisors.

In simulation-based multi-task and domain transfer tasks (e.g., ManiSkill2), RLDG consistently accelerates convergence, reaching an 80–100% success rate (SR) in 1–2M steps versus $\geq 3$M for vanilla PPO. On sparse-reward tasks, RLDG via refined policy distillation (RPD) achieves 60–90% SR, whereas the underlying generalist or PPO fails to solve the task (Jülg et al., 6 Mar 2025). Ensemble and multi-variant distilled policies yield superior returns in both symmetry-augmented and arbitrary environment splits (Weltevrede et al., 22 May 2025, Brosseit et al., 2021, Zhao et al., 2020).

6. Limitations, Failure Modes, and Open Questions

Known limitations of RLDG include dependence on the specification of task rewards: real-world tasks without easily defined binary or shaped rewards require further research into automatic reward inference or curriculum generation (Xu et al., 2024). Speed-robustness trade-offs emerge when distillation inherits an RL expert's aggressive timing, occasionally producing brittle behaviour in edge cases; future work may explore multi-objective RL or moderated distillation losses. Catastrophic forgetting can occur when generalists are fine-tuned solely on RL rollouts, especially for foundation models requiring broad semantic coverage; interleaving task-agnostic pretraining corpora with RL data is recommended.
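The interleaving recommendation can be sketched as a batch sampler that reserves a fixed share of every fine-tuning batch for pretraining data. The mixing fraction, the function name, and the tagged-tuple data format are assumptions for illustration; the cited works do not prescribe a particular ratio here.

```python
import random
from typing import Iterator, List, Sequence


def mixed_batches(rl_data: Sequence, pretrain_data: Sequence, batch_size: int,
                  pretrain_fraction: float, num_batches: int,
                  seed: int = 0) -> Iterator[List]:
    """Yield batches mixing RL rollouts with task-agnostic pretraining samples."""
    rng = random.Random(seed)
    n_pre = round(batch_size * pretrain_fraction)  # slots reserved for pretraining data
    n_rl = batch_size - n_pre
    for _ in range(num_batches):
        # Sample with replacement from each source, then shuffle within the batch.
        batch = rng.choices(rl_data, k=n_rl) + rng.choices(pretrain_data, k=n_pre)
        rng.shuffle(batch)
        yield batch


# Hypothetical datasets: each item is tagged with its source.
rl = [("rl", i) for i in range(100)]
pre = [("pre", i) for i in range(100)]
for batch in mixed_batches(rl, pre, batch_size=8, pretrain_fraction=0.25, num_batches=2):
    print([tag for tag, _ in batch])
```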

Scalability to very long-horizon or hierarchically decomposed tasks remains a challenge. Hierarchical RLDG (distilling at multiple subtask levels) and dynamically scheduled guidance (e.g., decaying the imitation strength as the student surpasses the teacher) are proposed research directions (Jülg et al., 6 Mar 2025, Xu et al., 2024).
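As one hypothetical instance of dynamically scheduled guidance, the sketch below scales an imitation weight by how far the student's success rate trails the teacher's, reaching zero once the student matches or surpasses it. The linear rule, the function names, and the additive loss combination are assumptions, not a schedule taken from the cited papers.

```python
def imitation_weight(student_success: float, teacher_success: float,
                     beta_max: float = 1.0) -> float:
    """Imitation coefficient that decays as the student catches up with the teacher."""
    if teacher_success <= 0.0:
        return 0.0  # a teacher that never succeeds offers nothing to imitate
    ratio = student_success / teacher_success
    return beta_max * max(0.0, 1.0 - ratio)


def guided_loss(rl_loss: float, imitation_loss: float, beta: float) -> float:
    """Combine the RL objective with a weighted imitation term."""
    return rl_loss + beta * imitation_loss


# Success rates measured on periodic evaluation rollouts (made-up numbers).
for student_sr in (0.0, 0.4, 0.8, 0.9):
    beta = imitation_weight(student_sr, teacher_success=0.8)
    print(student_sr, beta)
```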

7. Integration with Generalist–Specialist and Continual Learning Frameworks

Related approaches, such as Generalist–Specialist Learning (GSL), instantiate a three-phase meta-algorithm: (1) initial generalist RL, (2) spawning and training of multiple context-specialist agents, (3) demonstration-augmented fine-tuning of the generalist via imitation or adversarial reward shaping (Jia et al., 2022). This design amalgamates feature-sharing, specialist-level optimality, and robust aggregation into a generalist that exceeds both monolithic RL and simple supervised imitation.

RLDG has also been adopted in continual and lifelong RL, where sequential tasks are mastered without explicit task identifiers. Policy distillation from individually trained policies into an updatable single policy offers substantial benefits over naïve sequential RL, with respect to retention of previously acquired skills and scalability to extended task sequences (Traoré et al., 2019).

References:

(Xu et al., 2024) RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
(Weltevrede et al., 22 May 2025) How Ensembles of Distilled Policies Improve Generalisation in Reinforcement Learning
(Jülg et al., 6 Mar 2025) Refined Policy Distillation: From VLA Generalists to RL Experts
(Brosseit et al., 2021) Distilled Domain Randomization
(Zhao et al., 2020) Robust Domain Randomised Reinforcement Learning through Peer-to-Peer Distillation
(Traoré et al., 2019) Continual Reinforcement Learning deployed in Real-life using Policy Distillation and Sim2Real Transfer
(Jia et al., 2022) Improving Policy Optimization with Generalist-Specialist Learning