
Cyclic Policy Distillation for Sim-to-Real RL

Updated 14 May 2026
  • Cyclic Policy Distillation is a reinforcement learning method that partitions wide-ranging domain randomization into overlapping sub-domains for targeted policy training.
  • It sequentially trains local policies in a cyclic manner, transferring knowledge between adjacent sub-domains to ensure stable updates and faster convergence.
  • By distilling local insights into a unified global policy, CPD achieves significant sample efficiency gains and robust zero-shot sim-to-real performance in diverse tasks.

Cyclic Policy Distillation (CPD) is a sample-efficient reinforcement learning (RL) methodology designed for robust sim-to-real policy transfer under wide domain randomization (DR). CPD addresses the challenges arising from instability and sample inefficiency inherent in standard DR-based training when the randomized parameter range is extensive. By systematically partitioning the parameter space, cycling training across local policies, and leveraging targeted distillation, CPD accelerates convergence and enhances zero-shot deployment performance, both in simulation and on real-robot tasks (2207.14561).

1. Motivation and Conceptual Framework

Zero-shot sim-to-real transfer aims to train a single policy in a simulated environment across a broad range of randomized physical and sensor parameters, ensuring immediate real-world effectiveness without further adaptation. Vanilla DR approaches expose a single agent to the entire variability of the environment but suffer from high sample complexity and unstable policy updates when the DR parameter space is large. CPD mitigates these issues by:

  • Partitioning the DR parameter range into a sequence of narrower, overlapping sub-domains.
  • Allocating a distinct local policy–value pair, $\{\pi^{(n)}, Q^{(n)}\}$, to each sub-domain.
  • Training each local agent in a cyclical sequence, beneficially transferring knowledge only from adjacent sub-domains.
  • Distilling the ensemble of specialized local policies into a unified global policy $\pi^{(g)}$ for robust zero-shot transfer.

This cycle ensures knowledge is shared between closely related domains, maintaining learning stability while promoting sample efficiency.

2. Partitioning the Randomized Parameter Space

Consider the full randomized domain parameter space,

$$\Theta = \{\xi \in \mathbb{R}^d : \xi_i \in [\ell_i, u_i] \ \forall i = 1, \ldots, d\},$$

where each $\xi_i$ encodes a physical or sensor parameter, such as gravity, mass, friction, or sensor bias. CPD selects one or more target parameters $\xi_j$ and splits their ranges into $N$ overlapping intervals, yielding sub-domains:

$$\Theta = \bigcup_{n=1}^{N} \Theta_n, \hspace{1cm} \Theta_n = \{\xi : \xi_j \in [\ell_j^n, u_j^n], \ \xi_{-j} \in [\ell_{-j}, u_{-j}]\}.$$

Here, $\xi_{-j}$ denotes all non-split parameters, which remain fully randomized. In practice, the hyperparameter $N$ is selected based on cross-validation to balance sample efficiency and final policy performance (e.g., $N=4$ for Pendulum, $N=6$ for MuJoCo domains). The overlap and splitting structure can follow planes, grids, or edges, provided adjacent sub-domains remain close in parameter space, which facilitates stable knowledge transfer.
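The one-dimensional case of this partitioning can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_overlapping` and the `overlap` fraction are hypothetical, and the source does not state how much adjacent intervals overlap. The example range is the Pendulum gravity range quoted in Section 7, with $N=4$.

```python
def split_overlapping(lo, hi, n, overlap=0.1):
    """Split [lo, hi] into n equal-width intervals whose edges are widened
    by `overlap` * width, so that neighbouring intervals share a margin.

    `overlap` is a hypothetical knob introduced for illustration only.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    width = (hi - lo) / n
    margin = overlap * width
    intervals = []
    for k in range(n):
        # Widen each interval on both sides, then clamp to the full range.
        left = max(lo, lo + k * width - margin)
        right = min(hi, lo + (k + 1) * width + margin)
        intervals.append((left, right))
    return intervals


# Pendulum gravity range from Section 7, split into N = 4 sub-domains.
gravity_subdomains = split_overlapping(0.7, 1.5, n=4, overlap=0.1)
for n, (left, right) in enumerate(gravity_subdomains, start=1):
    print(f"Theta_{n}: gravity in [{left:.3f}, {right:.3f}]")
```

Only the split parameter is restricted per sub-domain; all other parameters would keep their full ranges, matching the definition of $\Theta_n$ above.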

3. The CPD Algorithm: Local Policy Training and Cyclic Transitions

Each sub-domain $\Theta_n$ is assigned a policy–value network pair, $\{\pi^{(n)}, Q^{(n)}\}$, both randomly initialized. Training proceeds by alternating forward and backward cycles across the sub-domains ($n = 1 \to N$, then $n = N \to 1$), with the following key mechanisms:

  • On entering a new sub-domain, $\pi^{(n)}$ is initialized from $\pi^{(n-1)}$ (forward) or $\pi^{(n+1)}$ (backward) to exploit policy continuity.
  • In each cycle, for a fixed number of episodes and time steps, transitions $(s, a, r, s')$ are collected under policy $\pi^{(n)}$ within $\Theta_n$.
  • The critic $Q^{(n)}$ is updated using the standard RL loss (e.g., the Soft Actor-Critic (SAC) critic loss).
  • A monotonic policy-improvement distillation term, computed with respect to the local policy of the adjacent sub-domain ($n-1$ in the forward pass, $n+1$ in the backward pass), is incorporated into the actor update.

The resulting policy update objective ensures that knowledge transfer is guided by expected performance improvement, analogous to Conservative Policy Iteration (CPI); its explicit form is given in (2207.14561). Cycling proceeds until convergence across all local policies.
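The visiting order implied by the forward and backward cycles can be sketched as below. This is an illustrative assumption about the schedule, not the reference implementation: the function name `cyclic_schedule` is hypothetical, indices are 0-based, and each visit is paired with the previously visited sub-domain, which is always an adjacent one and would serve as the source for knowledge transfer.

```python
def cyclic_schedule(n_subdomains, n_cycles):
    """Return a list of (current, neighbour) sub-domain indices (0-based).

    One cycle is a forward sweep 0..N-1 followed by a backward sweep
    N-2..0. `neighbour` is the sub-domain visited just before `current`
    (None for the very first visit).
    """
    visits = []
    previous, current, step = None, 0, 1
    total_visits = 1 + 2 * (n_subdomains - 1) * n_cycles
    for _ in range(total_visits):
        visits.append((current, previous))
        if n_subdomains == 1:
            break
        # Reverse direction at either end of the sub-domain chain.
        if not 0 <= current + step < n_subdomains:
            step = -step
        previous, current = current, current + step
    return visits


schedule = cyclic_schedule(n_subdomains=4, n_cycles=1)
for current, neighbour in schedule:
    print(f"train local policy {current}, transfer from {neighbour}")
```

In a full training loop, each visit would collect the fixed number of episodes in the current sub-domain, update the local critic and actor, and then move on to the next entry of the schedule.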

4. Global Policy Distillation

Upon local policy convergence, a global policy $\pi^{(g)}$ is learned via one-shot distillation. Its parameters are optimized to minimize the average KL divergence between $\pi^{(g)}$ and each $\pi^{(n)}$ on rollouts aggregated across all sub-domains:

$$\min_{\pi^{(g)}} \ \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_{s \sim \mathcal{D}} \left[ D_{\mathrm{KL}}\left( \pi^{(n)}(\cdot \mid s) \,\|\, \pi^{(g)}(\cdot \mid s) \right) \right],$$

where $\mathcal{D}$ is a replay buffer containing transitions from all sub-domains. This global distillation aggregates specialized skills into a single policy for real-world deployment.
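For diagonal-Gaussian policies, the average-KL distillation objective described above has a closed form per state. The sketch below evaluates it in plain Python under two assumptions that the source does not confirm: the KL direction is taken as local-to-global, and every local policy is evaluated on the same buffer states. The names `kl_diag_gaussian` and `distillation_loss` are hypothetical.

```python
import math


def kl_diag_gaussian(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, std_p, mu_q, std_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def distillation_loss(local_outputs, global_outputs):
    """Average KL(local || global) over all local policies and buffer states.

    local_outputs[n][i] and global_outputs[i] are (mean, std) pairs that the
    n-th local policy and the global policy produce for buffer state i.
    """
    total, n_terms = 0.0, 0
    for per_policy in local_outputs:
        for (mu_l, std_l), (mu_g, std_g) in zip(per_policy, global_outputs):
            total += kl_diag_gaussian(mu_l, std_l, mu_g, std_g)
            n_terms += 1
    return total / n_terms


# Toy example: two local policies, one buffer state, one action dimension.
local_outputs = [
    [([0.0], [1.0])],
    [([2.0], [1.0])],
]
global_outputs = [([1.0], [1.0])]
loss = distillation_loss(local_outputs, global_outputs)
print(loss)
```

In practice the means and standard deviations would come from network forward passes and the loss would be minimized by gradient descent on the global policy's parameters; the toy numbers here only exercise the formula.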

5. RL Backbone, Architecture, and Implementation

CPD is built atop the Soft Actor-Critic (SAC) off-policy actor–critic methodology. In each sub-domain:

  • The critic $Q^{(n)}$ is updated using the mean-squared Bellman residual.
  • The actor $\pi^{(n)}$ employs a diagonal-Gaussian output distribution, trained by the SAC policy gradient and augmented by the CPD distillation term.
  • Both actor and critic networks include an LSTM (64 units) to encode state–action histories, thereby inferring hidden environment parameters.

Key hyperparameters, selected for training stability and efficiency, include the sub-domain count ($N=4$ for Pendulum, $N=6$ for MuJoCo tasks), the monotonic mix coefficient, the discount factor, the number of episodes per sub-domain shift (15), the number of steps per episode (150), and the SAC learning rates. For real-robot experiments, hidden layer sizes are increased to 128 units.

| Application | Sub-domain Count ($N$) | LSTM Size | Episodes/Shift | Steps/Episode |
|---|---|---|---|---|
| Pendulum (Gym) | 4 | 64 | 15 | 150 |
| MuJoCo (Pusher, Swimmer, HalfCheetah) | 6 | 64 | 15 | 150 |
| UR3 Ball-Dispersal | 6 | 128 | 15 | 150 |
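The settings in the table can be captured in a small configuration object. The sketch below is only a convenient way to hold the tabulated values; the class name `CPDConfig`, its field names, and the derived `steps_per_shift` property are hypothetical and do not come from the released code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CPDConfig:
    """Per-task CPD settings, using the values listed in the table above."""
    n_subdomains: int
    lstm_units: int
    episodes_per_shift: int
    steps_per_episode: int

    @property
    def steps_per_shift(self):
        # Environment steps collected during one visit to a sub-domain.
        return self.episodes_per_shift * self.steps_per_episode


CONFIGS = {
    "pendulum": CPDConfig(n_subdomains=4, lstm_units=64,
                          episodes_per_shift=15, steps_per_episode=150),
    "mujoco": CPDConfig(n_subdomains=6, lstm_units=64,
                        episodes_per_shift=15, steps_per_episode=150),
    "ur3_ball_dispersal": CPDConfig(n_subdomains=6, lstm_units=128,
                                    episodes_per_shift=15, steps_per_episode=150),
}

for task, cfg in CONFIGS.items():
    print(task, cfg.n_subdomains, cfg.steps_per_shift)
```

With 15 episodes of 150 steps, each visit to a sub-domain corresponds to 2,250 environment steps, assuming every episode runs to its full length.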

6. Empirical Performance and Evaluation

CPD demonstrates substantial improvements in sample efficiency and zero-shot sim-to-real transfer. In four benchmark tasks—OpenAI Gym Pendulum and MuJoCo Pusher, Swimmer, and HalfCheetah—CPD outperforms SAC-DR, Peer-to-Peer DRL, Distilled DR, Divide-and-Conquer, and Active DR. Notable results include:

  • Pendulum: CPD achieves a total reward of 190 with approximately 18,000 samples, compared to on the order of 100,000 samples for the next-best approach, a more than fivefold efficiency gain.
  • HalfCheetah: Only CPD manages to secure a positive reward within 150,000 steps; all baselines fail to do so.
  • Ablation experiments confirm that either removing the monotonic neighbor-distillation term or randomizing the order of cycling degrades both convergence speed and final policy performance.

In real-robot ball-dispersal, using a UR3 arm, CPD's global policy achieves 13/14 (juggling) and 11/14 (beads) upper ball teardown success in zero-shot tests, compared to SAC-DR's 6/14 and 1/14, respectively, and 0/14 for SAC without DR.
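To make the real-robot comparison easier to read, the trial counts quoted above can be converted to success rates. The snippet below is plain arithmetic on those counts; the dictionary layout and the `success_rate` helper are illustrative only.

```python
# Zero-shot trial counts (successes, trials) quoted in the text above.
results = {
    "CPD": {"juggling": (13, 14), "beads": (11, 14)},
    "SAC-DR": {"juggling": (6, 14), "beads": (1, 14)},
}


def success_rate(successes, trials):
    """Fraction of successful trials."""
    return successes / trials


rates = {
    method: {obj: success_rate(*counts) for obj, counts in per_object.items()}
    for method, per_object in results.items()
}

for method, per_object in rates.items():
    for obj, rate in per_object.items():
        print(f"{method:7s} {obj:9s} {100 * rate:5.1f}%")
```

With only 14 trials per condition these rates carry wide uncertainty, so they are best read as a qualitative ranking rather than precise estimates.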

7. Domain Randomization Scope and Task Environments

CPD's effectiveness is demonstrated across both simulation and real-robot settings with wide-ranging DR:

  • Pendulum: gravity [0.7, 1.5], timestep [0.8, 1.2], mass/length [0.8, 1.2], actuator gain/bias.
  • MuJoCo suite: gravity, timestep, friction, mass, actuator gain/damping in [0.5, 2].
  • UR3 Ball-Dispersal: gravity [9, 11], friction [0.6, 1.0], ball mass [5, 20] g, ball radius [25, 30] mm, sensor bias ±10 mm, Z-scale [0.7, 1.3]. Observations include a compressed depth image (5×9) and end-effector position.
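Drawing one randomized environment from ranges like those listed above can be sketched as follows. Several things here are assumptions made for illustration: sampling is uniform within each range, the sensor bias is read as a symmetric ±10 mm interval, and the restricted gravity interval stands in for a single sub-domain (the source does not say which UR3 parameter is split). The names `UR3_RANGES` and `sample_domain` are hypothetical.

```python
import random

# Ranges for the UR3 ball-dispersal task as listed above.
# ASSUMPTION: sensor bias is taken to be a symmetric +/-10 mm interval.
UR3_RANGES = {
    "gravity": (9.0, 11.0),
    "friction": (0.6, 1.0),
    "ball_mass_g": (5.0, 20.0),
    "ball_radius_mm": (25.0, 30.0),
    "sensor_bias_mm": (-10.0, 10.0),
    "z_scale": (0.7, 1.3),
}


def sample_domain(ranges, rng, subdomain=None):
    """Sample one parameter vector uniformly (an assumption) from `ranges`.

    `subdomain` optionally narrows the range of the split parameter(s);
    every other parameter stays fully randomized.
    """
    bounds = dict(ranges)
    if subdomain:
        bounds.update(subdomain)
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}


rng = random.Random(0)
# Hypothetical sub-domain: gravity restricted to a narrower interval.
xi = sample_domain(UR3_RANGES, rng, subdomain={"gravity": (9.0, 9.6)})
for name, value in xi.items():
    print(f"{name}: {value:.3f}")
```

Passing `subdomain=None` reproduces vanilla DR over the full parameter space, which is the setting the global policy ultimately has to cover.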

CPD robustly adapts to both continuous control and vision-based robotic manipulation.


For code, pretrained models, and experimental artifacts, see https://github.com/yuki-kadokawa/cyclic-policy-distillation. The CPD framework provides a principled strategy to trade increased bookkeeping complexity for significant sample savings, stable learning, and enhanced zero-shot sim-to-real RL performance under extensive domain randomization (2207.14561).

References (1)
