Co-Evolving Policy Distillation (CoPD)

Updated 2 May 2026

The paper introduces a novel framework that enables expert policies to co-evolve via mutual on-policy distillation, achieving high behavioral overlap and efficient knowledge absorption.
It interleaves domain-specific reinforcement learning updates with cross-branch distillation to mitigate gradient interference and preserve complementary skills.
Empirically, CoPD outperforms static distillation and mixed RL approaches in multimodal tasks, yielding state-of-the-art results in text, image, and video integration.

Co-Evolving Policy Distillation (CoPD) defines a framework for consolidating multiple expert-trained policies into a unified, high-performing model by combining ongoing, domain-specialized reinforcement learning with simultaneous, bidirectional on-policy distillation. The central innovation is to let all policies (“branches”) co-evolve, each serving alternately as teacher and student, rather than training experts sequentially and then distilling them statically. This paradigm keeps the branches behaviorally consistent and achieves high absorption of cross-domain competencies, yielding state-of-the-art results in multimodal and multi-domain integration (Gu et al., 29 Apr 2026). The core idea is also present in real-time policy distillation for deep reinforcement learning (Sun et al., 2019), which can be interpreted as a special case where teacher and student are updated in lockstep.

1. Theoretical Foundations and Motivation

Policy distillation addresses the challenge of transferring knowledge or compressing expertise from one or more “teacher” policies into a new “student” policy. In the RL context, standard approaches include:

- Mixed Reinforcement Learning with Verifiable Rewards (RLVR): Training a single model on merged datasets (e.g., $D_1 \cup D_2$) using clipped-surrogate losses such as PPO or GRPO. While this allows simultaneous learning, gradients from disparate skills introduce a capability divergence cost $\Phi$, reducing total utility: $U_{\rm mix}\approx X(D_1, D_2)-\Phi(D_1, D_2)$.
- On-Policy Distillation (OPD): First train separate expert policies to convergence, then distill into a student policy by minimizing KL divergence on the student’s own rollouts. However, if teacher and student behaviors differ significantly (low overlap $\mathcal{O}$), much of the teacher’s capability cannot be absorbed: $U_{\rm static}\approx \eta(\mathcal{O}_{\rm low})\, X(D_1, D_2)$, where $\eta(\mathcal{O}_{\rm low}) \ll 1$.

CoPD interleaves concurrent domain-specific RLVR (“exploration”) with mutual on-policy distillation (“absorption”), so that each expert branch regularly incorporates complementary competencies while staying behaviorally close to its peers. This yields $\eta(\mathcal{O}_{\rm mod}) \approx 1$ while minimizing $\Phi$, achieving effective multi-tasking without gradient interference or knowledge loss (Gu et al., 29 Apr 2026).
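The three utility expressions above can be compared numerically. The following is a minimal sketch using made-up values for $X$, $\Phi$, and $\eta$; none of these numbers come from the paper, and the function names are hypothetical.

```python
def utility_mixed(x, phi):
    """U_mix ~ X - Phi: joint training pays the divergence cost Phi."""
    return x - phi


def utility_distilled(x, eta, phi=0.0):
    """U ~ eta(O) * X - Phi: distillation keeps a fraction eta of X."""
    return eta * x - phi


# Illustrative values only (assumptions, not measurements from the paper).
X = 10.0          # combined capability of the two experts
PHI_MIXED = 3.0   # divergence cost under mixed RLVR
ETA_LOW = 0.5     # absorption when overlap is low (static OPD)
ETA_COPD = 0.95   # absorption when overlap stays high (CoPD)

u_mix = utility_mixed(X, PHI_MIXED)
u_static = utility_distilled(X, ETA_LOW)
u_copd = utility_distilled(X, ETA_COPD)

print(f"mixed RLVR: {u_mix:.2f}")
print(f"static OPD: {u_static:.2f}")
print(f"CoPD:       {u_copd:.2f}")
```

Under these assumed values the CoPD utility is the largest of the three, because it avoids both the subtraction of $\Phi$ and the small multiplier $\eta(\mathcal{O}_{\rm low})$.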

2. Formulation and Objectives

Given $K$ expert branches $\{\pi_{\theta_k}\}_{k=1}^{K}$ (each initialized from a common base policy $\pi_{\theta_0}$ and associated with dataset $D_k$), CoPD alternates between:

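Phase I (Domain-Specific RLVR):
- Each branch $\pi_{\theta_k}$ optimizes on its own data $D_k$ via RLVR, using a clipped surrogate of the standard PPO/GRPO form:

$\mathcal{J}_{\rm RLVR}^{(k)}(\theta_k) = \mathbb{E}_{x \sim D_k,\ y \sim \pi_{\theta_k}}\Big[\sum_t \min\big(r_t\,\hat{A}_t,\ \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\,\hat{A}_t\big)\Big]$

where $r_t$ is the token-level importance ratio and $\hat{A}_t$ the reward-based advantage.
- Avoids cross-domain gradient conflict ($\Phi \approx 0$ within each branch).

Phase II (Mutual On-Policy Distillation):
- On the other branch’s data ($D_j$, $j \neq k$), compute token-level advantages for distillation using the teacher signal:

$\hat{A}^{\rm OPD}_t = \log \pi_{\theta_j}(y_t \mid x, y_{<t}) - \log \pi_{\theta_k}(y_t \mid x, y_{<t})$

- Formulate the OPD surrogate:

$\mathcal{J}_{\rm OPD}^{(k)}(\theta_k) = \mathbb{E}_{x \sim D_j,\ y \sim \pi_{\theta_k}}\Big[\sum_t \min\big(r_t\,\hat{A}^{\rm OPD}_t,\ \mathrm{clip}(r_t, 1-\epsilon, 1+\epsilon)\,\hat{A}^{\rm OPD}_t\big)\Big]$

- Each branch combines RLVR and OPD objectives:

$\mathcal{J}^{(k)}(\theta_k) = \mathcal{J}_{\rm RLVR}^{(k)}(\theta_k) + \mathcal{J}_{\rm OPD}^{(k)}(\theta_k)$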

This bidirectional, parallel update maintains high behavioral overlap ($\mathcal{O}$) and low symmetric KL, ensuring efficient cross-branch knowledge transfer while continuously extending each expert’s domain.
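The distillation phase can be sketched in a few lines. This is a minimal illustration, not the paper's implementation. It assumes that the token-level advantage is the teacher's log-probability minus the student's on tokens the student sampled, that the surrogate is PPO-style with clipping parameter `eps`, and that the two objectives are combined with a weight `lam`. All function names and numeric values are hypothetical.

```python
import math


def opd_advantages(teacher_logps, student_logps):
    """Per-token advantage: how much more the teacher favors each sampled token."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]


def clipped_surrogate(new_logps, old_logps, advantages, eps=0.2):
    """PPO-style clipped objective averaged over tokens (to be maximized)."""
    total = 0.0
    for new, old, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new - old)                      # importance ratio r_t
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


def combined_objective(rlvr_obj, opd_obj, lam=1.0):
    """Weighted sum of the RLVR and OPD objectives (lam is an assumed knob)."""
    return rlvr_obj + lam * opd_obj


# Log-probs of three tokens sampled by the student on the teacher's domain.
student_old = [-1.0, -2.0, -0.5]
teacher = [-0.5, -2.5, -0.5]
student_new = [-0.9, -2.1, -0.5]  # after a small update toward the teacher

adv = opd_advantages(teacher, student_old)
opd_obj = clipped_surrogate(student_new, student_old, adv)
print(adv, opd_obj)
```

Tokens the teacher rates higher than the student get a positive advantage and are reinforced, while tokens the teacher rates lower are suppressed. The clipping keeps any single update from moving the student too far from the policy that generated the rollouts.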

3. Algorithmic Implementation

For $K = 2$ branches, the typical CoPD loop is:

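- Initialize both branches from the common base policy.
- Phase I: each branch performs RLVR updates on its own domain data.
- Phase II: each branch samples rollouts on the other branch’s data, receives token-level distillation advantages from that branch acting as teacher, and updates on the OPD surrogate.
- Repeat the two phases; after training, merge the branch parameters into a single model.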

Central scheduling hyperparameters include the ratio between RLVR and distillation updates, the rollout sampling temperature, and the KL-clip threshold. The batch size may be set, for example, to 256 prompts with 8 rollouts each. For more than two branches, a hub-and-spoke extension is straightforward (Gu et al., 29 Apr 2026).
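The alternation schedule for two branches can be sketched as a small driver function. This is a toy illustration under stated assumptions, not the authors' training code. `copd_loop`, `rlvr_step`, `opd_step`, and the step-count parameters are hypothetical names. Teachers are snapshotted at the start of Phase II so that both branches distill from the same peer state, which is a design assumption. The "branches" here are plain dictionaries of skill scores rather than neural policies.

```python
def copd_loop(branches, domains, rlvr_step, opd_step, num_rounds,
              rlvr_steps=1, opd_steps=1):
    """Alternate own-domain RLVR (Phase I) and mutual OPD (Phase II) for two branches."""
    assert len(branches) == 2 and len(domains) == 2
    branches = list(branches)  # avoid mutating the caller's list
    for _ in range(num_rounds):
        # Phase I: each branch trains only on its own domain.
        for k in (0, 1):
            for _ in range(rlvr_steps):
                branches[k] = rlvr_step(branches[k], domains[k])
        # Phase II: each branch distills from a snapshot of its peer,
        # on the peer's domain.
        teachers = list(branches)
        for k in (0, 1):
            j = 1 - k
            for _ in range(opd_steps):
                branches[k] = opd_step(branches[k], teachers[j], domains[j])
    return branches


# Toy stand-ins: a branch is a dict of per-domain skill scores.
def toy_rlvr_step(branch, domain):
    updated = dict(branch)
    updated[domain] += 1.0  # pretend RL improves the branch's own domain
    return updated


def toy_opd_step(student, teacher, domain):
    updated = dict(student)
    # Move halfway toward the teacher's skill on the teacher's domain.
    updated[domain] += 0.5 * (teacher[domain] - student[domain])
    return updated


start = [{"image": 0.0, "text": 0.0}, {"image": 0.0, "text": 0.0}]
image_branch, text_branch = copd_loop(
    start, ["image", "text"], toy_rlvr_step, toy_opd_step, num_rounds=2
)
print(image_branch, text_branch)
```

In this toy run each branch stays strongest on its own domain while steadily picking up the peer's skill. Because distillation happens every round, the gap between the two branches never grows large, which mirrors the behavioral-overlap argument in the text.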

For value-based RL (as in Atari games), this scheme manifests as real-time alternation of DQN-style updates for the teacher and combined distillation/self-learning updates for the student, with both sharing replay buffers and experience (Sun et al., 2019).

4. Empirical Evaluation

CoPD achieves state-of-the-art results in consolidating text, image, and video reasoning competencies.

Two‐Branch Setting (Image + Text):

Setting	Image Avg	Text Avg	Overall Avg
Base	54.00	55.78	54.74
Image-Expert	55.76	55.51	55.65
Text-Expert	54.88	57.89	56.13
Mixed RLVR	55.69	55.48	55.60†
OPD (V→T)	55.99	56.23	56.09
OPD (T→V)	56.44	56.09	56.29
CoPD	56.97	58.76	57.71

Three‐Branch Setting (Text + Image + Video):

Setting	Image Avg	Text Avg	Video Avg	Overall Avg
Base	54.00	55.78	56.22	55.11
MOPD	56.37	56.80	58.32†	56.99
CoPD	57.12	58.63	59.21	58.12
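The margins implied by the two tables are easy to compute directly. The sketch below copies the "Overall Avg" columns from the tables above and reports the CoPD gain over each other setting; the helper name `gains_over` is hypothetical.

```python
# "Overall Avg" values copied from the two tables above.
two_branch_overall = {
    "Base": 54.74,
    "Image-Expert": 55.65,
    "Text-Expert": 56.13,
    "Mixed RLVR": 55.60,
    "OPD (V->T)": 56.09,
    "OPD (T->V)": 56.29,
    "CoPD": 57.71,
}
three_branch_overall = {"Base": 55.11, "MOPD": 56.99, "CoPD": 58.12}


def gains_over(table, method="CoPD"):
    """Difference between `method` and every other row, in points."""
    ref = table[method]
    return {
        name: round(ref - score, 2)
        for name, score in table.items()
        if name != method
    }


two_branch_gains = gains_over(two_branch_overall)
three_branch_gains = gains_over(three_branch_overall)
print(two_branch_gains)
print(three_branch_gains)
```

In the two-branch setting the smallest margin is over OPD (T→V) at 1.42 points and the largest is over the base model at 2.97 points. In the three-branch setting CoPD leads MOPD by 1.13 points.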

Ablation studies demonstrate that removing any bidirectional distillation path (e.g., image→text or text→image) degrades accuracy by approximately 0.7–1.0%. Each branch alone (prior to parameter merging) outperforms static OPD. During training, behavioral overlap ($\mathcal{O}$) is maintained above 0.90, and symmetric KL divergence remains low, in contrast to static OPD or mixed RLVR baselines, where $\mathcal{O}$ decreases and KL grows by an order of magnitude (Gu et al., 29 Apr 2026).
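To make the two diagnostics concrete, the sketch below computes them for a pair of next-token distributions. The source does not give the exact definition of behavioral overlap, so the `overlap` function here (shared probability mass, i.e. one minus total variation distance) is an assumed proxy. Symmetric KL is taken as the sum of the two directed KL terms, which is also an assumption, and the distributions are invented for illustration.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions; assumes q is strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Sum of both directed KL divergences."""
    return kl(p, q) + kl(q, p)


def overlap(p, q):
    """Assumed overlap proxy: probability mass shared by p and q."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Two branches that stayed close (co-evolved) versus two that drifted apart.
close_a, close_b = [0.5, 0.3, 0.2], [0.45, 0.35, 0.2]
far_a, far_b = [0.9, 0.05, 0.05], [0.05, 0.05, 0.9]

print(overlap(close_a, close_b), symmetric_kl(close_a, close_b))
print(overlap(far_a, far_b), symmetric_kl(far_a, far_b))
```

In this toy example the close pair shares most of its probability mass and has a small symmetric KL, while the drifted pair shares little mass and has a symmetric KL that is larger by more than an order of magnitude.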

In deep RL domains, real-time policy distillation achieves high compression of the teacher network, with student networks matching or exceeding teacher performance, and reduces distillation time by approximately 50% relative to sequential teacher–student training (Sun et al., 2019).

5. Comparative Analysis and Limitations

In mixed RLVR, concurrent training on multiple domains leads to destructive gradient interference ($\Phi > 0$), causing capability blending and reduced overall utility. In contrast, static OPD, although free from gradient conflict, suffers from low behavioral overlap ($\mathcal{O}_{\rm low}$) between converged experts, making absorption of domain knowledge inefficient.

CoPD, by interleaving per-domain RLVR and mutual OPD, preserves gradient orthogonality during skill acquisition while maximizing behavioral overlap for high absorption of complementary knowledge. It thus realizes:

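$U_{\rm CoPD}\approx \eta(\mathcal{O}_{\rm mod})\,X(D_1, D_2)-\Phi(D_1, D_2)\approx X(D_1, D_2)$, with $\eta(\mathcal{O}_{\rm mod}) \approx 1$ and $\Phi \approx 0$.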

Open questions include the effects of scaling to many branches ($K \gg 2$) and of more sophisticated merging strategies for parameter integration, such as lottery-ticket ensembles or LoRA fusion (Gu et al., 29 Apr 2026).

6. Broader Implications and Prospects

CoPD demonstrates a novel paradigm for unifying multiple expert competences via parallel, peer-to-peer policy distillation. This approach suggests a scalable model-parallel training regime orthogonal to scaling by parameter count or data volume. Its ability to surpass both domain specialists and standard consolidation baselines indicates potential for creating all-in-one agents across multimodal and multifaceted intelligence domains.

A plausible implication is that future architectures could extend to more modalities (language, vision, code, planning, dialogue) and to dynamic ensemble merging, using the mutual co-evolution principle to build robust, generalist models. Real-time on-device deployment and sample-efficient transfer during policy compression (Sun et al., 2019) are immediate practical impacts. Research into adaptive scheduling, curriculum learning across expert branches, and model fusion techniques could further build on the principles established by CoPD (Gu et al., 29 Apr 2026).


References

Gu et al., Co-Evolving Policy Distillation (2026)

Sun et al., Real-time Policy Distillation in Deep Reinforcement Learning (2019)
