Co-Evolving Policy Distillation (CoPD)
- The paper introduces a novel framework that enables expert policies to co-evolve via mutual on-policy distillation, achieving high behavioral overlap and efficient knowledge absorption.
- It interleaves domain-specific reinforcement learning updates with cross-branch distillation to mitigate gradient interference and preserve complementary skills.
- Empirically, CoPD outperforms static distillation and mixed RL approaches in multimodal tasks, yielding state-of-the-art results in text, image, and video integration.
Co-Evolving Policy Distillation (CoPD) is a framework for consolidating multiple expert policies into a unified, high-performing model by combining ongoing, domain-specialized reinforcement learning with simultaneous, bidirectional on-policy distillation. The central innovation is to let all policies (“branches”) co-evolve, each serving alternately as teacher and student, rather than training experts sequentially and then distilling them statically. This paradigm keeps the branches behaviorally consistent and achieves high absorption of cross-domain competencies, yielding state-of-the-art results in multimodal and multi-domain integration (Gu et al., 29 Apr 2026). The core idea is also present in real-time policy distillation for deep reinforcement learning (Sun et al., 2019), which can be interpreted as a special case where teacher and student are updated in lockstep.
1. Theoretical Foundations and Motivation
Policy distillation addresses the challenge of transferring knowledge or compressing expertise from one or more “teacher” policies into a new “student” policy. In the RL context, standard approaches include:
- Mixed Reinforcement Learning with Verifiable Rewards (RLVR): Training a single model on the merged domain datasets using clipped-surrogate losses such as PPO or GRPO. While this allows simultaneous learning, gradients from disparate skills introduce a capability divergence cost that reduces total utility.
- On-Policy Distillation (OPD): First train separate expert policies to convergence, then distill them into a student policy by minimizing KL divergence on the student’s own rollouts. However, if teacher and student behaviors differ significantly (low behavioral overlap), much of the teacher’s capability cannot be absorbed.
CoPD interleaves concurrent domain-specific RLVR (“exploration”) with mutual on-policy distillation (“absorption”), so that each expert branch regularly incorporates complementary competencies while staying behaviorally close to the others. This keeps behavioral overlap high and the capability divergence cost low, achieving effective multi-tasking without gradient interference or knowledge loss (Gu et al., 29 Apr 2026).
2. Formulation and Objectives
Given a set of expert branches, each initialized from a common base policy and associated with its own domain dataset, CoPD alternates between:
- Phase I (Domain-Specific RLVR):
  - Each branch optimizes the RLVR objective on its own domain data.
  - This avoids cross-domain gradient conflict, since no branch mixes gradients from different domains.
- Phase II (Mutual On-Policy Distillation):
  - On the other branch’s data, compute token-level advantages for distillation using the teacher signal.
  - Formulate the OPD surrogate from these advantages.
  - Each branch combines the RLVR and OPD objectives in its update.
This bidirectional, parallel update maintains high behavioral overlap and low symmetric KL, ensuring efficient cross-branch knowledge transfer while continuously extending each expert’s domain.
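The Phase II quantities can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes the teacher signal is the per-token log-probability gap between teacher and student on tokens the student sampled, and that the OPD surrogate is a PPO-style clipped objective over that gap. The function names, the clip range `eps`, and the weight `opd_weight` are hypothetical.

```python
import math

def distill_advantages(teacher_logps, student_logps):
    """Assumed token-level advantage: teacher log-prob minus student log-prob
    on each token the student sampled (positive where the teacher agrees more)."""
    return [t - s for t, s in zip(teacher_logps, student_logps)]

def clipped_surrogate(new_logps, old_logps, advantages, eps=0.2):
    """PPO-style clipped surrogate averaged over tokens (a quantity to maximize)."""
    total = 0.0
    for new, old, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new - old)                      # importance ratio
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)         # pessimistic bound
    return total / len(advantages)

def combined_loss(rlvr_surrogate, opd_surrogate, opd_weight=1.0):
    """Loss to minimize: negated sum of the RLVR and weighted OPD surrogates."""
    return -(rlvr_surrogate + opd_weight * opd_surrogate)

# Toy example: two sampled tokens, evaluated fully on-policy (ratio == 1).
student_logps = [math.log(0.5), math.log(0.2)]
teacher_logps = [math.log(0.6), math.log(0.1)]
adv = distill_advantages(teacher_logps, student_logps)
opd = clipped_surrogate(student_logps, student_logps, adv)
loss = combined_loss(rlvr_surrogate=0.0, opd_surrogate=opd)
```

When the rollout is fully on-policy the ratio is 1 and the surrogate reduces to the mean advantage, so the clip only matters once the student has been updated away from the policy that generated the rollout.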
3. Algorithmic Implementation
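For two branches, the typical CoPD loop alternates the two phases described above: each branch first takes domain-specific RLVR steps on its own data, and then both branches take mutual OPD steps in which each is distilled from the other on the other branch’s data.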
Central scheduling hyperparameters include the ratio between the two phases, the rollout sampling temperature, and the KL-clip threshold. The batch size may be set, for example, to 256 prompts with 8 rollouts each. For more than two branches, a hub-and-spoke extension is straightforward (Gu et al., 29 Apr 2026).
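The alternation can be sketched as a schedule builder that only lists which update runs when, without any model code. The sketch makes two assumptions: the phase ratio is expressed as hypothetical `rlvr_steps` and `opd_steps` counts per round, and the hub-and-spoke extension pairs the first branch (the hub) with every other branch in both directions. Neither detail is confirmed by the source.

```python
def distill_pairs(branches):
    """(student, teacher) pairs for mutual distillation.
    Assumption: branches[0] acts as the hub; with two branches this
    reduces to the plain bidirectional pair."""
    hub = branches[0]
    pairs = []
    for spoke in branches[1:]:
        pairs.append((hub, spoke))   # hub learns from spoke
        pairs.append((spoke, hub))   # spoke learns from hub
    return pairs

def copd_schedule(branches, rounds, rlvr_steps=1, opd_steps=1):
    """Return a list of (phase, student, teacher) updates.
    teacher is None for RLVR updates, which use only the branch's own data."""
    plan = []
    for _ in range(rounds):
        for _ in range(rlvr_steps):          # Phase I: per-domain RLVR
            for branch in branches:
                plan.append(("rlvr", branch, None))
        for _ in range(opd_steps):           # Phase II: mutual OPD
            for student, teacher in distill_pairs(branches):
                plan.append(("opd", student, teacher))
    return plan

two_branch_plan = copd_schedule(["image", "text"], rounds=1)
three_branch_pairs = distill_pairs(["text", "image", "video"])
```

Keeping the schedule separate from the update functions makes it easy to vary the phase ratio or the pairing topology without touching the training code.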
For value-based RL (as in Atari games), this scheme manifests as real-time alternation of DQN-style updates for the teacher and combined distillation/self-learning updates for the student, with both sharing replay buffers and experience (Sun et al., 2019).
4. Empirical Evaluation
CoPD achieves state-of-the-art results in consolidating text, image, and video reasoning competencies.
Two‐Branch Setting (Image + Text):
| Setting | Image Avg | Text Avg | Overall Avg |
|---|---|---|---|
| Base | 54.00 | 55.78 | 54.74 |
| Image-Expert | 55.76 | 55.51 | 55.65 |
| Text-Expert | 54.88 | 57.89 | 56.13 |
| Mixed RLVR | 55.69 | 55.48 | 55.60† |
| OPD (V→T) | 55.99 | 56.23 | 56.09 |
| OPD (T→V) | 56.44 | 56.09 | 56.29 |
| CoPD | 56.97 | 58.76 | 57.71 |
Three‐Branch Setting (Text + Image + Video):
| Setting | Image Avg | Text Avg | Video Avg | Overall Avg |
|---|---|---|---|---|
| Base | 54.00 | 55.78 | 56.22 | 55.11 |
| MOPD | 56.37 | 56.80 | 58.32† | 56.99 |
| CoPD | 57.12 | 58.63 | 59.21 | 58.12 |
Ablation studies demonstrate that removing any bidirectional distillation path (e.g., image→text or text→image) degrades accuracy by approximately 0.7–1.0%. Each branch alone (prior to parameter merging) outperforms static OPD. During training, behavioral overlap is maintained above 0.90, and symmetric KL divergence remains low, in contrast to static OPD or mixed RLVR baselines, where overlap decreases and KL grows by an order of magnitude (Gu et al., 29 Apr 2026).
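To make the two monitoring quantities concrete, the following sketch computes them for a pair of next-token distributions. The source does not define behavioral overlap, so the sketch assumes it is the shared probability mass (one minus the total variation distance), and it takes symmetric KL to be the sum of the two directed KL divergences. Both function names are hypothetical.

```python
import math

def overlap(p, q):
    """Assumed overlap measure: shared probability mass, in [0, 1]."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def symmetric_kl(p, q, eps=1e-12):
    """Sum of KL(p || q) and KL(q || p); eps guards against log(0)."""
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return kl_pq + kl_qp

# Two branches that mostly agree versus two that have drifted apart.
close_a, close_b = [0.70, 0.20, 0.10], [0.65, 0.25, 0.10]
far_a, far_b = [0.90, 0.05, 0.05], [0.05, 0.05, 0.90]

close_stats = (overlap(close_a, close_b), symmetric_kl(close_a, close_b))
far_stats = (overlap(far_a, far_b), symmetric_kl(far_a, far_b))
```

In practice such statistics would be averaged over many token positions sampled from rollouts. The toy distributions here only show that the two measures move in opposite directions as the branches drift apart.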
In deep RL domains, real-time policy distillation achieves high compression of the teacher network’s parameters, with student networks matching or exceeding teacher performance, and reduces distillation time by approximately 50% relative to sequential teacher–student training (Sun et al., 2019).
5. Comparative Analysis and Limitations
In mixed RLVR, concurrent training on multiple domains leads to destructive gradient interference, causing capability blending and reduced overall utility. In contrast, static OPD, although free from gradient conflict, suffers from low behavioral overlap between converged experts, making absorption of domain knowledge inefficient.
CoPD, by interleaving per-domain RLVR and mutual OPD, preserves gradient orthogonality during skill acquisition while maximizing behavioral overlap for high absorption of complementary knowledge. It thus pairs a low capability divergence cost with high absorption.
The effects of scaling to many branches and of more sophisticated merging strategies for parameter integration, such as lottery-ticket ensembles or LoRA fusion, remain to be explored (Gu et al., 29 Apr 2026).
6. Broader Implications and Prospects
CoPD demonstrates a novel paradigm for unifying multiple expert competences via parallel, peer-to-peer policy distillation. This approach suggests a scalable model-parallel training regime orthogonal to scaling by parameter count or data volume. Its ability to surpass both domain specialists and standard consolidation baselines indicates potential for creating all-in-one agents across multimodal and multifaceted intelligence domains.
A plausible implication is that future architectures could extend to diverse modalities (language, vision, code, planning, dialogue) and to dynamic ensemble merging, leveraging the mutual co-evolution principle to build robust, generalist models. Real-time, on-device deployment and sample-efficient transfer during policy compression (Sun et al., 2019) are immediate practical impacts. Research into adaptive scheduling, curriculum learning across expert branches, and model fusion techniques could further capitalize on the foundational principles established by CoPD (Gu et al., 29 Apr 2026).