
DPH-RL: Diversity-Preserving Hybrid Reinforcement Learning

Updated 9 April 2026
  • DPH-RL is a framework that integrates hybrid policy mechanisms with divergence regularization to maintain high-performing yet diverse reinforcement learning strategies.
  • It mitigates diversity collapse by combining on-policy and off-policy methods with replay buffers and mass-covering divergence measures.
  • Empirical results show that DPH-RL enhances exploration and robustness across continuous control and large language model reasoning tasks.

Diversity-Preserving Hybrid Reinforcement Learning (DPH-RL) encompasses a family of algorithms and frameworks designed to maximize the return of reinforcement learning (RL) agents while explicitly preserving the diversity of explored policies, solution trajectories, or behavior patterns. These methods arise in response to the persistent tendency of standard RL and RL with verifiable reward (RLVR) to exhibit diversity collapse—where entropy diminishes, policies overly concentrate on high-reward modes, and exploration potential is rapidly lost. DPH-RL frameworks integrate hybrid policy or reward mechanisms—often spanning both on-policy and off-policy regimes and leveraging mass-covering divergences, replay buffers, or hybridized state and behavioral representations—to sustain exploration, mitigate mode collapse, and improve success rates across single- and multi-attempt metrics.

1. Formal Problem Definition and Motivation

DPH-RL is fundamentally a multi-objective approach to RL that aims to maximize both expected return and the effective diversity of policy behaviors. Specifically, given a Markov Decision Process (MDP) $\mathcal{M}=(S,A,P,R)$ and episodic rollouts $\tau=(s_0,a_0,\ldots,s_T,a_T)$, DPH-RL frameworks introduce:

  • Quality Objective: $F(\theta) = J(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta}\left[\sum_{t=0}^T r(s_t,a_t)\right]$ for policy $\pi_\theta$.
  • Diversity Objective: A behavioral descriptor $d: \tau \rightarrow \mathbb{R}^D$ (e.g., foot-contact frequencies, solution semantic variance, or state visitation entropy), with the goal of spanning the descriptor space by optimizing $\{\theta_i\}$ such that the set $\{d(\theta_i)\}$ covers as many "niches" or solution modes as possible (Lim et al., 2023, Kang et al., 2 Feb 2026).

Motivation arises from empirical findings that classic RL with mode-seeking regularizers (notably the reverse KL divergence) rapidly narrows policy support, leading to catastrophic forgetting and degradation on multi-attempt (pass@$k$) metrics despite single-attempt (pass@1) improvements (Li et al., 9 Sep 2025). DPH-RL counters this collapse by sustaining or actively encouraging the retention and rehearsal of multiple solution clusters throughout training.
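The asymmetry between mode-seeking and mass-covering divergences is easy to verify numerically. The following minimal sketch (the two-mode distributions are illustrative, not taken from the cited papers) shows that reverse KL barely penalizes a policy that abandons one of two reference modes, while forward KL penalizes it heavily:

```python
import math

def kl(p, q):
    """Discrete KL divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Reference policy spreads mass evenly over two solution modes.
p_ref = [0.5, 0.5]
# Collapsed policy keeps almost all mass on one mode.
q_collapsed = [0.99, 0.01]

forward = kl(p_ref, q_collapsed)   # mass-covering: large penalty (~1.61 nats)
reverse = kl(q_collapsed, p_ref)   # mode-seeking: small penalty (~0.64 nats)
```

Used as a regularizer, the forward direction therefore punishes dropping any mode the reference policy could solve, which is exactly the rehearsal pressure DPH-RL exploits.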

2. Core Methodologies in DPH-RL

DPH-RL employs a range of architectural innovations and hybridization strategies, grouped into several mechanisms:

A. Hybrid Policy and Latent-Text Integration

Recent frameworks (e.g., LaDi-RL) explicitly couple a discrete token-space MDP with a continuous latent-space MDP. The latent policy $\pi_{\text{latent}}(a_t^{\text{lat}}\mid s_t^{\text{lat}};\phi)$ is implemented as a latent diffusion process in which trajectories are generated via guided reverse diffusion—parameterized by neural networks—mapping initial semantic noise to denoised reasoning embeddings. The decoded outputs are further optimized by a token-level text policy (Kang et al., 2 Feb 2026).

B. Replay Buffer and Off-Policy Diversity Anchoring

Time-sensitive replay buffers store recent, verifiably correct trajectories and serve as a reference against which mass-covering divergences (e.g., Jensen–Shannon) between the current and historical policies are imposed, providing a continual diversity anchor (Li et al., 17 Mar 2026). The buffer capacity decreases over time to track the policy's decreasing entropy, while admission criteria ensure only high-confidence, reward-verifiable samples are stored.
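A minimal sketch of such a buffer, assuming a linear capacity-decay schedule and treating the admission check as a boolean flag (the class, method names, and the decay schedule are illustrative assumptions, not the paper's exact design):

```python
from collections import deque

class DecayingReplayBuffer:
    """FIFO buffer of verified-correct trajectories whose capacity
    shrinks over training, loosely following the dynamic-buffer idea."""

    def __init__(self, initial_capacity, final_capacity, total_steps):
        self.initial = initial_capacity
        self.final = final_capacity
        self.total = total_steps
        self.buf = deque(maxlen=initial_capacity)

    def capacity(self, step):
        # Linear decay from initial to final capacity.
        frac = min(step / self.total, 1.0)
        return max(self.final,
                   round(self.initial - frac * (self.initial - self.final)))

    def add(self, step, trajectory, reward_verified):
        # Admission criterion: only reward-verified trajectories enter.
        if not reward_verified:
            return
        while len(self.buf) >= self.capacity(step):
            self.buf.popleft()          # evict oldest entries FIFO-style
        self.buf.append(trajectory)
```

Late in training the shrinking capacity keeps only the most recent verified solutions as the divergence reference, matching the policy's lower entropy.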

C. Mass-Covering Divergence Regularization

DPH-RL generalizes classic mode-seeking reverse-KL regularization to mass-covering $f$-divergences—forward KL, Jensen–Shannon (JS), or $\alpha$-divergences—which penalize the model for reducing support on any solution acquired by the initial or reference policy. The divergence penalty acts as a rehearsal mechanism, maintaining multi-modal support and directly counteracting entropy collapse (Li et al., 9 Sep 2025).

D. Hybrid Intrinsic Reward Fusion

In settings with sparse or hard-exploration rewards, hybrid intrinsic reward models (e.g., HIRE) combine multiple intrinsic signals—such as curiosity, episodic pseudo-counts, local entropy, and elliptical novelty—using fusion functions (summation, product, cycle, maximum). These amplify coverage of state or feature space, modulate exploration, and yield robust diversity enhancement (Yuan et al., 22 Jan 2025).

3. Algorithmic Frameworks and Implementation Details

Latent Diffusion-Based DPH-RL (LaDi-RL)

LaDi-RL alternates between:

  • Sampling a group of latent diffusion trajectories per query via multi-step reverse diffusion, guided by diversity repulsion terms that push latent samples apart with time-decaying strength.
  • Decoding each trajectory to candidate outputs, computing rewards, and normalizing advantages at both the latent-trajectory and token levels.
  • Updating latent and text policies via separate GRPO objectives, combined with a weighting coefficient.

The reverse diffusion policy is optimized via a score-matching loss during pretraining, and the RL phase uses clipped group advantages and importance ratios for stability. Diversity is empirically maintained by stochastic multi-step denoising and explicit pairwise repulsion in latent space (Kang et al., 2 Feb 2026).
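The pairwise repulsion with time-decaying strength can be sketched as follows; the inverse-distance kernel and the linear decay schedule are assumptions for illustration, not the paper's exact guidance term:

```python
import math

def repulsion_forces(latents, strength):
    """Pairwise repulsion: each latent vector is pushed away from every
    other latent along their difference vector, with magnitude falling
    off with distance (inverse-square kernel assumed here)."""
    eps = 1e-6
    dim = len(latents[0])
    forces = [[0.0] * dim for _ in latents]
    for i, zi in enumerate(latents):
        for j, zj in enumerate(latents):
            if i == j:
                continue
            diff = [a - b for a, b in zip(zi, zj)]
            dist = math.sqrt(sum(d * d for d in diff)) + eps
            for k in range(dim):
                forces[i][k] += strength * diff[k] / (dist * dist)
    return forces

def guided_step(latents, step, total_steps, base_strength=0.1):
    # Time-decaying strength: repulsion fades to zero late in denoising,
    # so near-final samples are no longer perturbed apart.
    strength = base_strength * (1.0 - step / total_steps)
    forces = repulsion_forces(latents, strength)
    return [[z + f for z, f in zip(zi, fi)]
            for zi, fi in zip(latents, forces)]
```

Early denoising steps actively separate the group's latents; by the final step the repulsion term vanishes and the samples settle into their (now distinct) modes.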

Replay-Buffer DPH-RL with Jensen–Shannon Regularization

The buffer collects only reward-verified trajectories, updated in a FIFO manner, and decays in size over time. The JS regularizer is estimated efficiently per sample using the importance ratio between the current policy and the buffer-induced reference policy,

$r_t = \dfrac{\pi_\theta(a_t\mid s_t)}{\pi_{\text{ref}}(a_t\mid s_t)},$

$D_{\text{JS}}(\pi_{\text{ref}}\,\|\,\pi_\theta) = \dfrac{1}{2}\,\mathbb{E}_{\pi_{\text{ref}}}\!\left[\log\dfrac{2}{1+r_t}\right] + \dfrac{1}{2}\,\mathbb{E}_{\pi_\theta}\!\left[\log\dfrac{2\,r_t}{1+r_t}\right].$

Training alternates on-policy GRPO steps and off-policy diversity-anchoring steps, imposing the JS-penalty relative to the buffer-induced reference distribution (Li et al., 17 Mar 2026).
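The per-sample, importance-ratio estimation of the JS penalty can be checked on a discrete toy distribution (function names and the Monte Carlo setup are illustrative):

```python
import math
import random

def js_exact(p, q):
    """Exact Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def js_per_sample(p_ref, p_cur, n=200_000, seed=0):
    """Monte Carlo JS estimate that touches each sample only through the
    importance ratio r(a) = p_cur(a) / p_ref(a), as a per-sample
    regularizer would during training."""
    rng = random.Random(seed)
    actions = list(range(len(p_ref)))
    ratio = lambda a: p_cur[a] / p_ref[a]
    # Reference-side term: E_ref[log 2 / (1 + r)]
    ref_draws = rng.choices(actions, weights=p_ref, k=n)
    ref_term = sum(math.log(2 / (1 + ratio(a))) for a in ref_draws) / n
    # Current-policy term: E_cur[log 2r / (1 + r)]
    cur_draws = rng.choices(actions, weights=p_cur, k=n)
    cur_term = sum(math.log(2 * ratio(a) / (1 + ratio(a))) for a in cur_draws) / n
    return 0.5 * (ref_term + cur_term)
```

The estimator converges to the exact JS value, so the penalty can be applied token-by-token without ever materializing the full reference distribution.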

Quality-Diversity Actor-Critic Hybrids

Actor-critic architectures operate atop a discrete archive that tracks behavioral diversity via descriptors (e.g., contact rates for locomotion). Genetic and policy-gradient offspring are evaluated, inserted based on fitness and novelty, and the critic/actor are trained offline on transitions. Stochastic policy variants (SAC, DroQ) enable further diversity amplification by maximizing entropy or employing dropout regularization (Lim et al., 2023).
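Archive insertion in these quality-diversity hybrids can be sketched as a fitness-gated grid update in the MAP-Elites style (the grid resolution and the assumption that descriptors are normalized to [0, 1] are illustrative):

```python
def archive_insert(archive, descriptor, fitness, policy, grid_size=10):
    """MAP-Elites-style insertion: discretize the behavioral descriptor
    into a grid cell and keep only the fittest policy per cell, so the
    archive spans many niches while improving each one."""
    cell = tuple(min(int(d * grid_size), grid_size - 1) for d in descriptor)
    incumbent = archive.get(cell)
    if incumbent is None or fitness > incumbent[0]:
        archive[cell] = (fitness, policy)   # new niche or improved elite
        return True
    return False
```

Genetic and policy-gradient offspring both pass through this gate, so a low-fitness but behaviorally novel policy is retained (it opens a new cell) while a redundant one is discarded.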

Hybrid Intrinsic Reward Models

Hybrid intrinsic signals are fused per time step via one of several operators (with $r^{(i)}_t$ denoting the $i$-th intrinsic bonus at step $t$):

  • Summation: $r^{\text{int}}_t = \sum_i \tilde{r}^{(i)}_t$, where each signal is RMS-normalized before summing
  • Product: $r^{\text{int}}_t = \prod_i r^{(i)}_t$
  • Cycle: $r^{\text{int}}_t = r^{(k_t)}_t$, periodically rotating the active source $k_t$ among the signals
  • Maximum: $r^{\text{int}}_t = \max_i r^{(i)}_t$

This combination ensures that global and local novelty, surprise, and entropy bonuses jointly modulate exploration, supporting both broad and deep behavioral coverage (Yuan et al., 22 Jan 2025).
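The four fusion operators can be sketched as follows, under the assumptions that summation operates on RMS-normalized signals and that cycle rotates the active source by time step (function names are illustrative):

```python
import math

def rms_normalize(values):
    """Scale each signal by the group RMS so no single bonus dominates."""
    rms = math.sqrt(sum(v * v for v in values) / len(values)) or 1.0
    return [v / rms for v in values]

def fuse(signals, mode, t=0):
    """Fuse per-step intrinsic bonuses from several sources; `signals`
    holds one value per source at time step t."""
    if mode == "summation":
        return sum(rms_normalize(signals))
    if mode == "product":
        return math.prod(signals)
    if mode == "cycle":            # rotate through sources over time
        return signals[t % len(signals)]
    if mode == "maximum":
        return max(signals)
    raise ValueError(f"unknown fusion mode: {mode}")
```

Summation and product blend all sources each step, while cycle and maximum let one source dominate at a time, which is why they trade breadth against depth of exploration differently.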

4. Diversity Preservation Mechanisms

Diversity is preserved through:

  • Multi-Modal Latent Stochasticity: Multi-step denoising in latent space repeatedly injects independent stochasticity, capturing multiple posterior solution modes without suppressing minority paths (Kang et al., 2 Feb 2026).
  • Repulsion-Based Guidance: Latent trajectories incurring proximity repel each other with a force decayed toward zero over denoising steps, ensuring coexisting solutions are not collapsed prematurely.
  • Archive/Niche-Driven Evaluation: Behavioral descriptors and archive insertion criteria in QD-RL encourage sustained occupation of diverse behavioral regions (Lim et al., 2023).
  • Divergence Penalties: Forward-KL and JS-divergence penalize loss of support on previously successful trajectories or tokens, in contrast with reverse-KL, which only penalizes mode-shifting away from the current solution peaks (Li et al., 9 Sep 2025).
  • Adaptive Replay: Only high-reward, temporally recent trajectories are replayed for diversity reference, balancing retention with policy evolution (Li et al., 17 Mar 2026).
  • Hybrid Intrinsic Signals: Multiple orthogonal novelty metrics (curiosity, episodic rarity, entropy, outlier detection) are combined to drive uncorrelated forms of exploration (Yuan et al., 22 Jan 2025).

5. Empirical Results and Benchmarks

DPH-RL frameworks have achieved substantial improvements in both canonical RL contexts (e.g., continuous control, hard-exploration environments) and LLM-based reasoning settings:

| Benchmark | Baseline Pass@1 / Pass@k | DPH-RL Variant | Pass@1 / Pass@k Improvement |
|---|---|---|---|
| Qwen3-8B Code Gen. | 60.58% / N/A | LaDi-RL | 77.08% (+16.5%) / up to +12.8% |
| Math Reasoning (Qwen) | 35.77% / N/A | LaDi-RL | 43.45% (+7.68%) |
| SQL (Llama3-8B, BIRD) | 59.4 / 68.4 (Pass@1/16) | Dynamic-buffer DPH-RL (JS) | 62.7 / 72.9 (+3.3% / +4.5%) |
| Math-Long (Qwen3-4B) | 29.8% mean@256 | Dynamic-buffer DPH-RL (JS) | 34.1% (+4.3%) |

Empirical diagnostics include sustained reward variance, broadened token support (e.g., Rank-1 token probability stabilized at 80–85% under JS regularization vs. >90% for vanilla RL), QD-score, and archive coverage. Ablations consistently show that hybridization—whether in policy, divergence, or intrinsic-reward space—substantially outperforms single-mode approaches and is robust to parameter scaling (Kang et al., 2 Feb 2026, Li et al., 17 Mar 2026, Li et al., 9 Sep 2025, Yuan et al., 22 Jan 2025, Lim et al., 2023).
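The Rank-1 probability and entropy diagnostics can be computed directly from next-token distributions; a minimal sketch with illustrative distributions (the example values are not from the cited papers):

```python
import math

def rank1_probability(token_probs):
    """Probability mass on the single most likely next token; values
    pinned near 1.0 indicate collapsing token support."""
    return max(token_probs)

def entropy(token_probs):
    """Shannon entropy (nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in token_probs if p > 0)

collapsed = [0.95, 0.03, 0.02]   # near-deterministic: low diversity
diverse = [0.60, 0.25, 0.15]     # broader support: higher entropy
```

Tracking these two quantities over training is enough to see the collapse that JS regularization prevents: vanilla RLVR drives Rank-1 probability up and entropy down monotonically.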

6. Practical Considerations and Hyperparameter Guidelines

  • Divergence weight: the regularization coefficient must be tuned per divergence (forward-KL/$\alpha$-divergence vs. JS). Excessive regularization stifles exploration; too little reverts to classic RLVR behavior (Li et al., 9 Sep 2025).
  • Buffer size and decay: early training benefits from larger replay windows, which are shrunk as policy entropy decays (Li et al., 17 Mar 2026).
  • Batch composition: apply an 8:1 ratio of exploration samples (no divergence penalty) to verified-correct samples (divergence-regularized) in mixed-batch RL.
  • Policy architecture and critic training: sufficiently many offline critic/actor gradient steps dramatically improve the effectiveness of off-policy/quality-diversity hybrids (Lim et al., 2023).
  • Intrinsic signals: using 2–3 carefully selected metrics (e.g., NGU + RE3) with the cycle fusion strategy yields balanced computational efficiency and robustness (Yuan et al., 22 Jan 2025).

7. Theoretical and Empirical Limitations

Not all DRL advances transfer directly to the DPH-RL/QD-RL context (e.g., actor/critic dropout brings limited benefits when replay is highly diverse) (Lim et al., 2023). Excessive or poorly weighted divergence regularization can limit adaptive exploration. Archive management and behavioral descriptor design introduce trade-offs between granularity and computational cost.

DPH-RL frameworks demonstrably outperform classic RL, standard RLVR, and previous replay-based methods in both diversity metrics and final task performance, establishing mass-covering divergences and hybrid replay/buffered architectures as state of the art for diversity preservation in RL and RLVR contexts (Kang et al., 2 Feb 2026, Li et al., 17 Mar 2026, Li et al., 9 Sep 2025).
