DARLING: Diversity-Aware Reinforcement Learning

Updated 3 September 2025
  • DARLING is a reinforcement learning paradigm that integrates explicit diversity signals with quality rewards to avoid mode collapse and ensure broad exploration.
  • It employs a learned semantic partition function to measure non-equivalence among responses, promoting genuine novelty over superficial variation.
  • Empirical evaluations in creative writing and competition mathematics show DARLING improves both solution quality and diversity compared to quality-only methods.

Diversity-Aware Reinforcement Learning (DARLING) is a comprehensive paradigm for reinforcement learning (RL) that systematically incorporates explicit diversity signals into the policy optimization process. In contrast to traditional RL pipelines, which primarily target reward maximization or adherence to “helpfulness” or “correctness” metrics, DARLING simultaneously reinforces both the quality and the semantic diversity of agent behaviors or outputs. This dual-objective approach is motivated by the observation that optimizing solely for reward or accuracy often leads to distributional collapse, reduction in solution variety, and inadequate exploration—especially in creative or problem-solving domains where the solution manifold is richly structured.

1. Definition and Motivation

DARLING is defined by the explicit inclusion of a diversity metric (one that goes beyond surface-level variation) in the reinforcement learning objective used for policy updates. This addition counteracts the tendency of RL-trained agents, including LLMs, to over-concentrate on high-reward or high-probability outputs, narrowing their range of behaviors and curtailing exploration. The core motivation is to promote robust generalization, to prevent model collapse in multi-modal or creative output spaces, and to enable effective discovery of novel or rare solutions, both in tasks with ambiguous, non-verifiable objectives and in highly structured, verifiable settings such as competition mathematics (Li et al., 2 Sep 2025).

2. Semantic Diversity Signals and Partitioning

DARLING introduces semantic diversity measurement through a learned partition function that operates at the level of response meaning rather than lexical or n-gram variation. For a given prompt and a set of candidate responses $Y = \{y_1, \dots, y_n\}$, a classifier trained on human-annotated examples determines whether two outputs are semantically equivalent:

d(y_i, y_j) = \mathbb{1}\{ y_i \text{ and } y_j \text{ are not semantically equivalent} \}.

The diversity score for a candidate $y_i$ is then:

\text{Div}(y_i \mid Y) = \frac{1}{n-1} \sum_{j \neq i} d(y_i, y_j).

This quantity is normalized to $[0,1]$ (via a normalization function $\text{Norm}(\cdot)$) to be compatible with the scale of the quality reward. As a result, diversity is measured as an expectation over semantic equivalence groupings, with each output contributing proportionally to the overall spread of meanings in $Y$. The learned nature of the classifier ensures that this process extends beyond superficial form to genuine semantic distinctions, supporting broad applicability across open-ended and highly structured RL settings.
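
As a concrete illustration, here is a minimal sketch of the diversity computation, assuming a pairwise predicate `is_equivalent` that stands in for the learned partition classifier (the name and interface are placeholders, not the paper's implementation):

```python
from typing import Callable, List


def diversity_scores(
    responses: List[str],
    is_equivalent: Callable[[str, str], bool],
) -> List[float]:
    """Div(y_i | Y): fraction of the other responses that y_i is NOT
    semantically equivalent to, per the formula above."""
    n = len(responses)
    if n < 2:
        return [0.0] * n
    scores = []
    for i, y_i in enumerate(responses):
        distinct = sum(
            1
            for j, y_j in enumerate(responses)
            if j != i and not is_equivalent(y_i, y_j)
        )
        scores.append(distinct / (n - 1))
    return scores
```

Under this definition, a response equivalent to every other sample in the group receives a score of 0, while a response unlike all the others receives a score of 1.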

3. Joint Optimization of Quality and Diversity

DARLING combines the computed diversity score with a standard quality reward (obtained from a reward model, evaluator, or extrinsic signal) via a multiplicative fusion:

r_{\text{DARLING}}(x, y_i \mid Y) = r(x, y_i) \times \text{Norm}(\text{Div}(y_i \mid Y)),

where $r(x, y_i)$ is the quality score (task-specific reward) and $x$ is the input. This design ensures that rewards are maximized only when both quality and diversity are present: quality-only improvements are insufficient unless accompanied by sufficient novelty relative to the remainder of $Y$.
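
A minimal sketch of this fusion follows, under the simplifying assumption that $\text{Norm}(\cdot)$ is the identity (the diversity score already lies in $[0,1]$ by construction; a rescaling could be substituted to soften how strongly redundancy suppresses the reward):

```python
from typing import List


def darling_rewards(quality: List[float], diversity: List[float]) -> List[float]:
    """Fuse per-response quality r(x, y_i) with Norm(Div(y_i | Y))
    multiplicatively; Norm is taken as the identity in this sketch."""
    return [q * d for q, d in zip(quality, diversity)]
```

A response that is semantically redundant within its group (diversity 0) therefore earns no reward, no matter how high its quality score.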

For group-based RL methods such as Group Relative Policy Optimization (GRPO), the advantage of each response is adjusted accordingly:

A(y_i) = r_{\text{DARLING}}(x, y_i \mid Y) - \frac{1}{n} \sum_{j} r_{\text{DARLING}}(x, y_j \mid Y).

Gradients are then computed with respect to these diversity-aware advantage terms. The resulting RL update prioritizes selection of actions or generations that occupy sparsely populated semantic regions with high reward, thereby expanding coverage of the solution space and catalyzing exploration during online RL (Li et al., 2 Sep 2025).
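
Continuing the sketch, the diversity-aware advantage is the fused reward centered on its group mean (standard GRPO implementations often also divide by the group standard deviation; only the mean-centering stated above is reproduced here):

```python
from typing import List


def diversity_aware_advantages(fused_rewards: List[float]) -> List[float]:
    """A(y_i) = r_DARLING(x, y_i | Y) - (1/n) * sum_j r_DARLING(x, y_j | Y)."""
    baseline = sum(fused_rewards) / len(fused_rewards)
    return [r - baseline for r in fused_rewards]


# Putting the pieces together for one prompt with n sampled responses:
#   div   = diversity_scores(responses, is_equivalent)
#   fused = darling_rewards(quality, div)
#   adv   = diversity_aware_advantages(fused)
# The policy-gradient step then weights each response's log-probability by adv.
```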

4. Empirical Evaluation across Regimes

DARLING has been evaluated in two principal regimes:

  1. Non-Verifiable Tasks: In instruction following and creative writing, where objective ground-truth labels may be absent, DARLING-trained models (built on Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct) consistently outperform quality-only RL baselines on external metrics such as AlpacaEval win rates. Notably, these models show increased output novelty (via “Distinct” and n-gram-based metrics), generating more varied and richer creative responses without degrading base quality.
  2. Verifiable Tasks: In explicit problem-solving domains such as competition mathematics (AIME25, HMMT, OlympiadBench, Brumo), DARLING yields improvements in both pass@1 (solution quality) and pass@k (solution variety). By explicitly linking the reward signal to semantic diversity, models produce a wider range of valid solutions, increasing the likelihood of success under competitive evaluation schemes (Li et al., 2 Sep 2025).

The improvement in exploration as a result of diversity optimization is demonstrated by DARLING’s higher pass@k rates and its ability to avoid output collapse.
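
For context, pass@k is conventionally reported using the standard unbiased estimator from the code-generation literature (general background, not something specified in this article): given n sampled solutions of which c are correct, it estimates the probability that at least one of k draws is correct.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```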

5. Comparison to Quality-Only RL Baselines

The key distinction between DARLING and conventional RL training lies in the calibration and range of exploration. While quality-only RL sharpens the output distribution, often leading to high-confidence but narrow solution modes, DARLING’s diversity signal ensures that multiple semantically distinct, high-quality options are explored and reinforced. This is especially salient in creative, open-domain tasks, where quality-only optimization leads to “mode collapse,” and in verifiable regimes with multiple valid reasoning paths. The paper demonstrates that explicitly optimizing for diversity does not trade off accuracy, but rather enlarges the effective solution frontier, improving both quality and novelty.

6. Methodological and Practical Implications

The integration of diversity-aware objectives into RL is a generalizable principle applicable to a spectrum of RL tasks, spanning autonomous control, curriculum and environment design, population-based RL, and post-training of generative models. By replacing surface-based diversity proxies (e.g., n-gram distinctiveness, response length variability) with a learned semantic equivalence indicator, DARLING facilitates principled reward engineering for tasks where performance cannot be entirely captured by accuracy alone.

The framework is compatible with both online and offline RL algorithms, and can be instantiated on top of common policy gradient methods, actor-critic approaches, or group-based RL. Importantly, the learned semantic partition function can be adapted to domain-specific needs, allowing for tailored diversity metrics (e.g., in mathematical tasks, creative writing, or coding).
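
As an illustration of such domain adaptation, the toy predicate below treats two competition-math solutions as equivalent when their final boxed answers match; it is a hypothetical stand-in for the learned partition function, which the article describes as a trained classifier over response meaning rather than a hand-written rule.

```python
import re


def math_equivalent(y_i: str, y_j: str) -> bool:
    """Toy equivalence relation for math outputs: compare final \\boxed{...}
    answers (naive brace handling), falling back to the last non-empty line."""

    def final_answer(text: str) -> str:
        boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
        if boxed:
            return boxed[-1].strip()
        lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
        return lines[-1] if lines else ""

    return final_answer(y_i) == final_answer(y_j)
```

A predicate like this could be passed as `is_equivalent` to the diversity sketch in Section 2.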

7. Future Research Directions

Potential directions for future research include:

  • Adapting DARLING-style diversity metrics for domains with ambiguous or composite solution criteria.
  • Exploring interaction effects between diversity-aware RL and adversarial robustness, particularly in the presence of non-stationary or adversarially designed environments.
  • Extending semantic partition-based diversity measurement to multi-agent RL settings, distributed RL, and population-based curriculum learning.
  • Investigating efficiency trade-offs and scaling properties when moving to very large-scale model post-training or agent populations.
  • Extending the learned semantic equivalence mechanism to multimodal outputs and hierarchical policy spaces for broader generalization.

This paradigm is positioned as a solution to the exploration–exploitation dilemma in RL, affirming that explicit semantic diversity signals—when integrated through partition-based, learned similarity measures—enrich both the exploration behavior and the ultimate quality of agent solutions (Li et al., 2 Sep 2025).
