- The paper introduces DARLING, a diversity-aware reinforcement learning framework that integrates semantic partitioning with multiplicative reward fusion.
- The approach achieves superior quality and diversity metrics on tasks such as instruction following, creative writing, and competition math by rewarding responses that are both high-quality and semantically distinct.
- Multiplicative fusion and a learned semantic classifier enable robust, scalable optimization that counteracts the common tradeoff between response quality and diversity.
Joint Optimization of Diversity and Quality in LLM Generations
Motivation and Problem Statement
The post-training of LLMs has predominantly focused on maximizing response quality, often at the expense of output diversity. After reinforcement learning (RL) against human feedback or reward models, models frequently produce highly similar or even near-duplicate responses to the same prompt, limiting their utility in creative, exploratory, or open-ended tasks. The paper introduces Diversity-Aware Reinforcement Learning (DARLING), a framework that explicitly incorporates semantic diversity into the RL objective, aiming to jointly optimize response quality and diversity.
Methodology: Diversity-Aware Reinforcement Learning (DARLING)
DARLING extends the standard RL post-training paradigm by introducing a learned semantic partition function to measure diversity at the response level. The core pipeline is as follows:
- Semantic Partitioning: For each prompt, multiple model generations are clustered into equivalence classes using a learned classifier that predicts semantic similarity, rather than relying on surface-level lexical overlap.
- Diversity Metric: The diversity of a response is defined as the normalized count of semantically distinct responses within the set of generations for a prompt.
- Reward Fusion: The effective reward for each response is computed as the product of its quality reward (from a reward model or task-specific metric) and its normalized diversity score. This multiplicative fusion ensures that only responses that are both high-quality and semantically distinct are strongly reinforced (see the sketch after this list).
- RL Objective: The diversity-aware reward is integrated into a Group Relative Policy Optimization (GRPO) framework, with modifications to advantage normalization and loss aggregation to improve stability and avoid bias toward longer sequences.
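The fusion step can be made concrete with a short sketch. The helper names and the exact normalization below are illustrative assumptions based on the description above, not the paper's reference implementation: diversity is taken as the fraction of the other generations that fall outside a response's semantic cluster, and the effective reward is the product of quality and diversity.

```python
import numpy as np

def diversity_scores(cluster_ids: list[int]) -> np.ndarray:
    """Per-response diversity: fraction of the other generations for the same
    prompt that land in a different semantic cluster (normalization is an
    assumption; the paper's exact definition may differ)."""
    ids = np.asarray(cluster_ids)
    n = len(ids)
    distinct_counts = np.array([(ids != c).sum() for c in ids])
    return distinct_counts / max(n - 1, 1)

def fused_rewards(quality: np.ndarray, cluster_ids: list[int]) -> np.ndarray:
    """Multiplicative fusion: only responses that are both high-quality and
    semantically distinct receive a large effective reward."""
    return quality * diversity_scores(cluster_ids)

# Toy example: 4 rollouts for one prompt; rollouts 0 and 1 are semantically equivalent.
quality = np.array([0.9, 0.85, 0.4, 0.7])
clusters = [0, 0, 1, 2]
print(fused_rewards(quality, clusters))  # the near-duplicate pair is down-weighted
```

Because the fusion is multiplicative, a high-quality response that merely repeats another rollout still loses reward, which is what pushes the policy toward distinct yet strong generations.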
Figure 1: Diversity-Aware Reinforcement Learning (DARLING): LLM generations are partitioned into semantically equivalent clusters; DARLING amplifies the probability of diverse, high-quality responses.
Semantic Diversity Classifier
A key technical contribution is the use of a learned classifier to determine semantic equivalence between responses. This classifier is trained on human-annotated data and extended to handle long contexts (up to 8192 tokens). The classifier outperforms proprietary models in matching human judgments of semantic similarity, as shown in the paper's evaluation.
Figure 2: Classifier-based approaches outperform proprietary models in determining semantic similarity between responses.
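To illustrate how such a pairwise classifier can induce the partition used in the reward, here is a minimal greedy clustering sketch. The `is_equivalent` callable stands in for the learned long-context classifier; its interface and the exemplar-matching strategy are assumptions for illustration, not the paper's actual procedure.

```python
from typing import Callable

def partition_responses(
    responses: list[str],
    is_equivalent: Callable[[str, str], bool],
) -> list[int]:
    """Greedily assign each response to the first cluster whose exemplar the
    classifier judges semantically equivalent; otherwise open a new cluster."""
    cluster_ids: list[int] = []
    representatives: list[str] = []  # one exemplar per cluster
    for resp in responses:
        for cid, rep in enumerate(representatives):
            if is_equivalent(resp, rep):
                cluster_ids.append(cid)
                break
        else:
            representatives.append(resp)
            cluster_ids.append(len(representatives) - 1)
    return cluster_ids

# Example with a trivial stand-in classifier (exact match); in practice this
# would be the fine-tuned long-context semantic-equivalence model.
labels = partition_responses(
    ["Paris.", "The capital is Paris.", "Paris."],
    is_equivalent=lambda a, b: a.strip().lower() == b.strip().lower(),
)
print(labels)  # [0, 1, 0]
```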
Experimental Results: Non-Verifiable and Verifiable Tasks
Non-Verifiable Tasks
DARLING is evaluated on instruction following and creative writing tasks using Llama-3.1-8B-Instruct and Llama-3.3-70B-Instruct as base models. The method is compared against GRPO, DivPO, and GRPO-Unlikeliness baselines. Evaluation metrics include AlpacaEval 2.0, ArenaHard v2.0, EQ-Bench (ELO), and NoveltyBench (semantic and lexical diversity).
Key findings:
- DARLING improves both quality (AlpacaEval 2.0, ArenaHard v2.0) and diversity (NoveltyBench) relative to the GRPO, DivPO, and GRPO-Unlikeliness baselines at both the 8B and 70B scales.
- Qualitative analysis on EQ-Bench shows that DARLING-trained models excel on rubrics such as "Interesting and Original" and "Avoids Cliche," which is directly attributable to the increased semantic diversity.
Verifiable Tasks (Competition Math)
DARLING is further evaluated on competition math benchmarks (AIME25, OlympiadBench, HMMT, Brumo) using Qwen3-4B-Base and Qwen3-14B-Base. The diversity classifier is adapted to the math domain via additional annotation and fine-tuning.
Key findings:
- DARLING improves both solution accuracy and the diversity of generated solutions over the quality-only GRPO baseline at both model scales, mirroring the gains observed on non-verifiable tasks.
Ablation Studies
The paper conducts extensive ablations to justify design choices:
- Multiplicative vs. Additive Reward Fusion: Multiplicative fusion of quality and diversity rewards yields more robust improvements and avoids the need for scale tuning.
- Semantic vs. Lexical Diversity: Using n-gram-based diversity rewards underperforms semantic partitioning, especially in tasks requiring deep understanding (e.g., math), and can be gamed by the model.
- Advantage Normalization: Removing standard deviation normalization in advantage calculation improves performance in settings with dense, noisy rewards, as it prevents amplification of spurious differences.
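The advantage-normalization ablation can be illustrated with a small numerical sketch. The function below is a simplified, illustrative version of group-relative advantage computation, not the paper's exact GRPO implementation.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, use_std: bool = False) -> np.ndarray:
    """Group-relative advantages for one prompt's rollouts. Standard GRPO
    divides by the group's reward std; the ablation above drops that division
    so that small, noisy reward differences are not amplified."""
    centered = rewards - rewards.mean()
    if use_std:
        return centered / (rewards.std() + 1e-6)
    return centered

# With dense, noisy fused rewards, mean-centering alone is the more stable choice.
rewards = np.array([0.51, 0.50, 0.49, 0.52])
print(group_relative_advantages(rewards, use_std=True))   # tiny gaps blown up to ~+/-1.3
print(group_relative_advantages(rewards, use_std=False))  # gaps kept at their true scale
```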
Implementation Considerations
- Classifier Training: The semantic equivalence classifier must be trained on domain-relevant, human-annotated data and extended to the maximum context length expected in deployment.
- Batching and Rollouts: For each prompt, multiple rollouts are required to compute the diversity reward; this increases computational cost but is necessary for accurate diversity estimation (a schematic training step combining these pieces is sketched after this list).
- Reward Model Selection: The quality reward should be well-calibrated and aligned with the target application; the diversity reward is domain-agnostic but depends on classifier accuracy.
- Scalability: The method is demonstrated across model scales (8B and 70B for non-verifiable tasks; 4B and 14B for competition math), with consistent improvements, indicating scalability to large models.
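Putting these pieces together, a schematic per-prompt training step might look like the following. It reuses the helper functions from the earlier sketches; the `policy`, `reward_model`, and `classifier` interfaces are hypothetical placeholders, not the paper's actual API.

```python
import numpy as np

def darling_style_step(prompt, policy, reward_model, classifier, num_rollouts: int = 8):
    """One schematic DARLING-style update for a single prompt (illustrative only)."""
    # 1. Sample a group of rollouts so per-prompt diversity can be estimated.
    responses = [policy.generate(prompt) for _ in range(num_rollouts)]
    # 2. Score quality with the reward model (or a task-specific metric).
    quality = np.array([reward_model.score(prompt, r) for r in responses])
    # 3. Partition into semantic equivalence classes and fuse rewards multiplicatively.
    clusters = partition_responses(responses, classifier)    # from the earlier sketch
    rewards = fused_rewards(quality, clusters)                # from the earlier sketch
    # 4. Group-relative advantages without std normalization (per the ablation).
    advantages = group_relative_advantages(rewards, use_std=False)
    # 5. A GRPO-style policy-gradient update (details abstracted away here).
    policy.update(prompt, responses, advantages)
```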
Implications and Future Directions
DARLING demonstrates that explicit optimization for semantic diversity during RL post-training can simultaneously improve both the quality and variety of LLM outputs, contradicting the common assumption of an inherent tradeoff. This has significant implications for applications in creative generation, scientific discovery, and problem solving, where diversity is critical.
The approach is orthogonal to and compatible with inference-time diversity-promoting techniques (e.g., temperature scaling, diverse beam search) and can be combined with them for further gains. The reliance on a learned semantic classifier opens avenues for further research into more generalizable and robust diversity metrics, potentially leveraging contrastive or self-supervised objectives.
The findings suggest that future LLM training pipelines should incorporate diversity-aware objectives, especially for tasks where exploration and creativity are valued. The method's success in both non-verifiable and verifiable domains indicates broad applicability.
Conclusion
DARLING provides a principled and empirically validated framework for jointly optimizing quality and diversity in LLM generations. By leveraging a learned semantic partition function and integrating diversity into the RL reward, the method overcomes diversity collapse and achieves superior performance across a range of tasks and model scales. This work establishes a new standard for diversity-aware LLM post-training and motivates further exploration of semantic-level diversity metrics and their integration into large-scale LLM optimization.