- The paper introduces DQO, a novel training algorithm that jointly optimizes for semantic diversity and response quality in LLMs using DPPs.
- It employs an embedding-based diversity measure calculated via the determinant of a kernel similarity matrix to promote broad semantic coverage.
- Experimental results show that DQO outperforms reward-only methods on multiple tasks, achieving improved pass@n scores and more diverse outputs.
Enhancing Diversity in LLMs via Determinantal Point Processes
Introduction
The paper introduces Diversity Quality Optimization (DQO), a training algorithm for LLMs that directly optimizes for both semantic diversity and output quality using determinantal point processes (DPPs). The motivation stems from the observation that post-training methods such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) often lead to a collapse in output diversity, resulting in models that produce narrow, canonical responses. Existing diversity-promoting methods are typically limited to inference-time interventions or lexical-level regularization, which fail to recover missing semantic modes once the model distribution has collapsed. DQO addresses these limitations by incorporating a principled, embedding-based diversity objective into the training process.
Determinantal Point Processes for Diversity
DQO leverages DPPs to quantify and promote diversity among generated responses. For each prompt, the model samples k responses, embeds them into a semantic space using a pretrained encoder, and computes a kernel similarity matrix over the embeddings. The diversity score is the determinant of this matrix, which geometrically corresponds to the volume spanned by the response embeddings. Because the determinant grows only when the embeddings are close to linearly independent and collapses toward zero when any responses are near-duplicates, diversity is measured at the group level rather than through pairwise distances, pushing the model to spread its responses across the semantic embedding space instead of clustering around a few modes.
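The determinant-based score is straightforward to compute from a batch of embeddings. Below is a minimal sketch, not the authors' implementation, assuming a cosine-similarity (linear) kernel over L2-normalized embeddings; the paper's exact kernel parameterization may differ.

```python
import numpy as np

def dpp_diversity(embeddings: np.ndarray, eps: float = 1e-6) -> float:
    """Group-level diversity of k responses via the log-determinant of their kernel matrix.

    embeddings: (k, d) array of response embeddings from a pretrained encoder.
    """
    # L2-normalize so kernel entries are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kernel = normed @ normed.T                # (k, k) similarity (Gram) matrix
    kernel += eps * np.eye(len(kernel))       # small jitter for numerical safety
    # log|K| is (twice) the log-volume spanned by the normalized embeddings:
    # near-duplicate responses drive it toward -inf, well-separated ones raise it.
    _, logdet = np.linalg.slogdet(kernel)
    return logdet
```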
Quality-Diversity Objective and Optimization
The DQO objective augments the standard reward-based optimization with a diversity term:
$$
J_{\mathrm{Div}}(\pi_\theta) = \mathbb{E}_{x,\, y_{1:k} \sim \pi_\theta(\cdot \mid x)}\!\left[\sum_{i=1}^{k} r(x, y_i) + \alpha \log \det\!\big(L_\phi(y_{1:k})\big)\right]
$$
where $r(x, y_i)$ is the reward for response $y_i$, $\alpha$ controls the trade-off between quality and diversity, and $L_\phi(y_{1:k})$ is the kernel similarity matrix of the response embeddings. The optimal policy samples groups of responses with probability proportional to the determinant of the reward-augmented Gram matrix, balancing semantic separation and response quality.
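As a concrete illustration of the objective above, the following sketch evaluates the group-level score for one prompt. The reward values and the `dpp_diversity` helper from the earlier sketch are assumed for illustration, not taken from the paper's code.

```python
import numpy as np

def dqo_group_score(rewards: np.ndarray, embeddings: np.ndarray, alpha: float) -> float:
    """Sum of per-response rewards plus alpha-weighted log-det diversity for one prompt.

    rewards:    (k,) array with r(x, y_i) for each sampled response y_i.
    embeddings: (k, d) array of the corresponding response embeddings.
    """
    return float(rewards.sum() + alpha * dpp_diversity(embeddings))

# Example usage with stand-in values (384-dim, as produced by MiniLM-style encoders).
rewards = np.array([0.9, 0.8, 0.7, 0.9])
embeddings = np.random.randn(4, 384)
score = dqo_group_score(rewards, embeddings, alpha=0.5)
```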
To address practical challenges such as numerical instability and high gradient variance, the algorithm regularizes the diversity term by adding an identity matrix to the kernel, which bounds the log-determinant. Additionally, a leave-one-out (LOO) gradient estimator is employed to reduce variance, with the eigenvalue interlacing theorem providing theoretical guarantees.
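The two stabilization tricks can be sketched as follows. The exact estimator used in the paper is not reproduced here; this shows one natural leave-one-out decomposition in which each response is credited with the drop in the regularized log-determinant when it is removed (non-negative by eigenvalue interlacing).

```python
import numpy as np

def regularized_logdet(kernel: np.ndarray) -> float:
    # Adding the identity keeps all eigenvalues >= 1, so the log-determinant is
    # bounded below by 0 and never blows up for near-duplicate responses.
    _, logdet = np.linalg.slogdet(kernel + np.eye(len(kernel)))
    return logdet

def leave_one_out_credits(kernel: np.ndarray) -> np.ndarray:
    """Per-response diversity credit: full regularized log-det minus the log-det
    of the kernel with response i removed."""
    full = regularized_logdet(kernel)
    k = len(kernel)
    credits = np.empty(k)
    for i in range(k):
        keep = np.arange(k) != i
        credits[i] = full - regularized_logdet(kernel[np.ix_(keep, keep)])
    return credits
```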
Experimental Results
DQO is evaluated on four tasks: reasoning (GSM8K), summarization (CNN/DailyMail), instruction following (Dolly), and story generation (CommonGen). The model is compared against baselines trained solely for reward optimization (GRPO for reasoning, PPO for the other tasks).
Quality Metrics
DQO consistently matches or exceeds baseline performance on pass@1 and achieves substantial improvements on pass@n for n>1, indicating that the model generates multiple high-quality, diverse responses per prompt.
Figure 1: pass@n performance on GSM8K, demonstrating DQO's superior best-of-n quality compared to reward-only baselines.
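For reference, pass@n is typically estimated with the standard unbiased combinatorial estimator (sample N responses, count the c correct ones); the sketch below is a generic reference implementation, not the paper's evaluation script.

```python
from math import comb

def pass_at_n(num_samples: int, num_correct: int, n: int) -> float:
    """Unbiased estimate of P(at least one correct among n draws), given
    num_samples generations of which num_correct passed."""
    if num_samples - num_correct < n:
        return 1.0
    return 1.0 - comb(num_samples - num_correct, n) / comb(num_samples, n)
```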
Diversity Metrics
DQO outperforms baselines across six diversity metrics, including distinct-n, self-BLEU, self-ROUGE, and LLM-as-a-judge scores. Notably, an LLM judge (GPT-4o-mini) rates DQO-generated outputs as more semantically diverse, confirming that the improvements go beyond lexical variation.
Figure 2: Diversity metrics on GSM8K, showing DQO's consistent advantage over reward-only training.
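As an example of the lexical metrics listed above, distinct-n is simply the fraction of unique n-grams across the sampled responses; a minimal sketch, assuming whitespace tokenization:

```python
def distinct_n(responses: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams across a set of responses (higher = more diverse)."""
    ngrams = []
    for text in responses:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)
```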
Pareto Frontier Analysis
DQO achieves Pareto-optimal trade-offs between quality and diversity across varying training steps and sampling temperatures, consistently occupying the upper-right region of the quality-diversity space relative to baselines.
Figure 3: Pareto frontiers for quality and diversity on Gen and GSM8K, illustrating DQO's robust balance across training and inference settings.
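The Pareto comparison can be reproduced with a simple non-domination filter over (quality, diversity) points, one per checkpoint or sampling temperature; a small sketch, not the paper's plotting code:

```python
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Keep points not dominated by any other point (>= on both axes, > on at least one)."""
    frontier = []
    for i, (q, d) in enumerate(points):
        dominated = any(
            (q2 >= q and d2 >= d) and (q2 > q or d2 > d)
            for j, (q2, d2) in enumerate(points) if j != i
        )
        if not dominated:
            frontier.append((q, d))
    return frontier
```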
Hyperparameter Ablation
Increasing α (diversity weight) and k (number of sampled responses) enhances diversity but can reduce pass@1 if set excessively high. DQO demonstrates robust improvements over baselines across a wide range of hyperparameters, with pass@10 remaining stable even as diversity increases. Larger k incurs additional computational cost but further promotes diversity.
Implementation Considerations
- Embedding Model Selection: The quality of the semantic diversity signal depends on the embedding model used. The paper employs sentence-transformers/all-MiniLM-L6-v2 (see the sketch after this list), but task-specific or adaptive embeddings may yield further improvements.
- Reward Model: DQO relies on reward models that evaluate the entire response, mitigating reward hacking observed with outcome-based rewards.
- Computational Overhead: Sampling multiple responses per prompt and computing determinants increases training cost, especially for large k.
- Numerical Stability: Regularization via identity matrix addition is essential for stable optimization.
- Deployment: DQO is compatible with standard RL-based post-training pipelines and can be integrated with existing reward models and embedding encoders.
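For the embedding step noted in the first bullet above, the sentence-transformers API can be used directly; a minimal sketch of library usage only, not the authors' training code:

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_responses(responses: list[str]):
    """Return (k, 384) L2-normalized embeddings suitable for a cosine kernel."""
    return encoder.encode(responses, normalize_embeddings=True)
```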
Limitations and Future Directions
DQO's reliance on pretrained embeddings introduces sensitivity to embedding quality and alignment with task semantics. Outcome-based rewards remain vulnerable to reward hacking, suggesting the need for more robust reward modeling. Future work may explore adaptive or task-specific diversity metrics, integration with preference-based RL, and scalable determinant computation for large k.
Conclusion
DQO provides a principled framework for jointly optimizing semantic diversity and response quality in LLMs via DPPs. Extensive experiments demonstrate that DQO substantially enhances diversity without sacrificing quality, outperforming reward-only baselines across multiple tasks and metrics. The approach is robust to hyperparameter choices and offers clear geometric and probabilistic interpretations. Limitations related to reward modeling and embedding selection highlight promising avenues for further research in diversity-aware LLM training.