Diversity-SFT: Fine-Tuning with Diversity

Updated 17 June 2026

Diversity-SFT is a supervised fine-tuning paradigm that explicitly measures and enforces diversity across semantic, functional, and token levels to improve model generalization.
It combines data selection techniques such as determinantal point processes (DPPs) and utility-diversity sampling (UDS) with specialized loss functions such as GEM and TOFU to balance quality with diversity.
Empirical results show enhanced robustness, label efficiency, and performance improvements in tasks ranging from instruction tuning to multilingual translation.

Diversity-Supervised Fine-Tuning (Diversity-SFT) is a class of methodologies for fine-tuning large language models (LLMs) that systematically measures, enforces, and exploits data or output diversity as a principal axis of data selection, loss design, or evaluation. Traditional supervised fine-tuning (SFT) often does not explicitly supervise coverage across instruction types, latent concepts, or response modes. Diversity-SFT instead employs formal mechanisms at both the data engineering and algorithmic optimization stages to maximize diversity, under explicit metrics, at several granularities: semantic, functional, token-level, and task-level. The paradigm is motivated by the observation that properly quantified and balanced diversity in SFT data and objectives produces models with improved generalization, robustness, label efficiency, and output breadth, while mitigating mode collapse, overfitting, and catastrophic forgetting.

1. Principles and Taxonomies of Diversity in SFT

Diversity-SFT refers broadly to supervised fine-tuning protocols that deliberately control, sample, or regularize with respect to formal diversity criteria at one or more granularity levels. Several taxonomies have been proposed to clarify the axes and concrete implementations of diversity supervision:

Semantic Levels (Li et al., 30 May 2025):
- Macroscopic diversity: Coverage of high-level instruction intents or semantic topics. Algorithms operationalize this by clustering instruction or response embeddings and enforcing uniform sampling across clusters.
- Mesoscopic diversity: Coverage of tags or subunits (e.g., intent tags, unit-level function cues) extracted via LLMs, then clustered, followed by stratified sampling to cover tag groups.
- Microscopic diversity: Control of token-level properties (e.g., vocabulary coverage, n-gram entropy) in model responses. Sampling is performed so as to maximize type-token ratio, mid-frequency token coverage, or information entropy.
Instructional Coverage, Complexity, and Tagging (Lu et al., 2023):
- Instruction complexity: Defined as average tag count per query.
- Coverage: Fraction of a large, atomic tag vocabulary covered by dataset instructions.
Task/Label Diversity (Arabelly et al., 29 Jul 2025):
- Coverage over predefined or inferred coarse-grained tasks (e.g., translation, question-answering). Allocation of data annotation or selection budget by inverse model confidence per task maximizes information gain and preserves task breadth.
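
A minimal sketch of inverse-confidence budget allocation, assuming the budget is split in proportion to 1 / confidence per task. The proportional rule, the function name `allocate_budget`, and the rounding scheme are illustrative assumptions, not details taken from the cited work.

```python
def allocate_budget(task_confidence, total_budget):
    """Split an integer budget across tasks in proportion to 1 / confidence.

    task_confidence: dict mapping task name -> mean model confidence in (0, 1].
    The proportional rule is an assumption made for illustration.
    """
    inverse = {t: 1.0 / max(c, 1e-8) for t, c in task_confidence.items()}
    norm = sum(inverse.values())
    shares = {t: total_budget * w / norm for t, w in inverse.items()}

    # Round down, then hand leftover units to the largest fractional parts
    # so the allocation always sums to total_budget.
    alloc = {t: int(s) for t, s in shares.items()}
    leftover = total_budget - sum(alloc.values())
    by_fraction = sorted(shares, key=lambda t: shares[t] - alloc[t], reverse=True)
    for t in by_fraction[:leftover]:
        alloc[t] += 1
    return alloc


budget = allocate_budget(
    {"translation": 0.9, "question-answering": 0.6, "summarization": 0.3},
    total_budget=110,
)
print(budget)
```

Low-confidence tasks receive more of the budget, while every task keeps a nonzero share, which is one way to preserve task breadth.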
Representational Diversity (Stap et al., 19 May 2025):
- For non-English or cross-lingual tasks, diversity is measured by the number of distinct language pairs (translation directions) and by the spread in model representational geometry (e.g., SVCCA scores or cluster dispersion).

These approaches may be combined hierarchically, with Diversity-SFT pipelines instantiating several levels, e.g., first ensuring macro-level topic/intent coverage, then mesoscopic subunit variety, and finally micro-level token entropy.

2. Dataset Curation Strategies and Metrics

Central to Diversity-SFT is precise dataset curation and the use of diversity metrics that have predictive value for downstream SFT performance.

Selection Algorithms

Complexity-First Diverse Sampling (Lu et al., 2023):
- Pool annotated data, rank by per-query complexity (tag count).
- Iteratively select examples to maximize new tag introduction at each step, until the full tag vocabulary is saturated or target set size is attained.
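
The two steps above can be sketched as a greedy set-cover loop. This is an illustrative reading of the procedure, not the authors' implementation: the function name, the tie-breaking rule (higher complexity wins), and the toy tags are assumptions.

```python
def complexity_first_select(examples, target_size):
    """Greedy tag-coverage selection.

    examples: list of (example_id, set_of_tags) pairs.
    Returns the selected ids and the set of covered tags.
    """
    # Rank by complexity (tag count); sorted() is stable, so ties keep input order.
    pool = sorted(examples, key=lambda ex: len(ex[1]), reverse=True)
    vocab = set().union(*(tags for _, tags in pool))
    covered, selected = set(), []

    while pool and len(selected) < target_size and covered != vocab:
        # max() returns the first maximal element, so among equal gains
        # the more complex example is preferred.
        best = max(pool, key=lambda ex: len(ex[1] - covered))
        pool.remove(best)
        selected.append(best[0])
        covered |= best[1]
    return selected, covered


toy = [
    ("a", {"math", "code"}),
    ("b", {"math"}),
    ("c", {"translation", "code", "style"}),
    ("d", {"poetry"}),
]
ids, tags = complexity_first_select(toy, target_size=10)
print(ids, sorted(tags))
```

The loop stops as soon as the tag vocabulary is saturated, so the selected subset can be much smaller than the requested target size.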
Determinantal Point Processes (DPPs) with Log-Determinant Distance (Wang et al., 2024):
- Compute compressed feature vectors (e.g., via LoRA-projected weight gradients, followed by JL projections).
- Define the dataset kernel and reference maximally-diverse kernel; diversity is scored by the log-determinant ratio.
- Greedy DPP maximization is used to select subsets, optionally mixing diversity and quality (e.g., output length) via a trade-off parameter.
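
A minimal sketch of greedy log-determinant subset selection with an optional quality term. It assumes a cosine-similarity kernel over nonzero feature rows and an additive trade-off `tradeoff * quality[i]`. Both are illustrative choices, and the sketch does not reproduce the reference-kernel normalization of the log-determinant distance itself.

```python
import numpy as np


def greedy_logdet_select(features, k, quality=None, tradeoff=0.0, eps=1e-6):
    """Greedily pick k rows that maximize log det of the selected kernel block.

    features: (n, d) array with nonzero rows (assumed already compressed).
    quality:  optional length-n scores, mixed in with weight `tradeoff`.
    """
    features = np.asarray(features, dtype=float)
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    # Cosine-similarity kernel; eps * I keeps every sub-block positive definite.
    kernel = unit @ unit.T + eps * np.eye(len(unit))
    q = np.zeros(len(unit)) if quality is None else np.asarray(quality, dtype=float)

    selected = []
    for _ in range(min(k, len(unit))):
        best_i, best_score = -1, -np.inf
        for i in range(len(unit)):
            if i in selected:
                continue
            idx = selected + [i]
            _, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            score = logdet + tradeoff * q[i]
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected


feats = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
picked = greedy_logdet_select(feats, k=2)
print(picked)
```

Rows 0 and 1 are near-duplicates, so the selection never keeps both. Recomputing the determinant for every candidate is cubic in the subset size; it is written this way for clarity rather than speed.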
Online Utility-Diversity Sampling (UDS) (Zou et al., 19 Oct 2025):
- At each training batch, score candidate examples by nuclear norm of their logits (intra-sample diversity and utility) plus mean embedding distance to a buffer of prior diverse samples (inter-sample diversity).
- Select top-K to optimize both objectives with bounded computational cost.
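
The scoring rule described above can be sketched as follows. The Euclidean distance, the unweighted sum controlled by a hypothetical `alpha` parameter, and the function names are assumptions for illustration; the cited method may normalize or combine the two terms differently.

```python
import numpy as np


def uds_scores(logits_list, embeddings, buffer, alpha=1.0):
    """Score candidates by nuclear norm of logits plus mean distance to a buffer.

    logits_list: list of (seq_len, vocab) arrays, one per candidate.
    embeddings:  (n, d) candidate embeddings.
    buffer:      (m, d) embeddings of previously selected samples.
    """
    # Intra-sample term: sum of singular values of each logit matrix.
    intra = np.array([np.linalg.norm(z, ord="nuc") for z in logits_list])
    # Inter-sample term: mean Euclidean distance from each candidate to the buffer.
    dists = np.linalg.norm(embeddings[:, None, :] - buffer[None, :, :], axis=-1)
    inter = dists.mean(axis=1)
    return intra + alpha * inter


def select_top_k(scores, k):
    """Indices of the k highest-scoring candidates, best first."""
    return np.argsort(-scores)[:k]


logits = [np.ones((2, 2)), 2.0 * np.eye(2), np.zeros((2, 2))]
emb = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
buf = np.array([[0.0, 0.0]])
scores = uds_scores(logits, emb, buf)
print(scores, select_top_k(scores, 2))
```

Because both terms come from quantities already available in a forward pass, the per-batch overhead is limited to one SVD per candidate and one distance computation against the buffer.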

Diversity Metrics

Metrics span several granularities (Li et al., 30 May 2025, Lu et al., 2023, Wang et al., 2024):

Level	Metric	Formula/Definition
Macro	Cluster coverage	Fraction of semantic/topic clusters covered
Meso	Tag coverage	Fraction of tag groups covered (e.g., in InsTag tagging)
Micro	Token coverage, Entropy	Unique mid-frequency token ratio, n-gram ratio, self-BLEU
Other	Log-determinant distance	See above; size-normalized Gram-determinant ratio

Correlations are empirically established between these metrics and alignment/generalization on human or semi-automated evaluation suites (e.g., AlpacaEval, MT-Bench, Arena Hard).
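
For concreteness, three of the simpler metrics can be computed as below. This is a sketch using standard definitions (coverage fraction, type-token ratio, Shannon entropy over n-grams); the function names are illustrative and the cited papers may use different tokenization or normalization.

```python
import math
from collections import Counter


def coverage(selected_labels, all_labels):
    """Fraction of clusters (or tag groups) hit by at least one selected example."""
    universe = set(all_labels)
    return len(set(selected_labels) & universe) / len(universe)


def type_token_ratio(tokens):
    """Unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def ngram_entropy(tokens, n=2):
    """Shannon entropy (in bits) of the empirical n-gram distribution."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    total = len(grams)
    return -sum(c / total * math.log2(c / total) for c in Counter(grams).values())


tokens = "a b a b".split()
print(coverage([0, 0, 2], [0, 1, 2, 3]), type_token_ratio(tokens), ngram_entropy(tokens))
```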

3. Loss Functions and Algorithmic Supervision

Diversity-SFT is not confined to data curation; modifications to the SFT loss function are also central.

Game-Theoretic and Information-Theoretic Formulations (GEM, TOFU) (Li et al., 2024, Klypa et al., 30 Apr 2026):
- GEM replaces the forward-KL CE loss with a reverse-KL + entropy penalty, optimizing:
$L_\mathrm{GEM} = -\mathbb{E}_q[\log p_\theta] + \mathbb{E}_{p_\theta^\beta}[\log p_\theta],$

where $p_\theta^\beta$ is a softened model distribution with temperature parameter $\beta$.
- TOFU loss fuses GEM’s temperature-based smoothing with focal scaling to upweight rare or mispredicted tokens, mitigating both ignorance (underfitting of rare tokens) and forgetting (excessive divergence from pretrained prior).
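
The GEM expression written above can be evaluated for a single token position as in the sketch below. It assumes that $p_\theta^\beta$ is obtained by dividing the logits by $\beta$ before the softmax; that reading, the default `beta=0.7`, and the function names are assumptions, and the published objective may differ in detail.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - lse for s in scaled]


def gem_token_loss(logits, target, beta=0.7):
    """-log p(target) + E_{v ~ p^beta}[log p(v)] for one token position."""
    log_p = log_softmax(logits)
    # Assumption: the softened distribution is softmax(logits / beta).
    p_beta = [math.exp(v) for v in log_softmax(logits, temperature=beta)]
    data_term = -log_p[target]
    model_term = sum(pb * lp for pb, lp in zip(p_beta, log_p))
    return data_term + model_term


print(gem_token_loss([2.0, 0.5, -1.0], target=0))
```

With `beta=1.0` the second term equals the negative Shannon entropy of the model distribution, so minimizing the loss trades cross-entropy against keeping the output distribution spread out.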
Selective Entropy Regularization (SED-SFT) (Chen et al., 7 Feb 2026):
- Adds an entropy-maximizing or quadratic diversity-encouraging term to the CE loss, turned on only for token positions identified as having sufficient exploration mass (determined dynamically by top-k token probability sum).
- This balances accuracy with exploration, preventing mode collapse while not sacrificing correctness on deterministic tokens.
Set-Based Forking/Mode Loss for Reasoning (SSFT) (Jia et al., 1 Oct 2025):
- Permutation-invariant set loss matches distinct reasoning traces to "global forking tokens" via optimal assignment (Hungarian algorithm). The objective guarantees that independent reasoning modes are preserved, thus enabling parallel, accurate self-consistency in mathematical or multi-path tasks.
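
The matching step can be illustrated with a small assignment problem. The sketch below uses brute-force enumeration of permutations as a stand-in for the Hungarian algorithm, which is only practical for a handful of traces; in practice a solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice. The cost matrix is a made-up example in which `cost[i][j]` stands for the loss of pairing reasoning trace `i` with forking token `j`.

```python
from itertools import permutations


def min_cost_assignment(cost):
    """Return (assignment, total) minimizing sum of cost[i][assignment[i]].

    cost: square matrix as a list of lists. Brute force over all permutations,
    used here only as a readable substitute for the Hungarian algorithm.
    """
    n = len(cost)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total


toy_cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
assignment, total = min_cost_assignment(toy_cost)
print(assignment, total)
```

The set loss is then the sum of per-pair losses under the chosen assignment, which makes it invariant to the order in which the reference traces are listed.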
Process/Flow Supervision for Recommendation Diversity (Flower) (Gao et al., 10 Mar 2025):
- GFlowNet-based supervision decomposes outcome (item-level) rewards into flows across the generation tree; loss constraints match token transition likelihoods to these flows, guaranteeing coverage and high entropy in the model's generative policy.

4. Empirical Findings and Quantitative Results

Across the benchmarks and settings cited below, Diversity-SFT methods report gains in generalization, robustness, and label efficiency.

Instruction Tuning and General SFT (Lu et al., 2023, Li et al., 30 May 2025, Wang et al., 2024, Klypa et al., 30 Apr 2026):
- In carefully controlled ablations, both per-query complexity and tag coverage monotonically increase MT-Bench score; a 6K InsTag-maximized subset can outperform baselines using 70-125K examples (Lu et al., 2023).
- Microscopic response diversity yields the highest absolute improvement in pairwise evaluation scores (correlation slopes of up to $10.06 \times 10^{-2}$ per percent diversity increase), superior to macroscopic (topic) or mesoscopic (tag) control (Li et al., 30 May 2025).
- DPP-LDD-based pruning increases AlpacaEval win rate by 8–23 points versus random selection at fixed budgets (Wang et al., 2024).
- TOFU loss reduces Self-BLEU while maintaining or improving answer utility and mathematical reasoning success (Klypa et al., 30 Apr 2026).
Label-Efficient and Task-Diverse SFT (Arabelly et al., 29 Jul 2025):
- Weighted task diversity (inverse-confidence budget allocation per task) reaches MMLU scores up to 4 points higher than full-data SFT, with up to 80% annotation savings.
Translation and Cross-Lingual Generalization (Stap et al., 19 May 2025):
- Expanding direction set from 10 to 132 increases COMET-strict from 0.448 to 0.812 (total average), eliminates off-target hallucinations, and increases representational overlap across languages.
- Adding beyond the optimal diversity threshold can reduce supervised pair scores, revealing a performance–diversity trade-off.
Sample-Efficient Online Methods (Zou et al., 19 Oct 2025):
- UDS increases both throughput and accuracy, outperforming full-dataset and prior selection methods by 4–9 percentage points under strong data constraints. Ablations show both intra-sample and inter-sample diversity contribute independent gains.

5. Limitations, Trade-offs, and Best Practices

Diversity-SFT is subject to several practical and theoretical trade-offs:

Excessive diversity, particularly when not matched to model capacity or domain, can produce diminishing or negative returns—evidenced in translation when extending to >250 directions for 7B models (Stap et al., 19 May 2025).
Pure diversity maximization can select outliers at the expense of sample quality; hybrid metrics balancing diversity with output length, utility, or loss are typically optimal (Wang et al., 2024, Zou et al., 19 Oct 2025).
In highly diverse or human-curated corpora, additional DPP-style pruning yields little benefit (Wang et al., 2024).
TOFU and GEM introduce additional hyperparameters (β, γ) that require tuning, though defaults generalize well (Klypa et al., 30 Apr 2026, Li et al., 2024).
Fine-grained taggers (e.g., InsTag) may require significant computational resources for annotation and normalization pipelines, despite high tagging precision and consistency (Lu et al., 2023).
When underlying data lacks clear pre-existing task labels, task-diversity-based data selection may require clustering or heuristic classification, introducing ambiguity or overhead (Arabelly et al., 29 Jul 2025).

Best practices distilled from the literature include the use of microscopic response diversification whenever feasible, careful monitoring for oversaturation, balanced mixing of quality/diversity objectives, and exhaustive documentation of diversity parameters (number of clusters/tags/tokens, entropy, length) to enable robust re-use and benchmarking (Lu et al., 2023, Li et al., 30 May 2025, Klypa et al., 30 Apr 2026, Zou et al., 19 Oct 2025).

6. Outlook and General Significance

Diversity-SFT establishes diversity as a first-class citizen in LLM alignment and generalization research, both as a measurable property and as an explicit optimization goal. Empirically, it enables label-efficient SFT, improved out-of-distribution generalization, higher robustness to overfitting, and better performance under constrained compute. Notably, the field has moved from heuristic diversity proxies (e.g., number of tasks or human intuition) to mathematically principled, scalable, and interpretable metric-driven pipelines.

Recent work demonstrates the applicability of Diversity-SFT to diverse domains: instruction-following, code generation, open-domain chat, mathematical reasoning, multilingual translation, and recommendation systems. Methodologies span data curation (DPPs, UDS, round-robin sampling by confidence), loss design (GEM, TOFU, SED-SFT, SSFT, GFlowNet-based), and representational analyses (SVCCA, cluster overlap). Theoretical and empirical findings indicate ongoing gains from combining diversity maximization with standard SFT and RL-based objectives, subject to the capacity of the backbone model and the granularity of the diversity criterion.

Key open questions include the integration of diversity supervision into reinforcement learning from human feedback (RLHF), dynamic adaptation of diversity budgets, extension to multimodal and retrieval-augmented models, and better methods for balancing diversity against task-specific accuracy or safety constraints. The emerging paradigm considers not only the quantity but also the variety and coverage of learned behaviors as fundamental determinants of LLM performance.