Diversity-Optimized Sampling Strategy

Updated 24 December 2025
  • DOSS is a methodology that enforces diversity on subsampled datasets using metrics like pairwise distance, entropy, and determinantal measures.
  • It employs techniques such as clustering, DPP-based selection, and max–min heuristics to optimize sampling in applications like neural network training and out-of-distribution regularization.
  • Practical implementations of DOSS demonstrate improved coverage, enhanced kernel methods, and robust performance in imbalanced learning and continual learning scenarios.

A Diversity-Optimized Sampling Strategy (DOSS) is a class of methodologies explicitly designed to maximize diversity within subsampled datasets, solution sets, or sampling trajectories, subject to varied domain-specific constraints and performance objectives. DOSS frameworks are deployed in structured constraint solving, large-scale subsampling, neural network training, domain adaptation, out-of-distribution (OOD) regularization, imbalanced data regimes, kernel-based learning, continual learning, and high-dimensional synthesis, among other tasks. The essential property of DOSS is algorithmic enforcement of sampling diversity—operationalized via bespoke diversity metrics—thereby achieving superior coverage, information content, or representational spread compared to naïve or uncertainty-driven sampling.

1. Mathematical Principles and Formulations

Central to DOSS is the rigorous definition of diversity in the target domain and its integration into the sampling protocol. The precise objective and metrics depend on context but commonly include:

  • Pairwise spread: Maximization of the minimum or mean pairwise distance in feature/solution space, e.g., $\sum_{i<j} d(x_i, x_j)$, as in iterative sample selection for augmented datasets (Cavusoglu et al., 2021) and rehearsal-based continual learning (Nokhwal et al., 2023); several of the listed metrics are illustrated numerically in the sketch after this list.
  • Coverage/bit-coverage: For structured spaces such as SMT solutions, diversity is measured by the fraction of unique bit-patterns, e.g., $\mathrm{Coverage}(S) = \frac{1}{|B|}\sum_{b\in B} \mathrm{cov}(b;S)$ in HighDiv for SMT(LIA) sampling (Lai et al., 25 Feb 2025).
  • Entropy/variety: In corpus or categorical data sampling, Shannon or Rényi entropy of token or structural distributions quantifies diversity (Estève et al., 14 Jan 2025).
  • Determinantal metrics: Determinantal Point Processes (DPPs) select size-$k$ subsets with probability proportional to $\det(L_A)$ (for kernel $L$ and subset $A$), emphasizing volume in feature space and negative correlations (Fanuel et al., 2020, Napoli et al., 5 Oct 2024).
  • Rate-distortion diversity: Semantic diversity via $R(\mathbf{Z},\epsilon) = \tfrac{1}{2}\log\det\bigl(I + \alpha \mathbf{Z}\mathbf{Z}^{\top}\bigr)$, subtracting class-conditional rates for task-oriented scores (Chen et al., 2023).
  • Energy distance/uniformity: Energy distance to reference distributions is minimized to ensure global space-filling coverage in unsupervised DOSS (Shang et al., 2022).
  • Max–min objectives: The k-center problem with or without outlier robustness, maximizing minimal distances subject to memory or class constraints (Nokhwal et al., 2023).
  • Multi-objective tradeoffs: Explicit balancing of diversity and relevance, e.g., a multi-objective reward $\mathrm{reward}(S) = f_{\mathrm{diversity}}(S) \times f_{\mathrm{relevance}}(S)$ in recommender DOSS (Bederina et al., 22 Jun 2025), or bilevel optimization maximizing both coverage and predictive accuracy in imbalanced settings (Medlin et al., 12 Jun 2025).
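
To make these objectives concrete, the following minimal NumPy sketch scores a hypothetical three-point subset under three of the metrics above: summed/minimum pairwise distance, Shannon entropy of a categorical labeling, and the determinantal score $\det(L_A)$ under an RBF kernel. The toy data, labels, and bandwidth are illustrative assumptions, not values from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))              # hypothetical candidate pool: 6 points in R^3
A = [0, 2, 5]                            # indices of a candidate subset S

# Pairwise spread: sum and minimum of pairwise Euclidean distances within S
S = X[A]
D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)
iu = np.triu_indices(len(A), k=1)
sum_spread = D[iu].sum()                 # \sum_{i<j} d(x_i, x_j)
min_spread = D[iu].min()                 # the quantity max-min objectives try to enlarge

# Entropy/variety: Shannon entropy of a categorical label distribution over S
labels = np.array([0, 1, 1])             # hypothetical token/structure categories
p = np.bincount(labels) / len(labels)
shannon = -(p[p > 0] * np.log(p[p > 0])).sum()

# Determinantal score: det(L_A) under an RBF kernel L over the full pool
gamma = 0.5                              # assumed bandwidth, not taken from any cited paper
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
L = np.exp(-gamma * sq_dists)
det_LA = np.linalg.det(L[np.ix_(A, A)])  # proportional to the k-DPP probability of S

print(sum_spread, min_spread, shannon, det_LA)
```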

2. Core Algorithmic Methodologies

DOSS implementations employ a variety of algorithmic building blocks, often tuned for computational tractability and domain structure:

  • Clustering-based sampling: K-means or k-means++ clustering yields representatives from separated regions of the sample/configuration space, as in adaptive RL-based DNN compilation (Ahn et al., 2019), OOD mini-batch construction (Jiang et al., 2023), or convex hull–volume maximization in streaming video summarization (Anirudh et al., 2016).
  • Local search with randomized or diversity-aware move operators: HighDiv for SMT(LIA) leverages a boundary-aware move operator (bam) that randomizes variable assignments within feasible intervals, combined with variable-frequency initialization and stochastic CDCL(T) restarts (Lai et al., 25 Feb 2025).
  • DPP-based selection: Exact or approximate DPP sampling increases the probability of diverse landmark or sample selection, with guarantees on coverage and variance reduction (Fanuel et al., 2020, Napoli et al., 5 Oct 2024, Chen et al., 2023).
  • Density-aware weighted resampling: Reweighting candidate samples inversely to estimated local density yields uniform or custom-distribution subsamples, as in GMM-driven DS algorithms (Shang et al., 2022).
  • Greedy max–min or k-center heuristics: Select points by greedily maximizing the minimal distance to current exemplars, optionally filtered for outlier robustness (Nokhwal et al., 2023, Cavusoglu et al., 2021); a minimal sketch follows this list.
  • PCA/projection-based ordered sampling: In text data, PCA eigenspace is exploited to systematically pick points with extreme projections or outlier norms for maximized semantic coverage (Tiwari et al., 12 Mar 2025).
  • Bi-modal or phase-transition strategies: Mode switch from diversity-optimized selection (e.g., DPP/MAP) to uncertainty-based or error-driven sampling after the marginal diversity gain falls below a threshold (Chen et al., 2023).
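
As a concrete illustration of the greedy max–min building block referenced above, the sketch below implements the standard farthest-point (k-center style) heuristic in NumPy. The neighbor-based outlier filtering used in the cited rehearsal method (Nokhwal et al., 2023) would be layered on top and is omitted here; the data and parameters are illustrative.

```python
import numpy as np

def greedy_max_min(X, k, seed=0):
    """Greedy k-center style selection: repeatedly add the candidate whose
    distance to the nearest already-selected exemplar is largest."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]               # random warm start
    d_min = np.linalg.norm(X - X[selected[0]], axis=1)   # distance to nearest exemplar
    for _ in range(k - 1):
        nxt = int(np.argmax(d_min))                      # farthest-from-set candidate
        selected.append(nxt)
        d_min = np.minimum(d_min, np.linalg.norm(X - X[nxt], axis=1))
    return selected

X = np.random.default_rng(1).normal(size=(200, 8))       # toy feature matrix
exemplar_idx = greedy_max_min(X, k=10)
```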

3. Domain-Specific Applications

DOSS is instantiated in diverse scientific and engineering domains:

  • SMT and Constraint Sampling: HighDiv deploys preprocessing, local search with bam, and stochastic CDCL(T) to sample highly diverse models for SMT(LIA), outperforming MeGASampler and SMTSampler(Int) by up to 15–20 percentage points in coverage (Lai et al., 25 Feb 2025).
  • Data Subsampling for Supervised/Unsupervised Learning: The DS algorithm generates nearly i.i.d. uniform subsamples by reweighting candidates by $g(x)/\hat f(x)$, efficiently covering low-density regions that conventional DPP or distance-based samplers miss (Shang et al., 2022); a simplified sketch of this reweighting appears after this list.
  • OOD Regularization: DOS (Diverse Outlier Sampling) clusters feature representations per iteration and selects the most informative outlier from each cluster, yielding 25.79% absolute improvement in average FPR95 on CIFAR-100 over NTOM (Jiang et al., 2023).
  • Imbalanced Data Learning: Multi-objective bilevel DOSS in the MOODS framework combines SVM-SMOTE oversampling, majority-class undersampling, and an $\epsilon/\delta$ diversification metric, producing up to 15% test F1 improvements on UCI datasets (Medlin et al., 12 Jun 2025).
  • Kernel and Nyström Methods: DPP-optimized kernel landmark selection acts as an implicit regularizer, reducing Nyström approximation error and improving worst-case regression performance in low-density regions by up to 25% (Fanuel et al., 2020).
  • Continual and Incremental Learning: DSS utilizes a robust greedy k-center with neighbor filters over t-SNE embeddings to select rehearsal exemplars, controlling catastrophic forgetting with 10–12% higher accuracy than BiC or GDumb (Nokhwal et al., 2023).
  • Streaming Summarization: In online video summarization, DOSS trades off k-means clustering fidelity against convex-hull volume growth under memory constraints, outperforming batch and k-medoids methods in both representation quality and computational cost (Anirudh et al., 2016).
  • Speech Deepfake Detection and Dataset Curation: DOSS-Select (pruning) and DOSS-Weight (domain-weighted sampling) in speech deepfake data aggregation exploit scaling laws of source/generator diversity, reducing average EER by 29% over naïve pooling and reliably transferring to challenging commercial TTS benchmarks (Huang et al., 20 Dec 2025).
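
The density-aware reweighting behind the DS algorithm can be sketched as follows, assuming a uniform target distribution $g$ and a GMM density estimate $\hat f$. The function name, component count, and clipping constant are illustrative choices rather than details from (Shang et al., 2022).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def density_reweighted_subsample(X, m, n_components=10, seed=0):
    """Draw m points with probability proportional to g(x)/f_hat(x), where
    f_hat is a GMM density estimate and the target g is uniform (constant),
    so sparse regions of the space are upweighted toward space-filling coverage."""
    rng = np.random.default_rng(seed)
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    f_hat = np.exp(gmm.score_samples(X))          # estimated density at each candidate
    w = 1.0 / np.clip(f_hat, 1e-12, None)         # uniform g: weight proportional to 1 / f_hat
    w /= w.sum()
    return rng.choice(len(X), size=m, replace=False, p=w)

X = np.random.default_rng(2).normal(size=(5000, 4))
subsample_idx = density_reweighted_subsample(X, m=200)
```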

4. Diversity Metrics and Theoretical Guarantees

Selection and evaluation of DOSS methods depend on measurable diversity metrics; in practice, the quantities enumerated in Section 1 (pairwise spread, coverage, entropy, determinantal volume, rate-distortion, and energy distance) serve both as optimization objectives and as evaluation criteria.

Provable guarantees (e.g., convergence to target distributions, variance reductions over i.i.d. sampling, or submodular maximization bounds) are detailed in context for several methods (Shang et al., 2022, Fanuel et al., 2020, Nokhwal et al., 2023).
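
As a representative instance of this class of guarantees (stated here for orientation, not as a result of any single cited paper): when the diversity objective $f$ is monotone submodular, as coverage-style and facility-location-style objectives are, greedy selection of a size-$k$ subset satisfies the classical approximation bound

$$ f(S_{\mathrm{greedy}}) \;\ge\; \Bigl(1 - \tfrac{1}{e}\Bigr)\, \max_{|S|\le k} f(S), $$

while the exact constants, assumptions, and variance-reduction statements for each DOSS variant are given in the corresponding works.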

5. Practical Implementation and Computational Considerations

Implementing DOSS requires attention to efficiency, scalability, and parameter tuning:

  • Complexity control: Linear- or quasi-linear-time sampling via density estimation or MaxHeap-based selection (Shang et al., 2022, Cavusoglu et al., 2021); exact spectral DPP sampling is cubic in the dataset size, but scalable via MCMC, greedy swap, or kernel approximation (Napoli et al., 5 Oct 2024).
  • Batch and streaming scenarios: DOSS is adapted for streaming (e.g., online convex hull in video summarization) or batch/iterative optimization (e.g., bilevel selection in MOODS) (Anirudh et al., 2016, Medlin et al., 12 Jun 2025).
  • Parameterization: Hyperparameters include the cluster number ($k$ in k-means), the diversity–relevance trade-off weight ($\beta$ or $\lambda$), batch size, DPP kernel bandwidth, entropy order $\alpha$, and phase-transition thresholds (Fanuel et al., 2020, Chen et al., 2023, Anirudh et al., 2016).
  • Initialization and warm-starts: Empirical findings support initializing with random or coverage-driven samples, together with periodic embedding recalibration and interval updates to maintain feature-space diversity (Nokhwal et al., 2023, Napoli et al., 5 Oct 2024).
  • Robustness to class imbalance and noise: DOSS can upweight rare classes (through weighted similarity or k-means++ seeding), filter outliers with neighborhood constraints, and dynamically adapt sampling to maintain balance (Medlin et al., 12 Jun 2025, Nokhwal et al., 2023).
  • Implementation in deep learning pipelines: Features are drawn from fixed or self-supervised embedders, diverse selection is integrated into mini-batch construction for distribution alignment, and samplers tie directly into loss functions (e.g., absent-category or bilevel F1 objectives) (Jiang et al., 2023, Napoli et al., 5 Oct 2024); a minimal sketch of clustering-based mini-batch construction follows this list.
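
The sketch below illustrates the clustering-based mini-batch construction mentioned above (see also Section 2): it partitions frozen embeddings with k-means and draws one example per cluster, so each batch spans well-separated regions of feature space. In the cited OOD method the per-cluster pick is the most informative outlier rather than a uniform draw; all names and sizes here are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_minibatch(features, batch_size, seed=0):
    """Cluster the current feature embeddings and draw one example per cluster."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=batch_size, n_init=10, random_state=seed).fit(features)
    batch = [int(rng.choice(np.flatnonzero(km.labels_ == c))) for c in range(batch_size)]
    return np.array(batch)

feats = np.random.default_rng(3).normal(size=(4096, 128))   # e.g., frozen-embedder outputs
batch_idx = diverse_minibatch(feats, batch_size=64)
```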

6. Empirical Impact and Comparative Results

Empirical comparisons reported across these domains consistently favor DOSS approaches:

  • SMT Sampling: HighDiv raises coverage by 10–20 points over MeGASampler on LIA benchmarks (Lai et al., 25 Feb 2025).
  • Uniform Subsampling: DS algorithm achieves energy distances closest to true uniform and superior low-density region coverage relative to DPP, CADEX, PDS, or scSampler, at reduced computational cost (Shang et al., 2022).
  • OOD Sampling: DOS reduces FPR95 by up to 25.79% over uncertainty-driven baselines on CIFAR-100 and achieves global improvements on ImageNet OOD benchmarks (Jiang et al., 2023).
  • Incremental Learning: DSS attains a 10–12% boost in average accuracy on CIFAR-100 compared with BiC or GDumb (Nokhwal et al., 2023).
  • Imbalanced Classification: MOODS-DOSS increases test F1 by 1–15%, with $\epsilon/\delta$ diversity increases correlating with performance (Medlin et al., 12 Jun 2025).
  • Distribution Alignment: k-DPP and k-means++ minibatch samplers (DOSS) reduce MAPE of MMD estimates by ~30–50%, decrease quantization error by 36–65%, and consistently raise OOD accuracy over class-weighted random sampling (Napoli et al., 5 Oct 2024).
  • Data-Centric Curation: DOSS-Select matches or surpasses large-data baselines with 3% of the data, and DOSS-Weight yields a 29% average EER reduction on public and commercial speech deepfake datasets (Huang et al., 20 Dec 2025).
  • Kernel Methods: DPP or greedy diverse landmark selection lower out-of-distribution prediction error and approximation risk, outperforming uniform or leverage-score-only sampling (Fanuel et al., 2020).

7. Limitations and Open Problems

Limitations of current DOSS methods, as documented, include:

  • Kernel matrix rank and scalability: DPP upper bounds depend on kernel rank; phase transitions in diversity gains are observed, limiting additional benefit beyond initial warm-up (Chen et al., 2023, Napoli et al., 5 Oct 2024).
  • Outlier sensitivity: Aggressive diversity maximization can overweight outliers, potentially reducing downstream accuracy; careful regularization or weighting is required (Fanuel et al., 2020, Napoli et al., 5 Oct 2024).
  • Noise and feature drift: Early in training or with dynamic features, DOSS diversity may reflect noise rather than meaningful coverage; periodic re-embedding or warm-up random batches are advisable (Napoli et al., 5 Oct 2024).
  • Hyperparameter selection and sensitivity: Parameters such as batch size, kernel bandwidth, entropy order, and diversity–relevance weight substantially affect performance and may require task-specific tuning (Medlin et al., 12 Jun 2025, Fanuel et al., 2020).
  • Generalization to new modalities: Extensions to online, kernelized, or continuous attribute DOSS for non-discrete or multilingual data are open research directions (Huang et al., 20 Dec 2025, Chen et al., 2023).

DOSS frameworks systematize and formalize the quest for diverse, information-rich, and balanced samples across domains, supporting theoretical guarantees and demonstrating empirical efficacy against state-of-the-art baselines in multiple modalities. The spectrum of DOSS methodologies encompasses structured search, density weighting, clustering, determinantal processes, and explicit submodular maximization, unified by their orientation toward principled diversity as a central criterion in data selection and sampling strategies (Lai et al., 25 Feb 2025, Shang et al., 2022, Jiang et al., 2023, Nokhwal et al., 2023, Napoli et al., 5 Oct 2024, Huang et al., 20 Dec 2025, Fanuel et al., 2020, Medlin et al., 12 Jun 2025, Anirudh et al., 2016).
