Diversifying Sample Condensation (DISCO)
- DISCO is a paradigm that condenses datasets by maximizing diversity in model responses, focusing on inter-model disagreement to capture crucial information.
- It employs metrics like Predictive Diversity Score and Jensen-Shannon Divergence for sample selection, avoiding costly clustering by using a greedy, information-driven approach.
- DISCO underpins practical advances in benchmarking, continual learning, generative modeling, and graph-based condensation, significantly reducing computational resources.
Diversifying Sample Condensation (DISCO) is a paradigm for selecting or synthesizing highly informative, compact datasets that maximize the diversity of model responses, thereby enabling efficient model evaluation and training across a range of machine learning domains. Unlike approaches that only emphasize variety in data itself, DISCO targets what is most informative for specific downstream tasks, typically by maximizing the diversity in predictions (or gradients) elicited from models when exposed to different data samples. This principle motivates an array of methodologies spanning selection, synthesis, and regularization strategies, with continuous innovation driven by needs in benchmarking, continual learning, generative modeling, and graph-based data compression.
1. Motivations and Principles
The foundation of DISCO lies in the observation that model evaluation, adaptation, and training become prohibitively expensive as datasets and architectures scale. Historically, benchmark evaluation has required scoring massive numbers of samples, with resource demands rising sharply for comprehensive evaluation suites such as LMMs-Eval and HELM. To reduce costs and redundancy, condensation schemes have aimed to select anchor subsets or synthesize representative datasets, often by clustering or statistical matching.
DISCO diverges from this tradition by identifying that the most informative subsets are not those that merely span the data space or reproduce its statistics, but those on which models disagree most sharply in their responses. This shift is formalized by measuring inter-model disagreement, using metrics such as the Predictive Diversity Score (PDS) or Jensen-Shannon Divergence (JSD), with information-theoretic results proving that samples maximizing model disagreement carry maximal mutual information about model differences (Rubinstein et al., 9 Oct 2025). A plausible implication is that focusing on diversity of model responses (rather than just input diversity) yields more efficient and impactful evaluation procedures.
2. Methodological Framework
DISCO implementations generally follow a two-step pipeline: selection and prediction.
Sample Selection
- Model Response Diversity: Top-$k$ samples are chosen based on per-sample model disagreement. For a set of $M$ models and $C$ classes, each sample $x$ is scored by the Predictive Diversity Score (PDS), computed from the class probabilities $p_m(c \mid x)$ predicted by each model $m$; samples on which the models' predicted distributions diverge most strongly receive the highest scores (Rubinstein et al., 9 Oct 2025).
- Jensen-Shannon Divergence: Mutual information among model outputs for each sample $x$ is measured by the generalized Jensen-Shannon divergence,
$$\mathrm{JSD}(x) = H\!\left(\frac{1}{M}\sum_{m=1}^{M} p_m(\cdot \mid x)\right) - \frac{1}{M}\sum_{m=1}^{M} H\!\left(p_m(\cdot \mid x)\right),$$
where $H$ denotes the Shannon entropy. Samples with the highest JSD are greedily selected (see the Python sketch after this list) (Rubinstein et al., 9 Oct 2025).
- Greedy vs. Clustering: DISCO avoids costly clustering schemes in favor of sample-wise greedy selection based on information content, simplifying the process and reducing sensitivity to prior design choices.
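A minimal sketch of the selection step, assuming the source models' class probabilities have been stacked into a single NumPy array; the array shapes, function names, and the plain top-$k$ sort are illustrative simplifications rather than the reference implementation:

```python
import numpy as np

def jsd_per_sample(probs, eps=1e-12):
    """Generalized Jensen-Shannon divergence across models, per sample.

    probs: array of shape (M, N, C) holding predicted class probabilities
           from M source models on N candidate samples over C classes.
    Returns an array of shape (N,) with one disagreement score per sample.
    """
    mean_p = probs.mean(axis=0)                              # (N, C) mixture of model predictions
    h_mean = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)   # entropy of the mixture
    h_each = -(probs * np.log(probs + eps)).sum(axis=-1)     # (M, N) per-model entropies
    return h_mean - h_each.mean(axis=0)                      # JSD = H(mean) - mean of H

def select_disco_subset(probs, k):
    """Greedy top-k selection: keep the k samples on which models disagree most."""
    scores = jsd_per_sample(probs)
    return np.argsort(scores)[::-1][:k]

# Hypothetical usage: probs = np.stack([m.predict_proba(pool) for m in source_models])
# selected = select_disco_subset(probs, k=100)
```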
Performance Prediction
- Model Signatures: Each target model is represented by its concatenated outputs on the selected samples, forming a high-dimensional "signature".
- Regression: A regression model (Random Forest, KNN, or a neural network, optionally after PCA) maps signatures to predicted full-benchmark scores and is trained on existing source models; a minimal sketch follows this list.
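The prediction step could be sketched as follows with scikit-learn; the estimator choice, PCA width, and variable names are assumptions for illustration rather than the paper's exact configuration:

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

def fit_performance_predictor(source_signatures, source_scores, n_components=32):
    """Map model signatures to full-benchmark scores.

    source_signatures: (n_source_models, k * C) concatenated probabilities of
                       each source model on the k selected samples.
    source_scores:     (n_source_models,) full-benchmark scores of those models.
    """
    predictor = make_pipeline(
        PCA(n_components=min(n_components, len(source_signatures) - 1)),  # optional dimensionality reduction
        RandomForestRegressor(n_estimators=200, random_state=0),
    )
    predictor.fit(source_signatures, source_scores)
    return predictor

# A new target model is then scored from its outputs on only the k condensed samples:
# target_signature = target_probs.reshape(1, -1)   # (1, k * C)
# estimated_score = predictor.predict(target_signature)
```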
3. Advances in Dataset Synthesis and Diversity Regularization
Although DISCO was initially motivated by efficient evaluation (Rubinstein et al., 9 Oct 2025), analogous principles have led to innovations in generative condensation, continual learning, and beyond.
- Contrastive Gradient Signals: Condensation with contrastive signals (DCC) collectivizes gradient matching over all classes, capturing interclass distinctions and promoting diverse, discriminative synthetic samples (Lee et al., 2022).
- Intra- and Inter-Class Losses: Generative condensation frameworks introduce intra-class diversity losses (pushing samples within a class apart in feature space) and inter-class margins (pushing class centers apart), yielding distributions that cover the span of each class more fully (Zhang et al., 2023); a hedged sketch of such losses follows this list.
- Distribution Decomposition: Some methods explicitly match both content and style (distinct moments of feature maps) and maximize intra-class KL divergence, ensuring that condensed sets reflect not only semantic centers but also stylistic variability (Malakshan et al., 6 Dec 2024).
- Directed Weight Adjustments: Diversity-Driven Synthesis applies directed perturbations to synthesize batches that mirror broader variations in the training data; decoupling variance regularization from mean alignment ensures that synthesized samples are spread across feature space rather than collapsed onto centroids (Du et al., 26 Sep 2024).
- Graph Condensation: Disentangled condensation of large-scale graphs (the DisCo framework) separates node feature synthesis from edge generation: node features are matched via class-centroid alignment and anchor regularizers, while edges are transferred using pre-trained link predictors, scaling graph condensation to graphs with more than 100M nodes (Xiao et al., 18 Jan 2024).
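As a hedged illustration of how such intra-class spread and inter-class margin regularizers can be written (a generic sketch, not the exact losses of any single paper; the feature extractor, margin, and loss weights are assumed):

```python
import torch
import torch.nn.functional as F

def intra_class_diversity_loss(features):
    """Encourage synthetic samples of one class to spread out in feature space.

    features: (n, d) embeddings of the synthetic samples of a single class, n >= 2.
    Minimizing this term maximizes the mean pairwise distance within the class.
    """
    dists = torch.cdist(features, features, p=2)    # (n, n) pairwise Euclidean distances
    n = features.shape[0]
    mean_off_diag = dists.sum() / (n * (n - 1))     # diagonal entries are zero
    return -mean_off_diag

def inter_class_margin_loss(class_centers, margin=1.0):
    """Hinge penalty keeping class centers at least `margin` apart.

    class_centers: (C, d) mean embedding of the synthetic samples of each class.
    """
    dists = torch.cdist(class_centers, class_centers, p=2)
    off_diag = ~torch.eye(class_centers.shape[0], dtype=torch.bool,
                          device=class_centers.device)
    return F.relu(margin - dists[off_diag]).mean()

# total_loss = matching_loss \
#            + lam_intra * intra_class_diversity_loss(class_feats) \
#            + lam_inter * inter_class_margin_loss(centers)   # weights are placeholders
```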
4. Empirical Gains and Benchmarking
DISCO and its variants report substantial empirical improvements:
- Model Evaluation: Across language and vision benchmarks (MMLU, HellaSwag, ARC, ImageNet), DISCO reduces the test set by >99% with minimal mean absolute error (<1.1%) and high rank correlation (~0.99), outperforming random and anchor-based selection (Rubinstein et al., 9 Oct 2025).
- Synthetic Dataset Training: DCC and generative condensation methods display improved accuracy and sharper class separation, especially on fine-grained and complex datasets (e.g., DCC outperforms DC, DSAC, and random selection on SVHN, CIFAR variants, and fine-grained benchmarks) (Lee et al., 2022, Zhang et al., 2023).
- Graph and Continual Learning: DisCo graph condensation achieves performance within 2–5% of full-graph baselines at tiny reduction rates (0.05–0.2%) and accelerates graph compression by >10× (Xiao et al., 18 Jan 2024).
- Generative Training: D2C condensation for diffusion models enables training with only 0.8% of the data at a 100× speedup, achieving an FID as low as 4.3 with high-fidelity images (Huang et al., 8 Jul 2025).
5. Theoretical Insights
DISCO’s efficacy is grounded in proven information-theoretic relationships:
- Optimal Sample Selection: Selecting samples that maximize the JSD among model predictions is theoretically shown to maximize mutual information between predictive scores and the model's evaluation outcome (Rubinstein et al., 9 Oct 2025); the underlying identity is written out after this list. The linkage between prediction variance and generalization error supports the greedy approach over clustering.
- Discrepancy and Diversity: Condensation frameworks increasingly adopt discrepancy-based objectives, where the condensed empirical measure is optimized to minimize integral probability metrics (IPMs) or Wasserstein distances from the true distribution (Chen et al., 12 Sep 2025). Diversity regularizers are mathematically incorporated via intra- and inter-class losses, adversarial objectives (RobDC), and privacy-aware formulations.
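The identity behind the first point can be stated compactly; the notation below (a uniformly random model index $Z$ and a prediction $Y$ drawn from that model's output distribution on a sample $x$) is ours for illustration, not necessarily the paper's:

```latex
% Let Z ~ Uniform{1, ..., M} index a source model and let Y | Z = m follow
% p_m(. | x), model m's predictive distribution on sample x. Then
\begin{align*}
I(Z; Y) &= H(Y) - H(Y \mid Z) \\
        &= H\!\left(\tfrac{1}{M}\sum_{m=1}^{M} p_m(\cdot \mid x)\right)
           - \tfrac{1}{M}\sum_{m=1}^{M} H\!\left(p_m(\cdot \mid x)\right)
         = \mathrm{JSD}\!\left(p_1(\cdot \mid x), \dots, p_M(\cdot \mid x)\right).
\end{align*}
% Maximizing the per-sample JSD therefore selects the samples whose predictions
% carry the most information about which model produced them.
```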
6. Implications, Applications, and Future Directions
The conceptual and empirical advances in DISCO have cascading implications:
- Evaluation Cost and Sustainability: By condensing the test set to highly informative exemplars, DISCO democratizes benchmarking, enables frequent re-evaluation, and dramatically reduces energy consumption.
- Generalization and Bias Control: Methods that promote response diversity are less likely to overfit idiosyncratic dataset features, potentially improving robustness and transferability.
- Extensibility Across Modalities: DISCO-inspired mechanisms are applicable to image, text, audio, graph, and even generative modeling tasks, with cross-modal versions employing content-style decomposition, semantic conditioning, or graph disentanglement.
- Privacy and Robustness: Discrepancy-based condensation allows injection of privacy-preserving noise and adversarial perturbations into the objective, supporting synthesis of datasets that are robust and privacy-aware (Chen et al., 12 Sep 2025).
A plausible implication is that future work may further integrate multi-modal semantic embeddings, dynamic augmentation, and hierarchical regularizers for broader and more resilient sample condensation. Variants of DISCO may become standard for curriculum generation, continual learning, or model selection pipelines.
7. Code Availability and Resources
Multiple DISCO and related frameworks provide open-source code:
- DISCO model evaluation: https://github.com/arubique/disco-public (Rubinstein et al., 9 Oct 2025)
- Diversity-Driven Synthesis: https://github.com/AngusDujw/Diversity-Driven-Synthesis (Du et al., 26 Sep 2024)
- DisCo graph condensation: https://github.com/BangHonor/DisCo (Xiao et al., 18 Jan 2024)
- Diffusion dataset condensation: https://github.com/AngusDujw/Diversity-Driven-Synthesis (Huang et al., 8 Jul 2025)
These resources support reproducible research, enable extension to new problems, and provide baseline implementations for benchmarking and integration.
In summary, Diversifying Sample Condensation (DISCO) offers a principled, empirically validated, and broadly extensible approach for selecting or synthesizing data subsets that maximize information content as measured by diversity in model responses. The approach is motivated by both computational efficiency and the desire for maximally informative evaluation, and it has spawned a spectrum of techniques and regularizers that optimize for intra-class spread, inter-class discrimination, generalization, robustness, and privacy.