Data-Diverse Drafts (DDD)
- Data-Diverse Drafts (DDD) is a principled strategy that constructs ensembles of models with enforced functional diversity to address underspecified learning tasks and OOD shifts.
- It leverages mutual information minimization and marginal-match regularization to induce diverse yet reliable predictions across different data partitions.
- DDD is applied to both robust classification and large language model uncertainty estimation, achieving state-of-the-art performance with efficient label use.
Data-Diverse Drafts (DDD) refers to a principled strategy for constructing a collection or ensemble of models (“drafts”), where each member is deliberately incentivized to capture semantically distinct but viable solutions to a given learning problem. DDD is motivated by the observation that many datasets are underspecified, meaning that the training data admits multiple equally-low-loss models that behave differently, especially under out-of-distribution (OOD) shift. DDD is employed both in robust hypothesis selection for classification and in scalable epistemic uncertainty quantification, underpinning recent advances in model disambiguation and efficient uncertainty estimation for LLMs.
1. Conceptual Foundation and Motivation
Underspecified settings, in which the labeled source data supports numerous functionally distinct solutions, expose a fundamental risk in classical empirical risk minimization (ERM): learned predictors optimize training loss but can fixate on spurious or unreliable features, leading to poor generalization. Standard ensembling or bagging methods do not remedy this, as they tend to average or select the simplest shortcut explanation.
DDD offers a remedy by enforcing functional diversity among a set of candidate models or “drafts.” In image and language tasks, DDD provides a finite, tractable covering of the Rashomon set of near-optimal hypotheses—those solutions that fit the training data but exploit different predictive features. In LLMs, DDD enables the construction of ensembles from lightweight surrogates for efficient epistemic uncertainty estimation, bypassing prohibitive costs of large-scale ensembling (Lee et al., 2022, Park et al., 2 Feb 2026).
2. Mathematical Formalism and Objectives
Diversification
Let $\mathcal{D}_S$ be the labeled source set and $\mathcal{D}_T$ an unlabeled (often target) set. For each head $f_k$ in an ensemble of $K$ drafts, DDD minimizes the standard training loss

$$\mathcal{L}_{\mathrm{CE}} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{(x,y) \sim \mathcal{D}_S}\big[\ell(f_k(x), y)\big],$$

while simultaneously encouraging functional disagreement on $\mathcal{D}_T$. This is achieved by minimizing the mutual information between head predictions:

$$\mathcal{L}_{\mathrm{MI}} = \sum_{k \neq k'} I\big(\hat{y}_k;\, \hat{y}_{k'}\big),$$

where the empirical output histograms are estimated over $\mathcal{D}_T$. To prevent collapse to trivial uniformly random predictions, a marginal-match regularizer is introduced:

$$\mathcal{L}_{\mathrm{MM}} = \sum_{k=1}^{K} D_{\mathrm{KL}}\big(\bar{p}_k \,\|\, p_Y\big),$$

where $\bar{p}_k$ is head $k$'s average output distribution on $\mathcal{D}_T$ and $p_Y$ is typically the source label distribution. The total loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{MI}}\,\mathcal{L}_{\mathrm{MI}} + \lambda_{\mathrm{MM}}\,\mathcal{L}_{\mathrm{MM}}.$$
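A minimal NumPy sketch of this objective, assuming per-head class-probability arrays on an unlabeled target batch; the function names are illustrative, and the MI term uses the empirical pairwise joint histogram implied by the formulas above:

```python
import numpy as np

def mutual_info(p1, p2, eps=1e-12):
    """MI between two heads' predictions, using the empirical joint
    p(y, y') = E_x[p1(y|x) p2(y'|x)] over an unlabeled target batch.
    p1, p2: (N, C) per-sample class probabilities."""
    joint = (p1[:, :, None] * p2[:, None, :]).mean(axis=0)   # (C, C) joint
    m1, m2 = joint.sum(axis=1), joint.sum(axis=0)            # marginals
    return float(np.sum(joint * (np.log(joint + eps)
                                 - np.log(np.outer(m1, m2) + eps))))

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def ddd_loss(ce_loss, head_probs, p_source, lam_mi=1.0, lam_mm=1.0):
    """Total DDD objective: source cross-entropy + pairwise MI penalty
    on target + marginal-match KL to the source label distribution."""
    K = len(head_probs)
    mi = sum(mutual_info(head_probs[i], head_probs[j])
             for i in range(K) for j in range(i + 1, K))
    mm = sum(kl(hp.mean(axis=0), p_source) for hp in head_probs)
    return ce_loss + lam_mi * mi + lam_mm * mm
```

Two perfectly correlated heads incur an MI penalty of $\log 2$ on a balanced binary batch, while a head that is statistically independent of the others contributes no MI penalty.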
Variance Proxies for Uncertainty
Within LLMs, DDD is used to efficiently estimate epistemic uncertainty (EU). Let $f_1, \dots, f_K$ be LoRA-adapted draft models, each trained on a disjoint subset of distillation data from a teacher $T$. For a new input $x$:
- The mixture $\bar{p}(y \mid x) = \frac{1}{K} \sum_{k=1}^{K} p_k(y \mid x)$ is computed.
- The variance proxy is given by the generalized Jensen-Shannon divergence $$\mathrm{JSD}(x) = \frac{1}{K} \sum_{k=1}^{K} D_{\mathrm{KL}}\big(p_k(\cdot \mid x) \,\|\, \bar{p}(\cdot \mid x)\big).$$
- The bias proxy is $$D_{\mathrm{KL}}\big(\bar{p}(\cdot \mid x) \,\|\, p_T(\cdot \mid x)\big).$$

DDD's aim is to maximize meaningful disagreement (JSD) among drafts while keeping $\bar{p}$ close to the teacher (Park et al., 2 Feb 2026).
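Both proxies are cheap to compute from the drafts' output distributions; a small NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def uncertainty_proxies(draft_probs, teacher_probs):
    """draft_probs: (K, V) next-token distributions from the K drafts.
    Returns (variance_proxy, bias_proxy): the generalized JSD among the
    drafts, and the KL from the draft mixture to the teacher."""
    mix = draft_probs.mean(axis=0)                    # mixture distribution
    jsd = float(np.mean([kl(p, mix) for p in draft_probs]))  # variance proxy
    bias = kl(mix, teacher_probs)                     # bias proxy
    return jsd, bias
```

Identical drafts yield a JSD of zero (no epistemic disagreement), while drafts that place mass on different tokens drive the variance proxy up even when their mixture matches the teacher.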
3. Algorithmic Implementation
DDD for Underspecification
- Initialization: Construct a shared backbone with $K$ heads.
- Stage 1 (Diversify): Alternate training on source data (cross-entropy) and enforcing diversity on target data (mutual information penalty + marginal-match regularizer).
- Stage 2 (Disambiguate):
- Choose among the $K$ drafts with minimal extra supervision by:
- Active querying: label the most-disagreeing unlabeled target points.
- Random querying: label a random subset.
- Source-inspection: compare saliency/Grad-CAM visualizations to select the head focusing on the true feature.
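The active-querying variant of Stage 2 can be sketched as follows; the function and its hard per-head predictions are hypothetical illustrations of the idea, not the paper's interface:

```python
import numpy as np

def active_disambiguate(head_preds, oracle_labels, budget=8):
    """head_preds: (K, N) hard predictions from K draft heads on N
    unlabeled target points; oracle_labels: (N,) ground truth, revealed
    only for queried points. Scores each point by how many distinct
    predictions the heads make, queries the `budget` most-disagreeing
    points, and returns the index of the head with fewest errors there."""
    K, N = head_preds.shape
    disagreement = np.array([len(set(head_preds[:, n])) for n in range(N)])
    queried = np.argsort(-disagreement)[:budget]          # most contested points
    errors = (head_preds[:, queried] != oracle_labels[queried]).sum(axis=1)
    return int(np.argmin(errors))
```

Because labels are spent only where the heads disagree, a handful of queries typically suffices to separate the head tracking the true feature from heads tracking shortcuts.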
DDD for LLM Uncertainty
- Distillation Data Preparation: Sample $R$ outputs per example $x$ from the teacher $T$ to form dataset $\mathcal{D} = \{(x, y^i)\}$.
- Partitioning: Split $\mathcal{D}$ into $S$ disjoint partitions, each containing every $x$ but a different, non-overlapping subset of its $R$ $y$-samples.
- Draft Training: For each of $M$ drafts per partition, LoRA-fine-tune from $T$ or a base model with a Kullback-Leibler loss relative to $T$'s predictions.
- Inference: Compute $\bar{p}$ and the JSD for each input using only the $K = S \cdot M$ drafts.
DDD LLM Pseudocode (abbreviated from Park et al., 2 Feb 2026):

```
for x in distillation_data:
    sample R outputs y^1, ..., y^R from T
    build D = {(x, y^i)}
partition D into S subsets, each with unique y-samples for each x
for each partition:
    for each of M drafts:
        LoRA-fine-tune the draft on its partition with KL loss to T
return all K = S*M drafts
```
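The partitioning step above can be made concrete; `sample_fn` here is a hypothetical stand-in for actual teacher sampling, and the even slicing of the $R$ samples is one simple way to satisfy the disjointness requirement:

```python
def build_partitions(distill_data, sample_fn, R=6, S=2):
    """distill_data: list of prompts x; sample_fn(x, R): draws R teacher
    outputs for x (hypothetical stand-in for real sampling). Returns S
    disjoint partitions: every prompt appears in each partition, but
    with a different, non-overlapping slice of its R sampled outputs."""
    assert R % S == 0, "R teacher samples must split evenly across S partitions"
    partitions = [[] for _ in range(S)]
    for x in distill_data:
        ys = sample_fn(x, R)                  # R teacher outputs for x
        for s in range(S):                    # disjoint y-slices per partition
            for y in ys[s * (R // S):(s + 1) * (R // S)]:
                partitions[s].append((x, y))
    return partitions
```

Each partition thus sees the full prompt distribution but a distinct view of the teacher's output distribution, which is the sole source of ensemble diversity in DDD.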
4. Empirical Evaluation and Case Studies
Classification with Underspecification
Empirical results on a range of classification benchmarks demonstrate DDD’s impact:
- In 2D synthetic data, DDD's heads reliably recover all possible decision boundaries, with individual heads achieving ≈90% OOD accuracy, while the ERM baseline is stuck at 50% (Lee et al., 2022).
- On MNIST-CIFAR collage benchmarks, DDD with four heads reconstructs all four ground-truth classifiers simultaneously.
- On complete-correlation datasets (Waterbirds-CC, CelebA-CC), DDD identifies robust, feature-based classifiers given as few as 4–16 labeled queries from the target domain.
- In medical imaging (CXR-14, Camelyon17-WILDS), DDD enables the separation of spurious shortcut features (e.g., the presence of a drain) from the true pathology, leading to improved worst-group AUC and OOD accuracy (up to 90%).
Epistemic Uncertainty in LLMs
- On GSM8K, DDD ensembles (8B teacher distilled into 3B drafts) roughly halve RMSE (0.2036 vs. 0.31–0.3266) compared to No-Train or GKD/MiniLLM baselines.
- In hallucination detection, DDD achieves AUROC of 0.7839 (matching TokUR’s 0.7823) but with 25% lower computational overhead.
- DDD’s performance remains robust (RMSE 0.4023–0.4437) with draft sizes as small as 1B parameters.
- Data partition ablations (1×3, 2×3, and 3×3 divisions of the draft ensemble) indicate that 2×3 (as in DDD) yields near-optimal error at reduced training cost.
| Setting | RMSE (↓) | AUROC (↑) | Cost (× target) |
|---|---|---|---|
| No-Train (RMSE) | 0.3266 | — | — |
| DDD (3B draft, 2×3) | 0.2036 | 0.7839 | 1.08 |
| TokUR (8B halluc.) | — | 0.7823 | 1.00 |
| Draft Untrained | — | 0.7853 | 0.75 |
DDD is observed to deliver state-of-the-art token-level epistemic uncertainty and robust OOD classification (Park et al., 2 Feb 2026, Lee et al., 2022).
5. Theoretical Properties and Label Efficiency
- Coverage: In synthetic classification, ensembles of $K$ drafts span nearly the entire set of near-optimal hypotheses, providing a compressed model of the Rashomon set.
- Label Complexity: The number of labeled queries required to identify the correct draft head with high probability depends on the risk gap between the best head and its competitors; in practice it is often just 10–20 labels.
- Variance–Bias Decomposition: DDD implements the variance proxy (generalized JSD among the drafts) and allows explicit separation of bias and variance in epistemic uncertainty estimation. This is theoretically grounded in the expected KL-divergence decomposition $$\mathbb{E}_k\big[D_{\mathrm{KL}}(p_k \,\|\, p_T)\big] = D_{\mathrm{KL}}(\bar{p} \,\|\, p_T) + \mathbb{E}_k\big[D_{\mathrm{KL}}(p_k \,\|\, \bar{p})\big].$$
- Label Efficiency: DDD achieves robust classification or uncertainty estimation with as few as 1–16 labels in post-hoc disambiguation.
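The bias–variance decomposition of the expected KL divergence is an exact identity for any reference distribution, which a quick numerical check confirms (the draft and teacher distributions here are purely illustrative):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical draft next-token distributions and a teacher reference q.
drafts = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
q = np.array([0.3, 0.4, 0.3])

mix = drafts.mean(axis=0)
lhs = np.mean([kl(p, q) for p in drafts])                  # E_k KL(p_k || q)
rhs = kl(mix, q) + np.mean([kl(p, mix) for p in drafts])   # bias + variance
assert abs(lhs - rhs) < 1e-9
```

This is why JSD (the variance term) and the mixture-to-teacher KL (the bias term) together account for exactly the drafts' average divergence from the teacher.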
6. Practical Considerations and Implementation Guidance
- Draft Model Size: 3B-parameter drafts (2×3) offer optimal accuracy (RMSE 0.20); 1B drafts (2×3) reduce inference cost by 42% at minor accuracy loss.
- Partitioning: The default of $S \times M = 2 \times 3$ (total $K = 6$ drafts) balances diversity with efficiency. Increasing $K$ yields diminishing returns in error reduction.
- Training Overhead: Each draft trains for just one epoch on its partition; overall runtime is comparable to LoRA-fine-tuning a single model.
- Inference Overhead: $K$ draft forward passes plus the mixture computation; with 1B drafts, this is approximately 58% of the target model's inference cost.
- Diversity Enforcement: Unlike injected-noise ensembles, DDD’s diversity comes exclusively from data partitioning; no parameter noise or randomization is applied during training.
7. Limitations and Open Questions
- Capacity-Dependent Performance: The effectiveness of DDD can degrade as draft model size shrinks, though it still outperforms other data-partition and LoRA-noise approaches.
- Partitioning Strategies: The method for data partitioning (random, stratified, by difficulty, etc.) may affect ensemble disagreement and thus performance. Automatic or adaptive partitioning is not yet explored.
- Data Efficiency: DDD requires multiple teacher outputs per example for effective draft diversity; optimizing example selection or investigating active strategies remains open.
- Dynamic Ensembles: Application of DDD to online or adaptive settings where draft membership evolves over time is an open research direction (Park et al., 2 Feb 2026).
A plausible implication is that DDD serves as a general template for robust prediction and efficient uncertainty estimation in both settings plagued by underspecification and by the computational limits of large-scale deep ensemble methods. It converts the multiplicity of viable explanations from a vulnerability to a resource, generating compact yet powerful ensembles that robustly handle OOD shifts and efficiently capture epistemic uncertainty.