Data-Diverse Drafts (DDD)
- Data-Diverse Drafts (DDD) is a principled strategy that constructs ensembles of models with enforced functional diversity to address underspecified learning tasks and OOD shifts.
- It leverages mutual information minimization and marginal-match regularization to induce diverse yet reliable predictions across different data partitions.
- DDD is applied to both robust classification and large language model uncertainty estimation, achieving state-of-the-art performance with efficient label use.
Data-Diverse Drafts (DDD) refers to a principled strategy for constructing a collection or ensemble of models (“drafts”), where each member is deliberately incentivized to capture semantically distinct but viable solutions to a given learning problem. DDD is motivated by the observation that many datasets are underspecified, meaning that the training data admits multiple equally-low-loss models that behave differently, especially under out-of-distribution (OOD) shift. DDD is employed both in robust hypothesis selection for classification and in scalable epistemic uncertainty quantification, underpinning recent advances in model disambiguation and efficient uncertainty estimation for LLMs.
1. Conceptual Foundation and Motivation
Underspecified settings, in which the labeled source data supports numerous functionally distinct solutions, expose a fundamental risk in classical empirical risk minimization (ERM): learned predictors optimize training loss but can fixate on spurious or unreliable features, leading to poor generalization. Standard ensembling or bagging methods do not remedy this, as they tend to average or select the simplest shortcut explanation.
DDD offers a remedy by enforcing functional diversity among a set of candidate models or “drafts.” In image and language tasks, DDD provides a finite, tractable covering of the Rashomon set of near-optimal hypotheses—those solutions that fit the training data but exploit different predictive features. In LLMs, DDD enables the construction of ensembles from lightweight surrogates for efficient epistemic uncertainty estimation, bypassing prohibitive costs of large-scale ensembling (Lee et al., 2022, Park et al., 2 Feb 2026).
2. Mathematical Formalism and Objectives
Diversification
Let $\mathcal{D}_S$ be the labeled source set and $\mathcal{D}_T$ an unlabeled (often target) set. For each head $f_k$ in an ensemble of $K$ drafts, DDD minimizes the standard training loss

$$\mathcal{L}_{\mathrm{CE}} = \frac{1}{K} \sum_{k=1}^{K} \mathbb{E}_{(x,y) \sim \mathcal{D}_S}\big[\ell(f_k(x), y)\big],$$

while simultaneously encouraging functional disagreement on $\mathcal{D}_T$. This is achieved by minimizing the mutual information between head predictions:

$$\mathcal{L}_{\mathrm{MI}} = \sum_{k \neq k'} I\big(\hat{y}_k;\, \hat{y}_{k'}\big),$$

where the empirical output histograms are estimated over $\mathcal{D}_T$. To prevent collapse to trivial uniformly random predictions, a marginal-match regularizer is introduced:

$$\mathcal{L}_{\mathrm{MM}} = \sum_{k=1}^{K} D_{\mathrm{KL}}\big(\bar{p}_k \,\|\, p_Y\big),$$

where $\bar{p}_k$ is head $k$'s average output distribution on $\mathcal{D}_T$ and $p_Y$ is typically the source label distribution. The total loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda_{\mathrm{MI}}\,\mathcal{L}_{\mathrm{MI}} + \lambda_{\mathrm{MM}}\,\mathcal{L}_{\mathrm{MM}}.$$
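A minimal NumPy sketch of this objective, assuming per-head class-probability arrays on an unlabeled target batch; the function names are illustrative, and the MI term uses the empirical pairwise joint histogram implied by the formulas above:

```python
import numpy as np

def mutual_info(p1, p2, eps=1e-12):
    """MI between two heads' predictions, using the empirical joint
    p(y, y') = E_x[p1(y|x) p2(y'|x)] over an unlabeled target batch.
    p1, p2: (N, C) per-sample class probabilities."""
    joint = (p1[:, :, None] * p2[:, None, :]).mean(axis=0)   # (C, C) joint
    m1, m2 = joint.sum(axis=1), joint.sum(axis=0)            # marginals
    return float(np.sum(joint * (np.log(joint + eps)
                                 - np.log(np.outer(m1, m2) + eps))))

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def ddd_loss(ce_loss, head_probs, p_source, lam_mi=1.0, lam_mm=1.0):
    """Total DDD objective: source cross-entropy + pairwise MI penalty
    on target + marginal-match KL to the source label distribution."""
    K = len(head_probs)
    mi = sum(mutual_info(head_probs[i], head_probs[j])
             for i in range(K) for j in range(i + 1, K))
    mm = sum(kl(hp.mean(axis=0), p_source) for hp in head_probs)
    return ce_loss + lam_mi * mi + lam_mm * mm
```

Two perfectly correlated heads incur an MI penalty of $\log 2$ on a balanced binary batch, while a head that is statistically independent of the others contributes no MI penalty.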
Variance Proxies for Uncertainty
Within LLMs, DDD is used to efficiently estimate epistemic uncertainty (EU). Let $f_1, \dots, f_K$ be LoRA-adapted draft models, each trained on a disjoint subset of distillation data from a teacher $T$. For a new input $x$:
- The mixture $\bar{p}(y \mid x) = \frac{1}{K} \sum_{k=1}^{K} p_k(y \mid x)$ is computed.
- The variance proxy is given by the generalized Jensen-Shannon divergence $$\mathrm{JSD}(x) = \frac{1}{K} \sum_{k=1}^{K} D_{\mathrm{KL}}\big(p_k(\cdot \mid x) \,\|\, \bar{p}(\cdot \mid x)\big).$$
- The bias proxy is $$D_{\mathrm{KL}}\big(\bar{p}(\cdot \mid x) \,\|\, p_T(\cdot \mid x)\big).$$

DDD's aim is to maximize meaningful disagreement (JSD) among drafts while keeping $\bar{p}$ close to the teacher (Park et al., 2 Feb 2026).
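Both proxies are cheap to compute from the drafts' output distributions; a small NumPy sketch (function names are illustrative, not from the paper):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def uncertainty_proxies(draft_probs, teacher_probs):
    """draft_probs: (K, V) next-token distributions from the K drafts.
    Returns (variance_proxy, bias_proxy): the generalized JSD among the
    drafts, and the KL from the draft mixture to the teacher."""
    mix = draft_probs.mean(axis=0)                    # mixture distribution
    jsd = float(np.mean([kl(p, mix) for p in draft_probs]))  # variance proxy
    bias = kl(mix, teacher_probs)                     # bias proxy
    return jsd, bias
```

Identical drafts yield a JSD of zero (no epistemic disagreement), while drafts that place mass on different tokens drive the variance proxy up even when their mixture matches the teacher.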
3. Algorithmic Implementation
DDD for Underspecification
- Initialization: Construct a shared backbone with $K$ heads.
- Stage 1 (Diversify): Alternate training on source data (cross-entropy) and enforcing diversity on target data (mutual information penalty + marginal-match regularizer).
- Stage 2 (Disambiguate):
- Choose among the $K$ drafts with minimal extra supervision by:
- Active querying: label the most-disagreeing unlabeled target points.
- Random querying: label a random subset.
- Source-inspection: compare saliency/Grad-CAM visualizations to select the head focusing on the true feature.
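The active-querying variant of Stage 2 can be sketched as follows; the function and its hard per-head predictions are hypothetical illustrations of the idea, not the paper's interface:

```python
import numpy as np

def active_disambiguate(head_preds, oracle_labels, budget=8):
    """head_preds: (K, N) hard predictions from K draft heads on N
    unlabeled target points; oracle_labels: (N,) ground truth, revealed
    only for queried points. Scores each point by how many distinct
    predictions the heads make, queries the `budget` most-disagreeing
    points, and returns the index of the head with fewest errors there."""
    K, N = head_preds.shape
    disagreement = np.array([len(set(head_preds[:, n])) for n in range(N)])
    queried = np.argsort(-disagreement)[:budget]          # most contested points
    errors = (head_preds[:, queried] != oracle_labels[queried]).sum(axis=1)
    return int(np.argmin(errors))
```

Because labels are spent only where the heads disagree, a handful of queries typically suffices to separate the head tracking the true feature from heads tracking shortcuts.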
DDD for LLM Uncertainty
- Distillation Data Preparation: Sample $R$ outputs per example $x$ from the teacher $T$ to form dataset $\mathcal{D} = \{(x, y^i)\}$.
- Partitioning: Split $\mathcal{D}$ into $S$ disjoint partitions, each containing every $x$ but a different, non-overlapping subset of its $R$ $y$-samples.
- Draft Training: For each of $M$ drafts per partition, LoRA-fine-tune from $T$ or a base model with a Kullback-Leibler loss relative to $T$'s predictions.
- Inference: Compute $\bar{p}$ and the JSD for each input using only the $K = S \cdot M$ drafts.
DDD LLM Pseudocode (abbreviated from Park et al., 2 Feb 2026):

```
for x in distillation_data:
    sample R outputs y^1, ..., y^R from T
    build D = {(x, y^i)}
partition D into S subsets, each with unique y-samples for each x
for each partition:
    for each of M drafts:
        LoRA-fine-tune the draft on its partition with KL loss to T
return all K = S*M drafts
```
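The partitioning step above can be made concrete; `sample_fn` here is a hypothetical stand-in for actual teacher sampling, and the even slicing of the $R$ samples is one simple way to satisfy the disjointness requirement:

```python
def build_partitions(distill_data, sample_fn, R=6, S=2):
    """distill_data: list of prompts x; sample_fn(x, R): draws R teacher
    outputs for x (hypothetical stand-in for real sampling). Returns S
    disjoint partitions: every prompt appears in each partition, but
    with a different, non-overlapping slice of its R sampled outputs."""
    assert R % S == 0, "R teacher samples must split evenly across S partitions"
    partitions = [[] for _ in range(S)]
    for x in distill_data:
        ys = sample_fn(x, R)                  # R teacher outputs for x
        for s in range(S):                    # disjoint y-slices per partition
            for y in ys[s * (R // S):(s + 1) * (R // S)]:
                partitions[s].append((x, y))
    return partitions
```

Each partition thus sees the full prompt distribution but a distinct view of the teacher's output distribution, which is the sole source of ensemble diversity in DDD.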
4. Empirical Evaluation and Case Studies
Classification with Underspecification
Empirical results on a range of classification benchmarks demonstrate DDD’s impact:
- In 2D synthetic data, DDD's heads reliably recover all possible decision boundaries, with individual heads achieving ≈90% OOD accuracy, while the ERM baseline is stuck at 50% (Lee et al., 2022).
- On MNIST-CIFAR collage benchmarks, DDD with four heads reconstructs all four ground-truth classifiers simultaneously.
- On complete-correlation datasets (Waterbirds-CC, CelebA-CC), DDD identifies robust, feature-based classifiers given as few as 4–16 labeled queries from the target domain.
- In medical imaging (CXR-14, Camelyon17-WILDS), DDD enables the separation of spurious shortcut features (e.g., the presence of a drain) from the true pathology, leading to improved worst-group AUC and OOD accuracy (up to 90%).
Epistemic Uncertainty in LLMs
- On GSM8K, DDD ensembles (8B teacher distilled into 3B drafts) roughly halve RMSE (0.2036 vs. 0.31–0.3266) compared to No-Train or GKD/MiniLLM baselines.
- In hallucination detection, DDD achieves AUROC of 0.7839 (matching TokUR’s 0.7823) but with 25% lower computational overhead.
- DDD’s performance remains robust (RMSE 0.4023–0.4437) with draft sizes as small as 1B parameters.
- Data partition ablations (1×3, 2×3, and 3×3 divisions of the draft ensemble) indicate that 2×3 (as in DDD) yields near-optimal error at reduced training cost.
| Setting | RMSE (↓) | AUROC (↑) | Cost (× target) |
|---|---|---|---|
| No-Train (RMSE) | 0.3266 | — | — |
| DDD (3B draft, 2×3) | 0.2036 | 0.7839 | 1.08 |
| TokUR (8B halluc.) | — | 0.7823 | 1.00 |
| Draft Untrained | — | 0.7853 | 0.75 |
DDD is observed to deliver state-of-the-art token-level epistemic uncertainty and robust OOD classification (Park et al., 2 Feb 2026, Lee et al., 2022).
5. Theoretical Properties and Label Efficiency
- Coverage: In synthetic classification, ensembles of $K$ drafts span nearly the entire set of near-optimal hypotheses, providing a compressed model of the Rashomon set.
- Label Complexity: The number of labeled queries required to identify the correct draft head with high probability depends on the risk gap between the best head and its competitors; in practice it is often just 10–20 labels.
- Variance–Bias Decomposition: DDD implements the variance proxy (generalized JSD among the drafts) and allows explicit separation of bias and variance in epistemic uncertainty estimation. This is theoretically grounded in the expected KL-divergence decomposition $$\mathbb{E}_k\big[D_{\mathrm{KL}}(p_k \,\|\, p_T)\big] = D_{\mathrm{KL}}(\bar{p} \,\|\, p_T) + \mathbb{E}_k\big[D_{\mathrm{KL}}(p_k \,\|\, \bar{p})\big].$$
- Label Efficiency: DDD achieves robust classification or uncertainty estimation with as few as 1–16 labels in post-hoc disambiguation.
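The bias–variance decomposition of the expected KL divergence is an exact identity for any reference distribution, which a quick numerical check confirms (the draft and teacher distributions here are purely illustrative):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical draft next-token distributions and a teacher reference q.
drafts = np.array([[0.6, 0.3, 0.1],
                   [0.2, 0.5, 0.3],
                   [0.1, 0.2, 0.7]])
q = np.array([0.3, 0.4, 0.3])

mix = drafts.mean(axis=0)
lhs = np.mean([kl(p, q) for p in drafts])                  # E_k KL(p_k || q)
rhs = kl(mix, q) + np.mean([kl(p, mix) for p in drafts])   # bias + variance
assert abs(lhs - rhs) < 1e-9
```

This is why JSD (the variance term) and the mixture-to-teacher KL (the bias term) together account for exactly the drafts' average divergence from the teacher.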
6. Practical Considerations and Implementation Guidance
- Draft Model Size: 3B-parameter drafts (2×3) offer optimal accuracy (RMSE 0.20); 1B drafts (2×3) reduce inference cost by 42% at minor accuracy loss.
- Partitioning: The default of $S \times M = 2 \times 3$ (total $K = 6$ drafts) balances diversity with efficiency. Increasing $K$ yields diminishing returns in error reduction.
- Training Overhead: Each draft trains for just one epoch on its partition; overall runtime is comparable to LoRA-fine-tuning a single model.
- Inference Overhead: $K$ draft forward passes plus the mixture computation; with 1B drafts, this is approximately 58% of the target model's inference cost.
- Diversity Enforcement: Unlike injected-noise ensembles, DDD’s diversity comes exclusively from data partitioning; no parameter noise or randomization is applied during training.
7. Limitations and Open Questions
- Capacity-Dependent Performance: The effectiveness of DDD can degrade as draft model size shrinks, though it still outperforms other data-partition and LoRA-noise approaches.
- Partitioning Strategies: The method for data partitioning (random, stratified, by difficulty, etc.) may affect ensemble disagreement and thus performance. Automatic or adaptive partitioning is not yet explored.
- Data Efficiency: DDD requires multiple teacher outputs per example for effective draft diversity; optimizing example selection or investigating active strategies remains open.
- Dynamic Ensembles: Application of DDD to online or adaptive settings where draft membership evolves over time is an open research direction (Park et al., 2 Feb 2026).
A plausible implication is that DDD serves as a general template for robust prediction and efficient uncertainty estimation in both settings plagued by underspecification and by the computational limits of large-scale deep ensemble methods. It converts the multiplicity of viable explanations from a vulnerability to a resource, generating compact yet powerful ensembles that robustly handle OOD shifts and efficiently capture epistemic uncertainty.