Data-Diverse Drafts Strategy for LLM Uncertainty
- The DDD strategy constructs diverse ensembles of lightweight draft models by partitioning synthetic data to estimate epistemic uncertainty in LLMs.
- It leverages a bias–variance decomposition, with Jensen-Shannon divergence as the variance proxy, for more accurate token-level uncertainty estimation.
- DDD reduces inference cost and improves hallucination detection by efficiently approximating posterior predictive distributions via Online Stochastic Distillation.
The Data-Diverse Drafts (DDD) strategy is an epistemic uncertainty (EU) estimation method designed for LLMs, addressing the computational infeasibility of direct ensembling at modern model scales. DDD constructs an ensemble of lightweight “draft” models, each trained on a distinct synthetic data partition to maximize diversity among their predictive distributions. This diversity enables more accurate and efficient token-level EU estimation by leveraging a theoretical bias–variance decomposition and practical knowledge distillation mechanisms, such as Online Stochastic Distillation (OSD), while maintaining minimal inference overhead (Park et al., 2 Feb 2026).
1. Theoretical Underpinnings and Motivation
Epistemic uncertainty (EU) in LLMs quantifies the mutual information between predicted outputs and model parameters. Formally, for a target model $p_T$,

$$\mathrm{EU}(y \mid x) = \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\, D_{\mathrm{KL}}\!\left( p(y \mid x, \theta) \,\|\, \bar{p}(y \mid x) \right) \right], \qquad \bar{p}(y \mid x) = \mathbb{E}_{\theta}\!\left[ p(y \mid x, \theta) \right],$$

where $\theta$ denotes a sample from the posterior over model weights and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. Direct computation or deep ensembling of $p_T$ is prohibitive due to model size.
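As a toy illustration (not from the paper), the mutual-information form of EU can be estimated numerically by drawing discrete predictive distributions as stand-ins for posterior samples and averaging their KL divergence to the posterior predictive; all names and sizes below are illustrative:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (natural log)."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
S, V = 32, 10                                  # S posterior samples, vocabulary size V
samples = rng.dirichlet(np.ones(V), size=S)    # p(y|x, theta_s), one row per sample
p_bar = samples.mean(axis=0)                   # posterior predictive (Bayesian model average)
eu = np.mean([kl(p, p_bar) for p in samples])  # EU = E_theta[ KL(p_theta || p_bar) ]
```

With identical samples the estimate would be zero; disagreement among posterior samples is exactly what drives the score up.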
Speculative decoding uses compact “draft” models, $q_k$, to approximate the predictive distribution of the target $p_T$. However, naive training leads to mode collapse: drafts concentrate on a single mode of the target, resulting in negligible inter-draft disagreement and overconfident EU estimates. DDD resolves this by enforcing draft diversity through disjoint data subsets, promoting coverage of distinct “views” of the target’s posterior predictive landscape. The variance among draft outputs (measured by Jensen-Shannon divergence) thus serves as a proxy for epistemic uncertainty, with data partitioning ensuring that the approximation does not collapse when the target’s support is large or multimodal (Park et al., 2 Feb 2026).
2. Formal Definitions and Bias–Variance Decomposition
Central to DDD is a bias–variance decomposition enabled by the following constructs:
- Draft mixture: $\bar{q}(y \mid x) = \frac{1}{K} \sum_{k=1}^{K} q_k(y \mid x)$
- Jensen-Shannon divergence (variance proxy): $\mathrm{JSD}(q_1, \ldots, q_K) = \frac{1}{K} \sum_{k=1}^{K} D_{\mathrm{KL}}\!\left( q_k \,\|\, \bar{q} \right)$
- Bias proxy: $D_{\mathrm{KL}}\!\left( \bar{q} \,\|\, \bar{p} \right)$, the divergence between the draft mixture and the target's posterior predictive $\bar{p}$
Combining these, the expectation over drafts satisfies:

$$\frac{1}{K} \sum_{k=1}^{K} D_{\mathrm{KL}}\!\left( q_k \,\|\, \bar{p} \right) = \underbrace{D_{\mathrm{KL}}\!\left( \bar{q} \,\|\, \bar{p} \right)}_{\text{bias}} + \underbrace{\mathrm{JSD}(q_1, \ldots, q_K)}_{\text{variance}}.$$
Under the Proxy Posterior Assumption (each trained draft $q_k$ acts as an approximate posterior sample $p(\cdot \mid x, \theta_k)$), these terms yield an efficient, bias–variance resolved estimator of EU. Data diversity directly increases the JSD term, critically improving variance-based EU estimation, particularly for out-of-distribution queries (Park et al., 2 Feb 2026).
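The decomposition above is an exact algebraic identity for discrete distributions, which can be verified numerically. The sketch below uses random Dirichlet draws as stand-ins for drafts and the target predictive; all names are illustrative:

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
K, V = 6, 10                                   # six drafts, toy vocabulary of ten tokens
drafts = rng.dirichlet(np.ones(V), size=K)     # q_k(.|x), one row per draft
target = rng.dirichlet(np.ones(V))             # stand-in for the posterior predictive

q_bar = drafts.mean(axis=0)                    # draft mixture
jsd = np.mean([kl(q, q_bar) for q in drafts])  # variance proxy
bias = kl(q_bar, target)                       # bias proxy
lhs = np.mean([kl(q, target) for q in drafts]) # mean per-draft divergence from target

# The decomposition holds exactly: mean KL = bias + variance.
assert abs(lhs - (bias + jsd)) < 1e-9
```

Because the identity is exact, any increase in draft diversity (the JSD term) is reflected one-for-one in the total divergence budget.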
3. DDD Construction and Training Procedure
The DDD algorithm proceeds as follows:
- Synthetic Data Preparation: Given a target LLM $p_T$ (e.g., Llama-8B) and a base dataset (GSM8K), generate a synthetic corpus $\mathcal{D}_{\mathrm{syn}}$, where each prompt is expanded into multiple target-generated outputs via low-rank noise injection.
- Partitioning: Divide $\mathcal{D}_{\mathrm{syn}}$ into $N$ disjoint partitions $\mathcal{D}_1, \ldots, \mathcal{D}_N$ of equal size but distinct target responses.
- Draft Model Initialization and Training: For each partition $\mathcal{D}_n$ and draft index $k = 1, \ldots, K$, initialize a draft $q_{n,k}$ from a fixed student checkpoint. Train each draft on $\mathcal{D}_n$ using Online Stochastic Distillation (OSD) to minimize the distillation objective $\mathbb{E}_{x \sim \mathcal{D}_n}\!\left[ D_{\mathrm{KL}}\!\left( p_T(\cdot \mid x) \,\|\, q_{n,k}(\cdot \mid x) \right) \right]$. No noise injection is applied during initialization for DDD; diversity is achieved via data partitioning alone.
- Inference and EU Computation: For an input $x$:
- Forward $x$ through all $N \times K$ drafts to obtain $\{ q_{n,k}(\cdot \mid x) \}$.
- Compute the mixture $\bar{q}(\cdot \mid x)$ and the variance proxy (JSD).
- Compute the bias proxy via $D_{\mathrm{KL}}\!\left( \bar{q}(\cdot \mid x) \,\|\, p_{\mathrm{OSD}}(\cdot \mid x) \right)$, where $p_{\mathrm{OSD}}$ is an OSD-trained, single-model proxy for the target's posterior predictive.
- Sum the two proxies to yield token-level EU scores.
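The inference steps above can be sketched end-to-end. The function below is a minimal illustration, assuming per-token next-token distributions are already available from each draft and from the OSD proxy; the function name and array shapes are assumptions, not the paper's implementation:

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def token_eu(draft_probs, osd_probs):
    """Token-level EU: per-position JSD across drafts plus KL to the OSD proxy.

    draft_probs: (K, T, V) next-token distributions from K drafts
    osd_probs:   (T, V) distributions from the single OSD proxy
    returns:     (T,) uncertainty score per token position
    """
    q_bar = draft_probs.mean(axis=0)                              # (T, V) draft mixture
    scores = []
    for t in range(q_bar.shape[0]):
        jsd = np.mean([kl(q[t], q_bar[t]) for q in draft_probs])  # variance proxy
        bias = kl(q_bar[t], osd_probs[t])                         # bias proxy
        scores.append(jsd + bias)
    return np.array(scores)

rng = np.random.default_rng(0)
K, T, V = 6, 4, 12                             # drafts, token positions, vocabulary
drafts = rng.dirichlet(np.ones(V), size=(K, T))
osd = rng.dirichlet(np.ones(V), size=T)
eu_scores = token_eu(drafts, osd)              # one EU score per token position
```

Since each draft is small, the full forward pass over all $N \times K$ drafts plus the proxy remains far cheaper than multiple passes through the target model.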
A summary of implementation-specific hyperparameters appears below.
| Setting | Value (typical) | Comment |
|---|---|---|
| Data Partitions ($N$) | 2 | |
| Drafts per Partition ($K$) | 3 | |
| Model Size (drafts) | 1B or 3B | |
| Synthetic Dataset | GSM8K (target-generated answers) | |
| DDD vs. Baseline | DDD RMSE: 0.2036; Baseline RMSE: 0.3266 (8B→3B) | 37.7% RMSE reduction (Park et al., 2 Feb 2026) |
4. Integration with Online Stochastic Distillation (OSD)
Online Stochastic Distillation is a mechanism for approximating the Bayesian model average with a single proxy $p_{\mathrm{OSD}}$. The OSD loss function is:

$$\mathcal{L}_{\mathrm{OSD}}(\phi) = \mathbb{E}_{x}\,\mathbb{E}_{\epsilon}\!\left[ D_{\mathrm{KL}}\!\left( p_{\theta + \epsilon}(\cdot \mid x) \,\|\, p_{\phi}(\cdot \mid x) \right) \right],$$

where $\epsilon$ denotes low-rank noise on the target's parameters $\theta$. At each mini-batch, low-rank noise is injected stochastically into $p_T$'s parameters, and $p_{\mathrm{OSD}}$ is trained to match the corresponding stochastic output. Over the OSD training process, $p_{\mathrm{OSD}}$ converges to the Bayesian model average $\mathbb{E}_{\theta}\!\left[ p(\cdot \mid x, \theta) \right]$. During inference, KL divergences for the bias-proxy term are computed as $D_{\mathrm{KL}}\!\left( \bar{q} \,\|\, p_{\mathrm{OSD}} \right)$ rather than against an explicit ensemble average, eliminating the need for expensive multi-pass evaluation of $p_T$ (Park et al., 2 Feb 2026).
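A minimal numpy sketch of one OSD training loop, assuming a linear "logit layer" stands in for the target and proxy networks; the weight names, noise scale, and learning rate are illustrative, not the paper's values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, V = 8, 16                          # feature dim, toy vocabulary size
W_target = rng.normal(size=(V, d))    # frozen target "logit layer"
W_proxy = rng.normal(size=(V, d))     # trainable OSD proxy
lr, sigma = 0.1, 0.1                  # learning rate, noise scale

for step in range(500):
    x = rng.normal(size=d)
    # rank-1 stochastic perturbation of the target's parameters
    u, v = rng.normal(size=(V, 1)), rng.normal(size=(1, d))
    p = softmax((W_target + sigma * u @ v) @ x)   # noisy target output
    q = softmax(W_proxy @ x)                      # proxy output
    # gradient of KL(p || q) w.r.t. the proxy logits is (q - p)
    W_proxy -= lr * np.outer(q - p, x)
```

Averaging over many noise draws during training is what lets the single proxy stand in for the model average at inference time.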
5. Comparative Performance and Empirical Evaluation
DDD demonstrates marked improvements in EU estimation and downstream hallucination detection at significantly reduced computational cost. For GSM8K:
- Uncertainty Estimation: DDD decreases RMSE by 37.7% over the draft baseline (0.2036 vs. 0.3266 in the 8B→3B configuration) and exhibits high rank correlation (Spearman ρ = 0.9165). In the 8B→1B configuration, DDD achieves a 22.4% reduction in RMSE relative to the baseline.
- Hallucination Detection: DDD matches or slightly exceeds the AUROC, ECE, and Brier scores of perturbation-based methods such as TokUR (AUROC 0.7839 vs. 0.7823; ECE 0.0576 vs. 0.0652) while incurring only the inference cost of the 3B drafts.
- Ablations: Performance drops if data partitioning is omitted or replaced with parameter noise alone; explicit splitting (e.g., 2×3: two partitions with three drafts each) is essential for maximizing draft diversity and thus the informativeness of the variance proxy.
6. Significance, Limitations, and Broader Implications
The Data-Diverse Drafts strategy offers a scalable, low-overhead alternative for uncertainty quantification in LLMs. By systematically maximizing diversity between draft models, DDD circumvents the inherent pitfalls of mode collapse in ensemble approximations and delivers sharp, bias–variance decomposed EU estimates. The approach enables competitive hallucination detection at minimal cost and is amenable to integration across autoregressive LLM pipelines requiring risk-aware outputs. Empirical analysis confirms the necessity of data-driven partitioning for draft construction, as noise-based or “K only” ensemble approaches underperform in both accuracy and reliability.
A plausible implication is that the DDD paradigm may generalize to other distributional approximation tasks where support coverage and tractable variance estimation are essential. However, DDD’s efficacy is sensitive to both the number of data partitions and the level of diversity inherent in the synthetic target-generated data. Further investigation is needed to characterize the strategy’s robustness across different LLM architectures, data domains, and levels of target model uncertainty (Park et al., 2 Feb 2026).