Data-Diverse Drafts Strategy for LLM Uncertainty

Updated 9 February 2026
  • The DDD strategy constructs diverse ensembles of lightweight draft models by partitioning synthetic data to estimate epistemic uncertainty in LLMs.
  • It leverages a bias–variance decomposition, with Jensen–Shannon divergence as the variance proxy, for more accurate token-level uncertainty estimation.
  • DDD reduces inference cost and improves hallucination detection by efficiently approximating posterior predictive distributions via Online Stochastic Distillation.

The Data-Diverse Drafts (DDD) strategy is an epistemic uncertainty (EU) estimation method for LLMs, addressing the computational infeasibility of direct ensembling at modern model scales. DDD constructs an ensemble of lightweight “draft” models, each trained on a distinct synthetic data partition to maximize diversity among their predictive distributions. This diversity enables more accurate and efficient token-level EU estimation, grounded in a theoretical bias–variance decomposition and realized through practical knowledge distillation mechanisms such as Online Stochastic Distillation (OSD), while maintaining minimal inference overhead (Park et al., 2 Feb 2026).

1. Theoretical Underpinnings and Motivation

Epistemic uncertainty (EU) in LLMs quantifies the mutual information between predicted outputs and model parameters. Formally, for a target model $p_T$,

$$\text{EU} = H(p_T) - E_{\theta \sim \pi_T}[H(p_\theta)] = E_{\theta \sim \pi_T}[\mathrm{KL}(p_\theta \,\|\, p_T)],$$

where $p_\theta$ denotes a sample from the posterior $\pi_T$ over model weights and $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback–Leibler divergence. Direct computation or deep ensembling of $p_T$ is prohibitive due to model size.

Speculative decoding uses compact “draft” models, $q_1, \ldots, q_K$, to approximate the predictive distribution of $p_T$. However, naive training leads to mode collapse—drafts concentrate on a single mode of the target, resulting in negligible inter-draft disagreement and overconfident EU estimates. DDD resolves this by enforcing draft diversity through disjoint data subsets, promoting coverage of distinct “views” of the target’s posterior predictive landscape. The variance among draft outputs (measured by Jensen–Shannon divergence) thus serves as a proxy for epistemic uncertainty, with data partitioning ensuring that the approximation does not collapse when the target’s support is large or multimodal (Park et al., 2 Feb 2026).
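The mode-collapse failure can be made concrete with a toy numerical sketch (the distributions below are illustrative, not taken from the paper): when all drafts concentrate on the same mode, the generalized Jensen–Shannon divergence vanishes, so the variance proxy reports no epistemic uncertainty, whereas drafts covering different modes produce a clearly positive signal.

```python
import math

def kl(p, q):
    """KL(p || q) for categorical distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(drafts):
    """Generalized Jensen-Shannon divergence: mean KL of each draft to the mixture."""
    K = len(drafts)
    mix = [sum(d[i] for d in drafts) / K for i in range(len(drafts[0]))]
    return sum(kl(d, mix) for d in drafts) / K

# Mode-collapsed drafts: all concentrate on the same token -> no disagreement.
collapsed = [[0.9, 0.05, 0.05]] * 3
# Diverse drafts: each covers a different mode of the target's support.
diverse = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]

print(jsd(collapsed))  # ~0: variance proxy vanishes, EU is underestimated
print(jsd(diverse))    # clearly positive: disagreement signals epistemic uncertainty
```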

2. Formal Definitions and Bias–Variance Decomposition

Central to DDD is a bias–variance decomposition enabled by the following constructs:

  • Draft mixture: $q_{\mathrm{mix}}(y) = \frac{1}{K} \sum_{k=1}^K q_k(y)$
  • Jensen–Shannon divergence (variance proxy):

$$\mathrm{JSD}(q_1, \ldots, q_K) = \frac{1}{K} \sum_{k=1}^K \mathrm{KL}(q_k \,\|\, q_{\mathrm{mix}})$$

  • Bias proxy: $\mathrm{KL}(q_{\mathrm{mix}} \,\|\, p_T)$

Combining these, the expectation over drafts satisfies:

$$E_k[\mathrm{KL}(q_k \,\|\, p_T)] = \mathrm{JSD}(q_1, \ldots, q_K) + \mathrm{KL}(q_{\mathrm{mix}} \,\|\, p_T)$$

Under the Proxy Posterior Assumption ($E_{\theta \sim \pi_T}[\cdot] \approx E_{k \sim \mathrm{Uniform}(1, K)}[\cdot]$), these terms yield an efficient, bias–variance resolved estimator of EU. Data diversity directly increases the JSD term, critically improving variance-based EU estimation, particularly for out-of-distribution queries (Park et al., 2 Feb 2026).
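The decomposition is an exact algebraic identity, which a short numerical check makes easy to verify (the target and draft distributions below are arbitrary illustrative values over a small vocabulary):

```python
import math

def kl(p, q):
    """KL(p || q) for categorical distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical target and draft next-token distributions over a 4-token vocabulary.
p_T = [0.4, 0.3, 0.2, 0.1]
drafts = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1], [0.2, 0.2, 0.5, 0.1]]

K = len(drafts)
q_mix = [sum(d[i] for d in drafts) / K for i in range(len(p_T))]

lhs = sum(kl(q, p_T) for q in drafts) / K    # E_k[KL(q_k || p_T)]
jsd = sum(kl(q, q_mix) for q in drafts) / K  # variance proxy
bias = kl(q_mix, p_T)                        # bias proxy
assert abs(lhs - (jsd + bias)) < 1e-12       # decomposition holds exactly
```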

3. DDD Construction and Training Procedure

The DDD algorithm proceeds as follows:

  1. Synthetic Data Preparation: Given a target LLM $p_T$ (e.g., Llama-8B) and a base dataset (GSM8K), generate a synthetic corpus $D = \{(x_i, \{y_i^{(1)}, \ldots, y_i^{(M)}\})\}$, where each $x_i$ is expanded to $M$ target-generated outputs via low-rank noise injection.
  2. Partitioning: Divide $D$ into $S$ disjoint partitions $D_1, \ldots, D_S$ of equal size but with distinct target responses.
  3. Draft Model Initialization and Training: For each partition $s = 1, \ldots, S$ and $m = 1, \ldots, M$, initialize a draft $q_{(s,m)}$ from a fixed student checkpoint. Train each draft on $D_s$ using Online Stochastic Distillation (OSD) to minimize

$$L_{\mathrm{OSD}} = E_{(x, y) \in D_s}\, E_{\theta \sim \pi_T}[\mathrm{KL}(p_\theta(\cdot \mid x) \,\|\, q_{(s,m)}(\cdot \mid x))]$$

No noise injection is applied during initialization for DDD; diversity is achieved via data partitioning alone.

  4. Inference and EU Computation: For an input $x$:
    • Forward $x$ through all $K = S \cdot M$ drafts to obtain $q_1, \ldots, q_K$.
    • Compute $q_{\mathrm{mix}}$ and the variance proxy (JSD).
    • Compute the bias proxy via $\mathrm{KL}(q_{\mathrm{mix}} \,\|\, p_{\mathrm{mix}})$, where $p_{\mathrm{mix}}$ is an OSD-trained, single-model proxy for $p_T$.
    • Sum the two proxies to yield token-level EU scores.
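The inference-time computation can be sketched as follows, assuming each draft exposes a full next-token distribution at the current position (all distributions below are illustrative placeholders, not outputs of the paper's models):

```python
import math

def kl(p, q):
    """KL(p || q) for categorical distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def token_eu(draft_dists, p_mix):
    """Token-level EU score: variance proxy (JSD) + bias proxy KL(q_mix || p_mix).

    draft_dists: K next-token distributions from the drafts at this position.
    p_mix: next-token distribution from the OSD-trained proxy for the target.
    """
    K = len(draft_dists)
    V = len(p_mix)
    q_mix = [sum(d[i] for d in draft_dists) / K for i in range(V)]
    jsd = sum(kl(q, q_mix) for q in draft_dists) / K
    return jsd + kl(q_mix, p_mix)

# Illustrative K = 6 drafts (S = 2 partitions x M = 3 drafts) over a 3-token vocab.
drafts = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.7, 0.2, 0.1],
          [0.2, 0.6, 0.2], [0.1, 0.7, 0.2], [0.3, 0.5, 0.2]]
p_mix = [0.4, 0.4, 0.2]
print(token_eu(drafts, p_mix))  # larger when drafts disagree or miss the proxy target
```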

A summary of implementation-specific hyperparameters appears below.

| Setting | Value (typical) | Comment |
| --- | --- | --- |
| Data partitions ($S$) | 2 | |
| Drafts per partition ($M$) | 3 | $K = 6$ total drafts |
| Model size (drafts) | 1B or 3B | |
| Synthetic dataset | GSM8K | $8500 \times 4$ target-generated answers |
| DDD vs. baseline | DDD RMSE 0.2036; baseline RMSE 0.3266 (8B → 3B) | 37.7% RMSE reduction (Park et al., 2 Feb 2026) |

4. Integration with Online Stochastic Distillation (OSD)

Online Stochastic Distillation is a mechanism for approximating the Bayesian model average of $p_T$ with a single proxy $p_{\mathrm{mix}}$. The OSD loss function is:

$$L_{\mathrm{OSD}}(\phi) = E_{x \in D}\, E_{\theta \sim \pi_T}\left[\mathrm{KL}(p_\theta(\cdot \mid x) \,\|\, p_{\mathrm{mix}}(\cdot \mid x; \phi))\right]$$

At each mini-batch, low-rank noise is injected stochastically into $p_T$'s parameters, and $p_{\mathrm{mix}}$ is trained to match the corresponding stochastic output. Over the course of OSD training, $p_{\mathrm{mix}}$ converges to $p_T$. During inference, KL divergences for the bias-proxy term are computed as $\mathrm{KL}(q_{\mathrm{mix}} \,\|\, p_{\mathrm{mix}})$ rather than $\mathrm{KL}(q_{\mathrm{mix}} \,\|\, p_T)$, eliminating the need for expensive multi-pass evaluation of $p_T$ (Park et al., 2 Feb 2026).
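The training dynamic can be illustrated with a minimal sketch: repeatedly perturb a toy teacher's weight matrix with rank-1 noise and average the resulting output distributions; the OSD objective trains $p_{\mathrm{mix}}$ to match these stochastic outputs in expectation, so its fixed point is this Monte-Carlo average. All sizes, values, and the noise scale here are illustrative assumptions, not the paper's implementation.

```python
import math
import random

random.seed(0)

V, H = 4, 3                      # toy vocabulary and hidden sizes
W = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.1], [0.2, 0.1, 0.4], [-0.3, 0.2, 0.0]]
h = [1.0, 0.5, -0.5]             # a fixed hidden state for one token position

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def perturbed_dist(scale=0.3):
    """One stochastic teacher sample: add rank-1 noise u v^T to W, then softmax."""
    u = [random.gauss(0, scale) for _ in range(V)]
    v = [random.gauss(0, scale) for _ in range(H)]
    logits = [sum((W[i][j] + u[i] * v[j]) * h[j] for j in range(H)) for i in range(V)]
    return softmax(logits)

# Averaging many perturbed outputs gives the distribution p_mix is regressed onto.
N = 2000
samples = [perturbed_dist() for _ in range(N)]
p_mix_target = [sum(s[i] for s in samples) / N for i in range(V)]
print(p_mix_target)  # a single-model stand-in for the ensemble average of p_theta
```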

5. Comparative Performance and Empirical Evaluation

DDD demonstrates marked improvements in EU estimation and downstream hallucination detection at significantly reduced computational cost. For GSM8K:

  • Uncertainty Estimation: DDD decreases RMSE by 37.7% over the draft baseline (0.2036 vs. 0.3266 at 8B → 3B, $K = 6$) and exhibits high rank correlation (Spearman 0.9165). At 8B → 1B, DDD achieves a 22.4% reduction in RMSE relative to baseline.
  • Hallucination Detection: DDD matches or slightly exceeds the AUROC, ECE, and Brier scores of perturbation-based methods such as TokUR (AUROC 0.7839 vs. 0.7823; ECE 0.0576 vs. 0.0652) while incurring only about 0.75× the inference cost for 3B drafts.
  • Ablations: Performance drops if data partitioning is omitted or replaced with parameter noise alone; explicit splitting (e.g., 2 × 3) is essential for maximizing draft diversity and thus the informativeness of the variance proxy.

6. Significance, Limitations, and Broader Implications

The Data-Diverse Drafts strategy offers a scalable, low-overhead alternative for uncertainty quantification in LLMs. By systematically maximizing diversity between draft models, DDD circumvents the inherent pitfalls of mode collapse in ensemble approximations and delivers sharp, bias–variance decomposed EU estimates. The approach enables competitive hallucination detection at minimal cost and is amenable to integration across autoregressive LLM pipelines requiring risk-aware outputs. Empirical analysis confirms the necessity of data-driven partitioning for draft construction, as noise-based or “K only” ensemble approaches underperform in both accuracy and reliability.

A plausible implication is that the DDD paradigm may generalize to other distributional approximation tasks where support coverage and tractable variance estimation are essential. However, DDD’s efficacy is sensitive to both the number of data partitions and the level of diversity inherent in the synthetic target-generated data. Further investigation is needed to characterize the strategy’s robustness across different LLM architectures, data domains, and levels of target model uncertainty (Park et al., 2 Feb 2026).
