
Dirichlet Evidential Distillation

Updated 7 March 2026
  • The paper introduces Dirichlet evidential distillation, transferring teacher ensemble predictions and uncertainty to student models via a Dirichlet output head and LoRA adaptations.
  • It employs a novel loss function that aligns both the predictive mean and variance, enhancing calibration and uncertainty quantification across benchmarks.
  • The method achieves over a 17× inference speedup while maintaining competitive accuracy and robust out-of-distribution detection.

Dirichlet evidential distillation is a methodology for transferring both predictive performance and uncertainty quantification from computationally intensive teacher models—such as prompt-ensembles or Bayesian approaches—to efficient student models in LLM architectures. This approach utilizes evidential learning with a Dirichlet output head and LoRA-based fine-tuning to achieve single-pass, uncertainty-aware inference. The method is designed to replicate not only the predictive mean of the teacher ensemble but also its uncertainty structure, enabling robust evaluation of both in-domain confidence and out-of-distribution (OOD) detection performance, while reducing inference cost by over an order of magnitude (Nemani et al., 24 Jul 2025).

1. Architectural and Optimization Foundations

Both the evidential and standard distillation protocols begin with a shared architecture: a pre-trained LLM, specifically Mistral-7B v0.3, forms the backbone for both teacher and student. The student is updated not by modifying all weights, but by introducing trainable low-rank adapters (LoRA). Adapters are inserted into both the attention/output projection layers and the final classification head, with the parameterization $W \leftarrow W + \Delta W$, where $\Delta W = BA$, $A \in \mathbb{R}^{r \times k}$, $B \in \mathbb{R}^{d \times r}$, and $r \ll \min(d, k)$. Only $\{A, B\}$ are trained during distillation, with all original weights frozen. This setup achieves fine-tuning in 1–5 epochs and typically requires only about 0.1% of the full model weights as trainable parameters, significantly reducing memory and compute demands.
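
The low-rank parameterization above can be illustrated with a minimal sketch in plain Python. The function names and the example dimensions (a 4096 × 4096 projection with rank 4) are assumptions chosen for illustration, not values taken from the paper's code. The sketch counts the trainable entries of one adapted matrix and applies $W + BA$ to a vector without ever forming $\Delta W$ explicitly.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora, ratio) parameter counts for one d x k weight matrix."""
    if not 0 < r <= min(d, k):
        raise ValueError("rank r must satisfy 0 < r <= min(d, k)")
    full = d * k        # entries of the frozen matrix W
    lora = r * (d + k)  # B is d x r and A is r x k
    return full, lora, lora / full


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x):
    """Compute (W + B A) x as W x + B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]


# Hypothetical sizes: one 4096 x 4096 projection adapted at rank 4.
full, lora, ratio = lora_param_counts(4096, 4096, 4)
print(f"full={full}, lora={lora}, trainable fraction={ratio:.4%}")
```

The ratio here is per adapted matrix; the share of the whole model depends on how many matrices receive adapters.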

2. Dirichlet Output Parameterization and Uncertainty Decomposition

The core innovation of Dirichlet evidential distillation is the replacement of the standard categorical (softmax) output layer with a head that predicts the parameters of a Dirichlet distribution, encoding the full posterior uncertainty. Given logits $z_c$, the model computes concentration parameters as

$$\alpha_c = 1 + \mathrm{softplus}(z_c), \quad \alpha_0 = \sum_{c=1}^K \alpha_c.$$

Probabilities are then given by the mean of the Dirichlet: $\mathbb{E}[p_c] = \frac{\alpha_c}{\alpha_0}$. The total uncertainty decomposes into:

  • Total predictive entropy:

$$H[Y \mid x] = -\sum_{c=1}^K \frac{\alpha_c}{\alpha_0} \log \frac{\alpha_c}{\alpha_0}$$

  • Aleatoric entropy (data uncertainty):

$$\mathbb{E}_p[H[Y \mid p]] = -\sum_{c=1}^K \frac{\alpha_c}{\alpha_0}\left[\psi(\alpha_c + 1) - \psi(\alpha_0 + 1)\right]$$

  • Epistemic uncertainty (mutual information):

$$I[Y, p \mid x] = H[Y \mid x] - \mathbb{E}_p[H[Y \mid p]]$$

where $\psi$ denotes the digamma function and $\alpha_0$ ("total evidence") modulates the certainty. This evidential framework captures both the confidence and variability of predictions, something softmax-based heads do not natively support.
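
The decomposition above can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and `digamma` is a hand-rolled approximation (recurrence plus an asymptotic series) included only so the example needs nothing beyond the standard library.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 * inv - series


def concentrations(logits):
    """alpha_c = 1 + softplus(z_c)."""
    return [1.0 + softplus(z) for z in logits]


def dirichlet_uncertainty(alpha):
    """Return (mean, total, aleatoric, epistemic), entropies in nats."""
    alpha0 = sum(alpha)
    mean = [a / alpha0 for a in alpha]
    total = -sum(m * math.log(m) for m in mean)
    aleatoric = -sum(
        m * (digamma(a + 1.0) - digamma(alpha0 + 1.0))
        for m, a in zip(mean, alpha)
    )
    return mean, total, aleatoric, total - aleatoric


# Two made-up logit vectors with uniform means but different total evidence.
for logits in ([-5.0, -5.0], [20.0, 20.0]):
    mean, total, aleatoric, epistemic = dirichlet_uncertainty(concentrations(logits))
    print(logits, f"total={total:.3f} aleatoric={aleatoric:.3f} epistemic={epistemic:.3f}")
```

Both inputs give the same mean prediction, but the higher-evidence one assigns less of the total entropy to the epistemic term, which is the behaviour the $\alpha_0$ discussion describes.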

3. Distillation Losses and Training Objectives

Dirichlet evidential distillation trains the student to match not just the mean but also the variance of the teacher ensemble's predictive distribution. The student is exposed only to the teacher’s output distributions $\{p(y \mid \theta_n, x)\}$, not internal states. Given teacher outputs from $N$ ensemble members or prompt variants with weights $w_n$, the loss function is the teacher-sample negative log-likelihood under the student’s Dirichlet, averaged over $M$ training inputs:

$$\mathcal{L}_\mathrm{Dir} = -\frac{1}{M}\sum_{i=1}^M \left[ \log\Gamma(\alpha_0^{(i)}) - \sum_{c=1}^K \log\Gamma(\alpha_c^{(i)}) + \sum_{n=1}^N w_n \sum_{c=1}^K (\alpha_c^{(i)}-1) \log p(y=c \mid \theta_n, x^{(i)}) \right]$$

This loss encourages Dirichlet means to align with the teacher’s average prediction and leverages the concentration parameters to match uncertainty. The normalizer term, $-\log B(\alpha) = \log\Gamma(\alpha_0) - \sum_{c=1}^K \log\Gamma(\alpha_c)$, acts as a built-in regularizer that avoids degenerate evidence concentrations.

For the standard softmax student, cross-entropy is used:

$$\mathcal{L}_\mathrm{Soft} = -\frac{1}{M} \sum_{i=1}^M \sum_{c=1}^K \bar{p}_{\mathcal{T},c}(x^{(i)}) \log \sigma(z_c^{(i)}),$$

with $\bar{p}_{\mathcal{T},c}$ the weighted mean teacher prediction per class and $\sigma$ the softmax function.
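
Both objectives can be written out directly from the formulas. The sketch below is a plain-Python illustration under stated assumptions: the function names, the `eps` clipping of teacher probabilities, and the assumption that the weights sum to 1 are choices made here, not details confirmed by the source. A practical implementation would use a tensor library with automatic differentiation.

```python
import math


def dirichlet_distill_loss(alphas, teacher_probs, weights, eps=1e-8):
    """Teacher-sample NLL under the student's Dirichlet.

    alphas:        M x K student concentrations
    teacher_probs: M x N x K teacher class probabilities (N ensemble members)
    weights:       N member weights (assumed here to sum to 1)
    """
    total = 0.0
    for alpha, member_probs in zip(alphas, teacher_probs):
        # -log B(alpha) = log Gamma(alpha_0) - sum_c log Gamma(alpha_c)
        log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
        fit = 0.0
        for w, probs in zip(weights, member_probs):
            # Clip probabilities away from zero so the log stays finite.
            fit += w * sum(
                (a - 1.0) * math.log(max(p, eps)) for a, p in zip(alpha, probs)
            )
        total += log_norm + fit
    return -total / len(alphas)


def soft_cross_entropy(logits, teacher_mean):
    """Cross-entropy between the mean teacher prediction and softmax(logits)."""
    total = 0.0
    for z, p_bar in zip(logits, teacher_mean):
        m = max(z)
        log_z = m + math.log(sum(math.exp(v - m) for v in z))  # log-sum-exp
        total -= sum(p * (v - log_z) for p, v in zip(p_bar, z))
    return total / len(logits)


# Made-up example: one input, one teacher member predicting (0.8, 0.2).
teacher = [[[0.8, 0.2]]]
aligned = dirichlet_distill_loss([[8.0, 2.0]], teacher, [1.0])
misaligned = dirichlet_distill_loss([[2.0, 8.0]], teacher, [1.0])
print(f"aligned={aligned:.4f} misaligned={misaligned:.4f}")
```

In this toy case the student whose Dirichlet mean agrees with the teacher receives the lower loss, as the text describes.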

4. Predictive Distribution Alignment and Theoretical Guarantees

Both Dirichlet and softmax students seek to match the first moment (mean) of the teacher’s predictive, but only the Dirichlet student can also encode the second moment (variance), thus capturing epistemic effects. This is achieved via the shape of the Dirichlet, parameterized by its concentration vector $\alpha$. Crucially, there is no need for additional KL-divergence constraints, as the negative log-likelihood of the Dirichlet naturally incorporates regularization. The student is trained solely on black-box teacher outputs (prompt ensemble predictions for each input) and thus inherits the uncertainty decomposition present in the teacher’s sampled distributional spread.
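
To make the second-moment claim concrete, the standard Dirichlet moment formulas can be written out. The numeric concentrations below are illustrative values chosen here, not figures from the paper.

```latex
\begin{align*}
\mathbb{E}[p_c] &= \frac{\alpha_c}{\alpha_0}, \qquad
\operatorname{Var}[p_c] = \frac{\alpha_c(\alpha_0 - \alpha_c)}{\alpha_0^2(\alpha_0 + 1)} \\
\alpha = (2, 2):&\quad \mathbb{E}[p_1] = 0.5, \quad
\operatorname{Var}[p_1] = \frac{2 \cdot 2}{4^2 \cdot 5} = 0.05 \\
\alpha = (20, 20):&\quad \mathbb{E}[p_1] = 0.5, \quad
\operatorname{Var}[p_1] = \frac{20 \cdot 20}{40^2 \cdot 41} \approx 0.0061
\end{align*}
```

Both concentration vectors yield the same mean, so a softmax head could not tell them apart, while the Dirichlet variance differs by roughly a factor of eight.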

5. Empirical Results: Benchmarking and Efficiency

Experiments on four classification datasets—Amazon Reviews polarity, SST-2, Yahoo Answers, and YouTube Comments—demonstrate the method’s effectiveness. All models use Mistral-7B v0.3 as a backbone, with LoRA adapters of approximate rank 4 fine-tuned on ∼10,000 samples per dataset.

Key empirical observations:

  • Accuracy: The Dirichlet student matches or outperforms the BayesPE teacher, e.g., 0.958 vs. 0.959 on Amazon, with a +1.7 percentage point margin on Yahoo.
  • Calibration: Expected Calibration Error (ECE) is halved relative to BayesPE on Amazon and SST-2; Yahoo sees a drop from 0.194 to 0.042.
  • Uncertainty metrics: The Dirichlet student achieves lower negative log-likelihood (NLL) and Brier scores than both teacher and softmax student in most cases.
  • Training efficiency: Students converge in 1–5 epochs; the softmax student trains marginally faster but both represent negligible overhead next to the cost for teacher ensembles.
  • Inference cost: BayesPE requires multiple prompt passes (up to 17 on Amazon, 4335s), whereas the student (Dirichlet or softmax) needs only a single forward pass (~252s), a ≈17× speedup.
  • Out-of-distribution detection: On OOD evaluation (trained on Amazon, tested on other datasets), the Dirichlet student attains total predictive entropy of 2.156 nats (vs. 0.525 for softmax) on Yahoo, with AUROC ≈0.96 for total uncertainty and ≈0.90 for epistemic uncertainty, outperforming both teacher and softmax student.
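
The AUROC figures in the last bullet treat an uncertainty score as a detector: a higher score should indicate an OOD input. A minimal sketch of that computation follows, using the pairwise (Mann-Whitney) definition of AUROC. The function name and the entropy values are made up for illustration and are not results from the paper.

```python
def auroc(scores_pos, scores_neg):
    """Probability that a random positive (OOD) score exceeds a random
    negative (in-domain) score, counting ties as one half."""
    if not scores_pos or not scores_neg:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical per-example total entropies (nats).
ood_entropy = [2.1, 1.8, 0.4]
in_domain_entropy = [0.5, 0.3, 0.6, 0.2]
print(f"AUROC = {auroc(ood_entropy, in_domain_entropy):.3f}")
```

The same function applies unchanged when the score is the epistemic term rather than the total entropy.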

6. Ablations, Trade-offs, and Implementation Considerations

Several practical factors influence model deployment and performance:

  • Prompt selection strategy: Distilling from the best, average, or worst prompts (as ranked by teacher weights) affects calibration on noisy datasets like YouTube but has a negligible effect on stable tasks like SST-2.
  • $\alpha_0$ scheduling: Learned, sample-wise $\alpha_0$ values concentrate in a narrow range (2–12 on YouTube). Using a global fixed $\alpha_0 \sim 10$ instead can slightly improve ECE or NLL via stronger regularization, though it requires a tuning set.
  • Resource trade-off: LoRA adaptation (0.1% of model weights) enables practical fine-tuning on a single GPU. Dirichlet evidence heads add negligible computational overhead relative to softmax, and retain the advantage of single-pass inference.
  • Implementation tips: Use $\alpha_c = 1 + \mathrm{softplus}(z_c)$ to enforce positive concentrations; monitor training NLL with the student’s marginal predictive for early stopping; gather teacher outputs in mixed prompt batches to avoid storage burden; and compute ECE with 10–20 equal-mass bins.
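
The equal-mass binning mentioned in the implementation tips can be sketched as below. This is one common way to compute ECE, assuming the confidence of each prediction is its top-class probability (for the Dirichlet student, the largest entry of $\alpha_c / \alpha_0$). The function name and the toy inputs are illustrative assumptions.

```python
def ece_equal_mass(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-mass (equal-count) bins.

    confidences: top-class probability for each example
    correct:     1 if the prediction was right, else 0
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    # Sort example indices by confidence so each bin holds about n / n_bins items.
    order = sorted(range(n), key=lambda i: confidences[i])
    ece = 0.0
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:
            continue  # more bins than examples
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(avg_conf - accuracy)
    return ece


# Toy data: ten predictions at 0.9 confidence, nine of them correct.
print(ece_equal_mass([0.9] * 10, [1] * 9 + [0], n_bins=1))
```

With heavily tied confidences, as in this toy input, the bin boundaries are arbitrary, so the example uses a single bin; on real data with 10–20 bins the ties matter less.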

7. Context and Implications

Dirichlet evidential distillation combines advancements in LoRA fine-tuning with theoretically grounded statistical modeling, unifying predictive accuracy and principled uncertainty quantification without the cost of Bayesian ensembles or repeated sampling. Empirical evidence shows that evidential students not only achieve competitive (or improved) calibration and accuracy scores but also exhibit enhanced sensitivity to epistemic uncertainty, which is critical for OOD detection and reliable deployment of LLMs in open-world scenarios. A plausible implication is that evidential distillation can supplant traditional MC or ensemble methods in production settings where efficiency and uncertainty estimates are both essential (Nemani et al., 24 Jul 2025).

References (1)
