
Bayesian Self-Distillation (BSD)

Updated 2 January 2026
  • Bayesian Self-Distillation is a method that combines Bayesian inference with self-distillation to distill posterior expectations into robust student models.
  • It leverages Monte Carlo sampling and Dirichlet-based updates to achieve improved calibration, accuracy, and uncertainty estimation in tasks like image classification.
  • BSD improves robustness and efficiency across model classes ranging from deep neural networks to large language models by transferring structured uncertainty information into a single student.

Bayesian Self-Distillation (BSD) is a class of principled machine learning methods that integrate Bayesian inference with self-distillation, enabling models to distill structured, uncertainty-aware knowledge from themselves or Bayesian ensembles. BSD systematically leverages posterior expectations—often beyond the predictive mean—to produce more informative, calibrated, and robust models suitable for modern deep learning applications ranging from image classification to LLM uncertainty estimation.

1. Bayesian Self-Distillation: Core Concepts and Definitions

BSD encompasses frameworks where a student model is trained to approximate expectations under the Bayesian posterior. Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ denote the training data, $\theta \in \mathbb{R}^P$ the model parameters with prior $p(\theta)$, and $p(\mathcal{D}|\theta)$ the data likelihood. The posterior is

$$p(\theta|\mathcal{D}) = \frac{p(\mathcal{D}|\theta)\,p(\theta)}{\int p(\mathcal{D}|\theta')\,p(\theta')\,d\theta'}.$$

BSD aims to distill knowledge encoded in posterior expectations of functions $T(\theta, x)$:

$$\mathbb{E}_{\theta\sim p(\theta|\mathcal{D})}[T(\theta,x)] = \int T(\theta, x)\,p(\theta|\mathcal{D})\,d\theta.$$

This expectation is typically intractable; practical BSD approximates it via Monte Carlo sampling with $M$ samples $\theta_1, \dots, \theta_M \sim p(\theta|\mathcal{D})$:

$$\hat t_M(x) = \frac{1}{M}\sum_{m=1}^M T(\theta_m, x).$$

A student network $S_\phi(x)$ is then trained to minimize the mean-squared error or cross-entropy (depending on $T$) against $\hat t_M(x)$ over a dataset $\mathcal{D}'$, usually $\mathcal{D}$ or an augmentation of it (Vadera et al., 2020).
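As a concrete sketch, the snippet below estimates $\hat t_M(x)$ from a given list of posterior-sampled PyTorch models and performs one student update against it; the choice of $T$ as the predictive distribution, and all function names, are illustrative rather than code from the cited work.

```python
import torch
import torch.nn.functional as F

def mc_target(posterior_models, x, T):
    """Monte Carlo estimate t_hat_M(x) = (1/M) * sum_m T(theta_m, x)."""
    with torch.no_grad():
        samples = [T(model, x) for model in posterior_models]   # M tensors
    return torch.stack(samples, dim=0).mean(dim=0)

def predictive_probs(model, x):
    """Example choice of T: the posterior predictive p(y | x, theta)."""
    return F.softmax(model(x), dim=-1)

def student_step(student, optimizer, posterior_models, x):
    """One BSD update: fit the student's predictive to the MC target via soft-label cross-entropy."""
    target = mc_target(posterior_models, x, predictive_probs)    # (B, C) soft labels
    log_probs = F.log_softmax(student(x), dim=-1)
    loss = -(target * log_probs).sum(dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```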

2. Methodological Variants in BSD

BSD supports a range of targets $T$, algorithms, and domain specializations:

a. Posterior Predictive and Expected Entropy

  • Posterior predictive: $T(\theta, x) = p(y|x, \theta)$; the student is trained to approximate the predictive mean via cross-entropy.
  • Expected entropy: $T(\theta, x) = -\sum_y p(y|x, \theta)\,\log p(y|x, \theta)$; the student mimics the expected uncertainty via regression (Vadera et al., 2020).
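Both targets can be computed from the same stack of sampled logits. A minimal sketch, assuming the logits of all $M$ posterior samples are stacked into a single tensor (names are illustrative):

```python
import torch
import torch.nn.functional as F

def posterior_predictive_target(sampled_logits):
    """Mean of p(y|x, theta_m) over M samples; shape (M, B, C) -> (B, C)."""
    return F.softmax(sampled_logits, dim=-1).mean(dim=0)

def expected_entropy_target(sampled_logits, eps=1e-12):
    """Mean per-sample entropy E_theta[-sum_y p log p]; shape (M, B, C) -> (B,)."""
    probs = F.softmax(sampled_logits, dim=-1)
    entropies = -(probs * (probs + eps).log()).sum(dim=-1)   # (M, B)
    return entropies.mean(dim=0)

# The student fits the first target with soft-label cross-entropy and the
# second with a regression loss (e.g., MSE on a scalar entropy head).
```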

b. Bayesian Self-Distillation for Image Classification

BSD can remove dependence on hard targets after initialization. Each sample's latent class is modeled with a Dirichlet prior, and posterior targets are iteratively refined via discounted aggregation of model predictions:

$$\alpha_i^t = \gamma\,\alpha_i^{t-1} + \hat y_i^t, \qquad y_i^t = \frac{\alpha_i^t}{A_i^t},$$

where $A_i^t = \sum_c \alpha_{i,c}^t$ is the total accumulated mass. This per-sample target is used for cross-entropy and updated each epoch, interpolating between new evidence and the accumulated target mass (Adelöw et al., 30 Dec 2025).
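A minimal sketch of the per-epoch target refresh, assuming $\alpha$ is kept as an (N, C) pseudo-count buffer, the normalizer is the per-sample sum of $\alpha$, and the value of $\gamma$ and the uniform initialization are illustrative rather than settings from the paper:

```python
import torch

def update_dirichlet_targets(alpha, probs, gamma=0.9):
    """Per-sample Bayesian target update:
        alpha_i^t = gamma * alpha_i^{t-1} + y_hat_i^t
        y_i^t     = alpha_i^t / sum_c alpha_{i,c}^t   (Dirichlet mean)
    alpha, probs: (N, C) accumulated pseudo-counts and current softmax outputs."""
    alpha = gamma * alpha + probs
    targets = alpha / alpha.sum(dim=-1, keepdim=True)
    return alpha, targets

# Typical use each epoch (illustrative shapes and values only):
# alpha = torch.ones(num_samples, num_classes)        # Dirichlet prior pseudo-counts
# alpha, soft_targets = update_dirichlet_targets(alpha, epoch_probs, gamma=0.9)
# loss = -(soft_targets * log_probs).sum(-1).mean()   # cross-entropy to the refined targets
```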

c. Efficient Bayesian LLM Distillation

For LLMs, the Bayesian teacher is marginalized over parameter samples, and the student is aligned to the teacher's predictive distribution via a KL divergence:

$$\mathcal{L}_{\mathrm{KD}}(\phi) = \mathbb{E}_{(x, y^*) \sim \mathcal{D}_{\mathrm{train}}}\bigl[\mathrm{KL}\bigl(p_B(\cdot \mid x)\,\|\,p_S(\cdot \mid x; \phi)\bigr)\bigr].$$

With curriculum mixing of the ground-truth cross-entropy loss, distillation can fully transfer calibration and uncertainty to a deterministic student LLM (Vejendla et al., 16 May 2025).
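One plausible form of the mixed objective, assuming a scalar curriculum weight `lam` that is annealed from ground-truth supervision toward the teacher's predictive distribution; the actual schedule and implementation in Vejendla et al. may differ:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_probs, labels, lam):
    """Curriculum-mixed objective:
        lam * CE(student, ground truth) + (1 - lam) * KL(teacher || student).
    `lam` starts near 1 (hard labels) and is annealed toward 0 (Bayesian teacher)."""
    log_q = F.log_softmax(student_logits, dim=-1)
    kl = (teacher_probs * (teacher_probs.clamp_min(1e-12).log() - log_q)).sum(dim=-1).mean()
    ce = F.cross_entropy(student_logits, labels)
    return lam * ce + (1.0 - lam) * kl
```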

d. Hyperparameter Transfer: BOSS Framework

“BOSS” instantiates BSD by combining Bayesian optimization (for hyperparameters) with self-distillation (transferring feature/logit-level knowledge across top-performing models) in a bi-level search procedure (Lee et al., 2023).
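The bi-level structure can be sketched as follows, under heavy assumptions: the outer loop proposes hyperparameters (a Bayesian-optimization surrogate in BOSS; any proposal function here), and the inner loop trains the candidate while distilling from the best model found so far. All callables and names below are placeholders, not the BOSS implementation.

```python
import copy

def boss_style_search(build_model, train_one_config, evaluate,
                      propose_hyperparameters, rounds=10):
    """Schematic bi-level loop: outer hyperparameter search, inner training with
    self-distillation from the incumbent best model. Every callable is a
    user-supplied placeholder (e.g., propose_hyperparameters would wrap a
    Bayesian-optimization surrogate)."""
    history, best_model, best_score = [], None, float("-inf")
    for _ in range(rounds):
        hp = propose_hyperparameters(history)            # outer level: suggest a config
        model = build_model(hp)
        teacher = copy.deepcopy(best_model) if best_model is not None else None
        train_one_config(model, hp, teacher)             # inner level: CE + (optional) KD loss
        score = evaluate(model)
        history.append((hp, score))
        if score > best_score:
            best_model, best_score = model, score        # new incumbent becomes the teacher
    return best_model, history
```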

3. Algorithmic Procedures and Implementation

BSD implementations generally follow this cycle (an end-to-end sketch follows the list):

  1. Posterior Sampling: Draw (possibly approximate) parameter samples from $p(\theta|\mathcal{D})$ (e.g., via SGLD, MC Dropout, or deep ensembles).
  2. Online Target Aggregation: Compute Monte Carlo or Bayesian-updated targets, optionally with memory discounting.
  3. Student Update: Optimize student parameters by minimizing deviation from estimated Bayesian targets.
  4. Repeat/Iterate: Continue until convergence or for a fixed number of epochs/sweeps.
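The sketch below ties the four steps together, using MC Dropout as the approximate posterior sampler (step 1), an EMA-style memory discount for target aggregation (step 2), and a soft-target cross-entropy student update (step 3), repeated across epochs (step 4). The loader yielding sample indices, the separate teacher/student modules, and the discount factor are assumptions for illustration; in pure self-distillation the teacher and student can be the same network.

```python
import torch
import torch.nn.functional as F

def mc_dropout_predictive(model, x, num_samples=8):
    """Step 1: approximate posterior sampling via MC Dropout (dropout kept active)."""
    model.train()                                        # keeps dropout stochastic
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(num_samples)])
    return probs.mean(dim=0)                             # Monte Carlo predictive mean

def bsd_epoch(teacher, student, optimizer, loader, targets, gamma=0.9):
    """Steps 2-3: memory-discounted target aggregation, then a student update.
    `loader` is assumed to yield (inputs, sample_indices); `targets` is an (N, C) buffer."""
    for x, idx in loader:
        mc = mc_dropout_predictive(teacher, x)
        targets[idx] = gamma * targets[idx] + (1.0 - gamma) * mc    # step 2: EMA-style discounting
        student.train()
        log_q = F.log_softmax(student(x), dim=-1)
        loss = -(targets[idx] * log_q).sum(dim=-1).mean()           # step 3: fit the soft targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return targets                                       # step 4: repeat across epochs
```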

Pseudocode for specific instances is given for neural networks (Vadera et al., 2020) and for per-sample Dirichlet-based updates (Adelöw et al., 30 Dec 2025).

In the LLM context, teacher ensemble outputs are precomputed for all data examples, and the student is trained only on this knowledge, without the need for a held-out validation set (Vejendla et al., 16 May 2025).
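A minimal sketch of that precomputation pattern, assuming the averaged teacher distributions fit in memory and that `teacher_samples` is a list of sampled teacher models; all names are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def precompute_teacher_targets(teacher_samples, input_batches):
    """Average each sampled teacher's predictive distribution once, up front."""
    cached = [torch.stack([F.softmax(t(x), dim=-1) for t in teacher_samples]).mean(dim=0)
              for x in input_batches]
    return cached                                        # one (B, C) target tensor per batch

def train_student_on_cache(student, optimizer, input_batches, cached_targets, epochs=3):
    """The student never queries the teacher again; it trains only on the cached targets."""
    for _ in range(epochs):
        for x, target in zip(input_batches, cached_targets):
            log_q = F.log_softmax(student(x), dim=-1)
            loss = -(target * log_q).sum(dim=-1).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```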

4. Empirical Performance and Practical Impact

Empirical evaluation across benchmarks demonstrates the following:

Image Classification & Uncertainty Estimation

  • BSD achieves higher test accuracy and improved calibration. For ResNet-50 on CIFAR-100, accuracy improves from 75.82% (baseline) to 79.09% and ECE drops from 20.41% to 7.17% (Adelöw et al., 30 Dec 2025).
  • BSD students match or exceed Bayesian ensembles for object uncertainty ranking (nDCG@20 within 1–3%) and OOD detection (AUROC within 2–5%) (Vadera et al., 2020).

Label Noise and Robustness

  • Under symmetric/asymmetric label noise, BSD outperforms MixUp, label smoothing, and other single-stage approaches, e.g., 50% symmetric noise: 81.49% → 90.71% for ResNet-18 (Adelöw et al., 30 Dec 2025).
  • Adding contrastive self-distillation (BSD⁺) further improves robustness in noisy-label tasks.

LLMs

  • EUD (efficient uncertainty distillation) achieves a test-time speedup proportional to the number of teacher samples (an $N\times$ reduction), with accuracy and calibration (ECE/NLL) competitive with or superior to MC-sampled Bayesian LLMs (Vejendla et al., 16 May 2025).

Hyperparameter Optimization

  • BOSS consistently beats both pure Bayesian optimization and standard self-distillation across architectures and tasks (including noisy-label and semi-supervised settings). Gains are robust to search/regularization hyperparameters (Lee et al., 2023).

5. Student Architectures and Search Strategies

BSD allows flexibility in student design:

  • Width/kernel multipliers: systematic search over scaling factors for layer widths and kernels, enabling Pareto-optimal trade-offs between inference cost and test accuracy (Vadera et al., 2020).
  • Group-sparse regularization / pruning: over-complete students are regularized for group-wise sparsity; after training, groups whose norm falls below a threshold are pruned and the network is fine-tuned to yield a compact architecture (Vadera et al., 2020). A minimal sketch of this penalty appears after this list.
  • Architecture preservation: In some BSD variants, the student architecture is kept fixed (as in Dirichlet BSD and EUD, to ensure fair comparison or deployment parity) (Adelöw et al., 30 Dec 2025, Vejendla et al., 16 May 2025).
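For the group-sparsity bullet above, a minimal sketch of a group-lasso penalty over output channels and a norm-threshold pruning check; the exact grouping and schedule in Vadera et al. may differ.

```python
import torch
import torch.nn as nn

def group_lasso_penalty(conv: nn.Conv2d):
    """Sum of L2 norms over output-channel groups: encourages whole channels to shrink."""
    w = conv.weight                                      # (out_ch, in_ch, kH, kW)
    return w.flatten(1).norm(dim=1).sum()

def prunable_channels(conv: nn.Conv2d, threshold=1e-3):
    """Channels whose group norm falls below the threshold can be removed, then fine-tuned."""
    norms = conv.weight.flatten(1).norm(dim=1)
    return (norms < threshold).nonzero(as_tuple=True)[0]

# Training adds  lambda_reg * sum(group_lasso_penalty(m) for m in conv_layers)  to the BSD
# loss; after convergence, the channels in prunable_channels(...) are dropped and the
# compact student is fine-tuned on the same Bayesian targets.
```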

6. Extensions, Limitations, and Theoretical Properties

Connections to Prior Methods

  • When the Bayesian aggregation parameter $\gamma = 0$ (Dirichlet BSD), the update reduces to progressive self-knowledge distillation (PS-KD); a short check of this reduction appears after this list.
  • For $c \to \infty$, training is equivalent to using hard targets. Fixing the initialization at the Dirichlet fixed point recovers label-smoothing or exponential moving average (EMA) approaches (Adelöw et al., 30 Dec 2025).
  • BOSS generalizes previous hyperparameter and model knowledge transfer; gains are maintained for various surrogate models, loss forms, and data regimes (Lee et al., 2023).
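To verify the first reduction, set $\gamma = 0$ in the Dirichlet update from Section 2b, assuming $\hat y_i^t$ is a normalized probability vector and $A_i^t = \sum_c \alpha_{i,c}^t$:

$$\alpha_i^t = 0\cdot\alpha_i^{t-1} + \hat y_i^t = \hat y_i^t, \qquad y_i^t = \frac{\hat y_i^t}{\sum_c \hat y_{i,c}^t} = \hat y_i^t,$$

so the training target collapses to the model's own most recent prediction, matching the progressive self-distillation behaviour noted above.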

7. Summary Table: BSD Variant Landscape

| BSD Variant | Target Knowledge | Architecture Flexibility | Main Empirical Gains |
| --- | --- | --- | --- |
| Posterior Expectation Distillation (Vadera et al., 2020) | Predictive mean, entropy | Flexible, auto-searched | Calibrated, lightweight students |
| Dirichlet Bayesian Self-Distillation (Adelöw et al., 30 Dec 2025) | Per-sample soft target | Fixed | SOTA calibration/robustness |
| Efficient Uncertainty Distillation (Vejendla et al., 16 May 2025) | LLM predictive mean | LoRA student | Sampling-free uncertainty |
| BOSS (Bayesian Opt + SD) (Lee et al., 2023) | Feature/logit-level | Any CNN | Persistent hyperparameter gains |
