- The paper demonstrates that last-layer Bayesian inference yields high-quality epistemic uncertainty similar to full-network approaches.
- It uses theoretical analysis with random matrix theory and empirical evaluations on regression, classification, and language modeling tasks.
- Empirical results reveal that the LL-GLM offers competitive UQ with significant improvements in speed and memory efficiency.
Is the Last Layer Sufficient for Uncertainty Quantification?
Background and Motivation
Uncertainty quantification (UQ) in deep neural networks (DNNs) is essential for deployment in mission-critical applications, as it enables the assessment of epistemic uncertainty—uncertainty due to model parameter ambiguity—alongside improved prediction calibration. Bayesian approaches to UQ, including Laplace approximation, variational inference, stochastic weight averaging (SWAG), MC-Dropout, and deep ensembles (DE), have demonstrated varying degrees of efficacy. The primary challenge with most high-fidelity Bayesian UQ methods is their prohibitive computational and memory demands, motivating a quest for tractable yet accurate approximations.
One promising approach is to linearize the DNN around the parameters and apply Bayesian inference to the linearized model, resulting in Bayesian Generalized Linear Models (GLMs). Traditionally, two classes of linearization are considered:
- Full-network linearization (DNN-GLM): Linearizes around all parameters, inducing the Neural Tangent Kernel (NTK).
- Last-layer linearization (LL-GLM): Linearizes only around the parameters of the final fully connected layer, inducing the Conjugate Kernel (CK).
While last-layer linearization is widely adopted due to its computational tractability, it is frequently assumed to carry a performance cost relative to full-network linearization. This paper rigorously evaluates this assumption from both theoretical and empirical perspectives.
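To make the two linearizations concrete, the following minimal sketch builds both kernels for a toy one-hidden-layer network f(x) = v · tanh(W x). The architecture, the sizes, and the helper names `last_layer_features` and `full_jacobian` are illustrative assumptions, not taken from the paper. The CK is the Gram matrix of the gradients with respect to the last-layer weights v, while the NTK is the Gram matrix of the gradients with respect to all parameters. Width normalization is omitted for brevity.

```python
import numpy as np

def last_layer_features(W, X):
    """Gradient of f(x) = v . tanh(W x) with respect to v, one row per input."""
    return np.tanh(X @ W.T)                                    # shape (n, m)

def full_jacobian(W, v, X):
    """Gradient of f with respect to all parameters (W, then v), one row per input."""
    H = np.tanh(X @ W.T)                                       # hidden activations, (n, m)
    # d f / d W[j, k] = v[j] * (1 - tanh^2(W x)[j]) * x[k]
    dW = (v * (1.0 - H**2))[:, :, None] * X[:, None, :]        # (n, m, d)
    return np.concatenate([dW.reshape(len(X), -1), H], axis=1) # (n, m*d + m)

# Toy random network and inputs (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
n, d, m = 20, 5, 64
X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d)) / np.sqrt(d)
v = rng.normal(size=m) / np.sqrt(m)

Phi = last_layer_features(W, X)
J = full_jacobian(W, v, X)

K_ck = Phi @ Phi.T    # Conjugate Kernel: last-layer features only
K_ntk = J @ J.T       # Neural Tangent Kernel: all parameters
```

In this toy setting the NTK equals the CK plus a positive semi-definite term contributed by the hidden-layer gradients, and the last-layer feature matrix has m columns instead of m*d + m, which hints at where the computational savings come from.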
Theoretical Analysis
The central analytic tool is the connection between Bayesian GLMs and Gaussian Processes (GPs): linearizing a DNN yields a GP with a kernel determined by the features used in the linearization. For DNN-GLMs, this kernel is the NTK; for LL-GLMs, it's the CK. The Bayes Free Energy (BFE)—the negative log marginal likelihood—is used as a core metric since it encapsulates both fit and uncertainty calibration, and is asymptotically connected to generalization and PAC-Bayes bounds.
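As an illustration of the metric, the sketch below evaluates the BFE of a GP regression model with a Gaussian likelihood, where the negative log marginal likelihood has the standard closed form 0.5 yᵀ(K + σ²I)⁻¹y + 0.5 log det(K + σ²I) + (n/2) log 2π. The Gaussian likelihood, the noise variance `noise_var`, and the function name `bayes_free_energy` are assumptions made for this example; the paper's exact setup may differ.

```python
import numpy as np

def bayes_free_energy(K, y, noise_var=0.1):
    """Negative log marginal likelihood of y under a zero-mean GP with kernel
    matrix K and Gaussian observation noise of variance noise_var."""
    n = len(y)
    C = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(C)              # C = L L^T
    alpha = np.linalg.solve(L, y)          # L^{-1} y
    data_fit = 0.5 * alpha @ alpha         # 0.5 * y^T C^{-1} y
    complexity = np.sum(np.log(np.diag(L)))  # 0.5 * log det C
    return data_fit + complexity + 0.5 * n * np.log(2.0 * np.pi)

# Toy kernel built from random features (a stand-in for a CK or NTK Gram matrix).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(30, 10))
K = Phi @ Phi.T
y = rng.normal(size=30)
bfe = bayes_free_energy(K, y, noise_var=0.1)
```

The first term rewards fit to the data and the second penalizes kernel complexity, which is why a lower BFE reflects a better trade-off between the two rather than a better fit alone.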
Employing random matrix theory (RMT) under a double-asymptotic regime, in which layer width and dataset size both tend to infinity at a proportional rate, the authors precisely characterize the limiting BFE for both kernels. The key findings from the theoretical investigation are:
- BFE Comparison: For all regimes except when the sample size far exceeds network width (the highly over-sampled regime), LL-GLMs and DNN-GLMs yield statistically indistinguishable BFE, indicating no intrinsic model-fit or UQ benefit from full-network linearization.
- Robustness and Descent: Both CK and NTK are robust (insensitive to input perturbations), and both exhibit descent/double-descent phenomena in their BFE curves. The NTK achieves a lower BFE than the CK only in the highly over-sampled (underparameterized) limit.
- Trained vs. Random Networks: Empirical observations suggest that even the minor BFE gap between LL-GLM and DNN-GLM in the over-sampled regime diminishes as networks are trained, pointing to this difference as an artifact of random initialization rather than a fundamental property.
Empirical Evaluation
A comprehensive empirical suite was constructed to validate these theoretical claims and assess practical implications. The paper introduces LinearSampling, an efficient, scalable PyTorch implementation for sampling from GLM posteriors induced by both linearization strategies, facilitating large-scale comparison. The evaluation covered:
- Regression (UCI benchmarks): LL-GLMs performed on par with or occasionally outperformed DNN-GLMs in terms of Expected Calibration Error (ECE) and Negative Log Likelihood (NLL).
- Classification (MNIST, CIFAR-10, CIFAR-100, ImageNet): Across metrics designed for UQ quality—e.g., VARROC, VARROC-MI, and LPPD—LL-GLMs matched DNN-GLM performance, also yielding tight predictive uncertainty and competitive out-of-distribution (OOD) detection.
- Language Modeling (GPT-2, IMDB): On an LLM fine-tuned for classification, there was no statistically significant difference in UQ quality between the two linearization schemes. However, LL-GLMs yielded a 10x speedup and a 3.5x reduction in memory usage.
Notably, across all domains, both LL-GLM and DNN-GLM rivalled the computationally demanding deep ensembles (DE) in UQ performance, but at a fraction of computational and resource cost.
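For intuition about what sampling from a last-layer GLM posterior involves, here is a minimal sketch for the regression case with a Gaussian likelihood, where the posterior over last-layer weights is available in closed form. This is not the LinearSampling API: the function names, the isotropic Gaussian prior (`prior_var`), the noise variance (`noise_var`), and the random feature matrix standing in for penultimate-layer activations are all assumptions made for illustration.

```python
import numpy as np

def last_layer_posterior(Phi, y, prior_var=1.0, noise_var=0.1):
    """Gaussian posterior over last-layer weights w for y = Phi w + noise,
    with prior w ~ N(0, prior_var * I) and noise variance noise_var."""
    p = Phi.shape[1]
    precision = Phi.T @ Phi / noise_var + np.eye(p) / prior_var
    cov = np.linalg.inv(precision)
    cov = 0.5 * (cov + cov.T)              # symmetrize against round-off
    mean = cov @ Phi.T @ y / noise_var
    return mean, cov

def sample_predictions(Phi_test, mean, cov, n_samples, rng):
    """Draw weight samples from the posterior and push them through the features."""
    W = rng.multivariate_normal(mean, cov, size=n_samples)   # (n_samples, p)
    return W @ Phi_test.T                                    # (n_samples, n_test)

# Synthetic features standing in for penultimate-layer activations.
rng = np.random.default_rng(0)
Phi_train = rng.normal(size=(50, 8))
w_true = rng.normal(size=8)
y_train = Phi_train @ w_true + np.sqrt(0.1) * rng.normal(size=50)
Phi_test = rng.normal(size=(5, 8))

mean, cov = last_layer_posterior(Phi_train, y_train)
samples = sample_predictions(Phi_test, mean, cov, n_samples=200, rng=rng)

# Analytic epistemic variance of the prediction at each test point.
epistemic_var = np.einsum("ij,jk,ik->i", Phi_test, cov, Phi_test)
```

The posterior covariance here is only p x p, where p is the number of last-layer features, whereas a full-network linearization would require working with a matrix indexed by every parameter in the network (or an approximation to it).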
Technical Insights and Claims
- Computational Efficiency vs. Fidelity: The LL-GLM achieves epistemic UQ equivalent to that of the DNN-GLM at a dramatically reduced computational burden.
- Nature of Epistemic Uncertainty: The results suggest that most epistemic uncertainty in DNNs arises from how learned features are used by the final linear layer to map to outputs, not from uncertainty in the representation itself. This directly challenges the frequently articulated concern that last-layer approximations under-represent uncertainty propagated from earlier layers.
- Inductive Representations: Empirical analysis of singular value spectra illustrates that post-training, final-layer feature representations are often nearly rank-deficient, supporting efficient subspace-based posterior sampling and justifying the tractability of LL-GLM even in high sample/parameter regimes.
- Broader UQ Metrics: Careful metric selection is critical: conventional accuracy or NLL-based assessments may mischaracterize UQ quality. The VARROC/VARROC-MI metrics, which measure the ability of uncertainty estimates to distinguish between correctly and incorrectly classified or OOD samples, are demonstrated to be both practical and discriminative.
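The idea behind a variance-based ROC metric can be sketched as follows: treat each sample's uncertainty score as a detector for "this prediction is wrong" and compute the area under the ROC curve. The exact VARROC and VARROC-MI definitions used in the paper are not reproduced here; the function name `uncertainty_auroc` and the pairwise formulation are assumptions for illustration only.

```python
import numpy as np

def uncertainty_auroc(uncertainty, is_error):
    """AUROC of an uncertainty score used to detect errors: the probability that
    a randomly chosen erroneous sample has higher uncertainty than a randomly
    chosen correct one, with ties counted as one half."""
    u = np.asarray(uncertainty, dtype=float)
    e = np.asarray(is_error, dtype=bool)
    pos, neg = u[e], u[~e]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one erroneous and one correct sample")
    diff = pos[:, None] - neg[None, :]     # all (error, correct) pairs
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy example: the two misclassified samples carry the highest uncertainty.
score = uncertainty_auroc([0.1, 0.2, 0.8, 0.9], [False, False, True, True])
```

A value near 1 means high uncertainty reliably flags errors, while a value near 0.5 means the uncertainty estimate is uninformative. The same function applies to OOD detection by replacing `is_error` with an indicator for out-of-distribution samples.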
Implications and Future Directions
This work has significant practical implications: it indicates that for UQ in modern DNNs, restricting Bayesian inference to the last layer is sufficient for high-quality epistemic uncertainty estimates, obviating the need for full-network Bayesianization in most realistic settings. This finding will likely increase the adoption of Bayesian UQ in deployment scenarios—especially those constrained by computational or memory resources—by lowering the technical barrier.
The authors note key theoretical limitations. Chiefly, the analysis pertains strictly to randomly initialized networks in the infinite-width regime, so whether the claims hold in full generality under SGD/Adam-based training dynamics and practical dataset distributions remains an open research question. Extending the framework to jointly model aleatoric (data) uncertainty and developing scalable MCMC methods for posterior estimation in the last-layer regime present further opportunities.
Conclusion
The paper refutes the standard assumption that last-layer Bayesian linearization sacrifices UQ performance relative to full-network linearization in DNNs. Theoretical analysis, empirical evaluation, and a scalable implementation together indicate that the LL-GLM is as effective for epistemic UQ as the DNN-GLM while being far more computationally efficient. These results motivate a clear methodological shift: last-layer Bayesian approaches suffice for state-of-the-art UQ in deep learning, which should encourage broader adoption and deeper theoretical inquiry into uncertainty in representation learning.
Reference:
"Is the Last Layer Sufficient for Uncertainty Quantification?" (2605.30741)