Bayesian Non-negative Decision Layer (BNDL)
- BNDL is a variational Bayesian decision layer that uses nonnegative factor analysis to replace the conventional softmax layer.
- It leverages Weibull variational approximations for efficient training and closed-form KL divergences, ensuring robust uncertainty estimation.
- BNDL enhances model interpretability and disentanglement while achieving competitive or superior accuracy on benchmark vision datasets.
The Bayesian Non-negative Decision Layer (BNDL) is a variational Bayesian approach to the neural network decision layer, formulated as a conditional Bayesian non-negative factor analysis. BNDL addresses limitations in standard deep neural network uncertainty estimation and interpretability by introducing a fully Bayesian, nonnegative, and sparse final layer. Leveraging stochastic latent variables governed by non-negative support and sparse priors, BNDL provides robust uncertainty quantification and theoretical guarantees for disentangled representations. BNDL incorporates a tailored variational inference scheme using Weibull variational distributions, yielding closed-form KL divergences and efficient end-to-end training. Empirical evaluations demonstrate improvements in both uncertainty calibration and local interpretability, together with competitive or superior classification accuracy on benchmark vision datasets (Hu et al., 28 May 2025).
1. Model Architecture and Probabilistic Formulation
BNDL replaces the conventional softmax linear layer in deep networks with a conditional Bayesian non-negative factor model. The generative process is:
- For each observation $x_i$:
  - Features $h_i = f_\theta(x_i)$ are extracted by a deterministic (often deep) network.
  - Local latent factors are sampled as $z_i \sim \mathrm{Gamma}(\alpha_0, \beta_0)$, elementwise nonnegative, with $z_i \in \mathbb{R}_{\ge 0}^{K}$.
  - The global factor loading matrix $W \in \mathbb{R}_{\ge 0}^{K \times C}$ is assigned an elementwise prior $w_{kc} \sim \mathrm{Gamma}(\eta_0, \eta_0)$.
  - The categorical output is modeled with $y_i \sim \mathrm{Categorical}\big(\mathrm{softmax}(z_i^{\top} W)\big)$, so the unnormalized logit for class $c$ is $\phi_{ic} = \sum_{k} z_{ik} w_{kc}$.

This yields a hierarchical model:

$$z_i \sim \mathrm{Gamma}(\alpha_0, \beta_0), \qquad w_{kc} \sim \mathrm{Gamma}(\eta_0, \eta_0), \qquad y_i \mid z_i, W \sim \mathrm{Categorical}\big(\mathrm{softmax}(z_i^{\top} W)\big).$$

The joint over data $\{(x_i, y_i)\}_{i=1}^{N}$ is thus a conditional nonnegative matrix factorization, $\Phi = Z W$ with $Z \ge 0$ and $W \ge 0$. Practically, the standard deterministic feature extractor parameterized by $\theta$ is followed by a Bayesian factorization layer.
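As a concrete sketch, one stochastic forward pass through such a decision layer can be written in a few lines of NumPy. The way the features shift the Gamma shape, and the function names, are simplifying assumptions of this sketch, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def bndl_forward(h, W, alpha0=0.1, beta0=1.0):
    """One stochastic forward pass through a BNDL-style decision layer.

    h : (K,) nonnegative feature vector from the deterministic extractor
        (here it shifts the Gamma shape, a simplification for illustration).
    W : (K, C) nonnegative factor loading matrix.
    """
    # Sample nonnegative local factors; a small base shape alpha0 < 1
    # mimics the sparse Gamma prior on z_i.
    z = rng.gamma(shape=alpha0 + h, scale=beta0)        # (K,)
    logits = z @ W                                      # phi_i = z_i^T W
    probs = np.exp(logits - logits.max())               # stable softmax
    return probs / probs.sum()

K, C = 8, 3
W = rng.gamma(1.0, 1.0, size=(K, C))                    # nonnegative loadings
h = np.abs(rng.normal(size=K))                          # stand-in features
p = bndl_forward(h, W)
```

Because both $z$ and $W$ are nonnegative, every class score is an additive combination of factor activations, which is the property the interpretability claims rest on.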
2. Variational Inference and Optimization
Exact Bayesian inference in this model is intractable. BNDL addresses this by using Weibull variational approximations for both local and global latent variables:
- The variational posterior for $z_i$ is $q(z_i \mid x_i) = \mathrm{Weibull}(k_i, \lambda_i)$, with shape $k_i$ and scale $\lambda_i$ produced by neural networks from the extracted feature $h_i$. The parameterizations are $k_i = \mathrm{softplus}\big(g_k(h_i)\big)$ and $\lambda_i = \mathrm{softplus}\big(g_\lambda(h_i)\big)$, ensuring positivity.
- The variational posterior for the global factor is $q(w_{kc}) = \mathrm{Weibull}(\hat{k}_{kc}, \hat{\lambda}_{kc})$, with global free parameters $\{\hat{k}_{kc}, \hat{\lambda}_{kc}\}$ optimized by SGD.
Weibull distributions are chosen for their compatibility with the nonnegativity constraint and because they admit reparameterized gradients. The KL divergence between a Weibull and a Gamma distribution admits a closed form:

$$\mathrm{KL}\big(\mathrm{Weibull}(k,\lambda)\,\big\|\,\mathrm{Gamma}(\alpha,\beta)\big) = \frac{\gamma \alpha}{k} - \alpha \ln \lambda + \ln k + \beta \lambda\, \Gamma\!\Big(1+\frac{1}{k}\Big) - \gamma - 1 - \alpha \ln \beta + \ln \Gamma(\alpha),$$

where $\gamma$ is Euler's constant. The evidence lower bound (ELBO) optimized during training is

$$\mathcal{L} = \sum_{i=1}^{N} \mathbb{E}_{q(z_i \mid x_i)\,q(W)}\big[\ln p(y_i \mid z_i, W)\big] - \sum_{i=1}^{N} \mathrm{KL}\big(q(z_i \mid x_i)\,\big\|\,p(z_i)\big) - \mathrm{KL}\big(q(W)\,\big\|\,p(W)\big).$$
All learnable parameters, including the feature extractor parameters $\theta$, the local variational network parameters, and the global Weibull parameters, are optimized via stochastic gradient ascent with reparameterized sampling.
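The two ingredients that make this tractable, a reparameterized Weibull draw and the closed-form Weibull-to-Gamma KL, can be sketched in plain Python (function names are illustrative; the KL expression is the standard closed form used in Weibull-based variational inference):

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def sample_weibull(k, lam, u=None):
    """Reparameterized Weibull draw: lam * (-log(1-u))**(1/k), u ~ U(0,1).
    The draw is a deterministic, differentiable function of (k, lam) given u,
    which is what allows gradient-based training through the sampling step."""
    if u is None:
        u = random.random()
    return lam * (-math.log(1.0 - u)) ** (1.0 / k)

def kl_weibull_gamma(k, lam, alpha, beta):
    """Closed-form KL(Weibull(k, lam) || Gamma(alpha, beta))."""
    return (EULER_GAMMA * alpha / k
            - alpha * math.log(lam)
            + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))
```

A quick sanity check: Weibull(1, 1) and Gamma(1, 1) are both the unit exponential, so `kl_weibull_gamma(1.0, 1.0, 1.0, 1.0)` evaluates to zero, and the KL is strictly positive for any mismatched pair.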
3. Inductive Biases: Sparsity, Non-negativity, and Disentanglement
The non-negativity of both $z_i$ and $W$ follows from their Gamma/Weibull priors and posteriors. Sparsity in the factors is induced via small shape parameters in the Gamma prior and reinforced by shifted-ReLU thresholding of the final-layer weights, $\hat{W} = \max(W - \epsilon, 0)$, with $\epsilon$ a small hyperparameter.
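The thresholding step amounts to a shifted ReLU; a minimal sketch (the threshold value and function name are assumptions):

```python
import numpy as np

def threshold_weights(W, eps=0.05):
    """Shifted-ReLU sparsification: entries below eps become exactly zero,
    the rest are shifted down; the result stays nonnegative and sparse."""
    return np.maximum(W - eps, 0.0)

rng = np.random.default_rng(1)
# A small Gamma shape (0.3) produces many near-zero entries, mimicking
# the sparse prior on the loadings.
W = rng.gamma(0.3, 1.0, size=(16, 4))
W_sparse = threshold_weights(W)
density = (W_sparse > 0).mean()
```

Unlike L1 regularization, which only shrinks weights toward zero, this hard threshold yields exact zeros, which is what the identifiability conditions below rely on.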
Disentanglement is theoretically supported by a partial identifiability guarantee from nonnegative matrix factorization (NMF):
- If there exists a "selective window" (a row of with only one strong activation for a factor) and a "sparsity constraint" (columns of have at least zeros for rank ), then the -th column of is uniquely identifiable (up to permutation and scaling) [Gillis 2023]. In BNDL, the architecture and priors make these sparsity and selective-activation conditions likely to hold in practice, promoting disentangled class concepts in (Hu et al., 28 May 2025).
4. Uncertainty Estimation and Model Interpretability
Aleatoric uncertainty is captured by sampling $z_i \sim q(z_i \mid x_i)$ at test time, reflecting input-driven variability. Epistemic uncertainty is modeled by sampling $W \sim q(W)$. A single forward pass produces stochastic logits $\phi_i^{(s)} = z_i^{(s)\top} W^{(s)}$ for each draw $s$. Aggregating $S$ such passes provides a predictive empirical distribution

$$\hat{p}(y \mid x_i) = \frac{1}{S} \sum_{s=1}^{S} \mathrm{softmax}\big(z_i^{(s)\top} W^{(s)}\big).$$
Model uncertainty is quantified by performing a two-sample t-test on the top-two class scores over the ensemble, declaring the prediction uncertain if the resulting $p$-value exceeds the significance threshold (i.e., the top two scores are not significantly separated). The Patch-Accuracy vs. Patch-Uncertainty (PAvPU) metric quantifies uncertainty quality:

$$\mathrm{PAvPU} = \frac{n_{ac} + n_{iu}}{n_{ac} + n_{au} + n_{ic} + n_{iu}},$$

where the counts $n_{ac}$ ("accurate-certain") and $n_{iu}$ ("inaccurate-uncertain") reward both reliability in classification and reliability in uncertainty declarations.
Interpretability is enhanced by the non-negativity and sparsity of $W$, so each class aligns with an additive "concept." Local explanations generate heatmaps by (i) selecting the latent factor with the largest activation $z_{ik}$ for input $x_i$, and (ii) applying LIME to regress the superpixel map of $x_i$ onto the activation of $z_{ik}$, thus isolating the pixels most responsible for activating each "concept." Non-negativity ensures feature maps do not cancel, directly mapping activations to human-interpretable image parts.
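The idea behind these local explanations can be illustrated with a simplified patch-occlusion stand-in for the LIME regression step: pick the dominant factor, then measure how much zeroing each image patch reduces its activation. Everything here (the toy encoder, the patch size, all names) is illustrative, not the paper's pipeline:

```python
import numpy as np

def concept_occlusion_map(x, encode, patch=4):
    """Pick the latent factor with the largest activation for x, then score
    each (patch x patch) region by the activation drop when it is zeroed.
    `encode` maps an image to a nonnegative factor vector z (assumed)."""
    z = encode(x)
    k_star = int(np.argmax(z))                  # dominant "concept"
    H, W_ = x.shape
    heat = np.zeros((H // patch, W_ // patch))
    for i in range(0, H, patch):
        for j in range(0, W_, patch):
            x_occ = x.copy()
            x_occ[i:i + patch, j:j + patch] = 0.0
            heat[i // patch, j // patch] = z[k_star] - encode(x_occ)[k_star]
    return k_star, heat

# Toy encoder: factor k responds to the mean intensity of quadrant k.
def encode(x):
    h, w = x.shape[0] // 2, x.shape[1] // 2
    quads = [x[:h, :w], x[:h, w:], x[h:, :w], x[h:, w:]]
    return np.array([q.mean() for q in quads])

x = np.zeros((8, 8))
x[:4, 4:] = 1.0                                 # bright upper-right quadrant
k, heat = concept_occlusion_map(x, encode, patch=4)
```

Because the factors are nonnegative, occluding a relevant region can only lower the activation, so the heatmap directly reads as "these pixels support this concept," with no cancellation between positive and negative contributions.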
5. Empirical Evaluation
BNDL demonstrates robust performance on CIFAR-10, CIFAR-100, and ImageNet-1k using ResNet-18, ResNet-50, and ViT backbones. Comparative results (accuracy and PAvPU) are summarized below:
| Dataset | Model | ACC (%) | PAvPU (%) |
|---|---|---|---|
| CIFAR-10 | ResNet-BNDL | 95.54±0.08 | 95.58±0.20 |
| CIFAR-10 | MC Dropout | 94.54±0.03 | 78.83±0.12 |
| CIFAR-10 | BM | 94.07±0.07 | 93.98±0.30 |
| CIFAR-10 | CARD | 90.93±0.02 | 91.11±0.04 |
| CIFAR-100 | ResNet-BNDL | 79.82±0.13 | 81.10±0.21 |
| CIFAR-100 | MC Dropout | 78.12±0.06 | 64.41±0.22 |
| CIFAR-100 | BM | 75.81±0.34 | 77.13±0.67 |
| CIFAR-100 | CARD | 71.42±0.01 | 71.48±0.03 |
| ImageNet-1k | ResNet-BNDL | 77.01±0.14 | 77.66±0.03 |
| ImageNet-1k | MC Dropout | 75.98±0.08 | 76.50±0.02 |
| ImageNet-1k | CARD | 76.20±0.00 | 76.29±0.01 |
Disentanglement is quantified using SEPIN@k metrics. For ImageNet-1k with a ResNet-50 backbone, ResNet-BNDL achieves higher SEPIN scores at all $k$ than the standard ResNet-50:
| Metric | SEPIN@1 | SEPIN@10 | SEPIN@100 | SEPIN@1000 | SEPIN@all |
|---|---|---|---|---|---|
| ResNet-50 | 1.50±0.02 | 1.03±0.01 | 0.60±0.01 | 0.31±0.01 | 0.23±0.01 |
| ResNet-BNDL | 2.59±0.03 | 2.12±0.01 | 1.30±0.01 | 0.65±0.01 | 0.44±0.01 |
BNDL drives a large fraction of final-layer weights to exactly zero with negligible loss in accuracy; on ImageNet, 75.7% accuracy is retained at 0.24% weight density. Qualitative LIME and Grad-CAM analyses demonstrate that BNDL's factors localize to semantically meaningful image parts more precisely than standard ResNets, reducing multifaceted, entangled activations (Hu et al., 28 May 2025).
6. Summary and Implications
BNDL provides an efficient Bayesian decision layer for deep networks with:
- Well-calibrated uncertainty estimation from stochastic latent factors, trained efficiently via closed-form KL divergences.
- Strong sparsity and non-negativity imposed by Gamma/Weibull priors and explicit thresholding.
- Theoretical guarantees for disentangled and interpretable decision representations under mild identifiability conditions.
- Empirical gains in calibration, interpretability, and accuracy across large-scale vision benchmarks.
A plausible implication is that replacing the standard softmax layer with a Bayesian non-negative structure such as BNDL enables generic deep feature extractors to yield more explainable, robust, and trustworthy predictions without sacrificing accuracy (Hu et al., 28 May 2025).