
Bayesian Non-negative Decision Layer (BNDL)

Updated 7 February 2026
  • BNDL is a variational Bayesian decision layer that uses nonnegative factor analysis to replace the conventional softmax layer.
  • It leverages Weibull variational approximations for efficient training and closed-form KL divergences, ensuring robust uncertainty estimation.
  • BNDL enhances model interpretability and disentanglement while achieving competitive or superior accuracy on benchmark vision datasets.

The Bayesian Non-negative Decision Layer (BNDL) is a variational Bayesian approach to the neural network decision layer, formulated as a conditional Bayesian non-negative factor analysis. BNDL addresses limitations in standard deep neural network uncertainty estimation and interpretability by introducing a fully Bayesian, nonnegative, and sparse final layer. Leveraging stochastic latent variables governed by non-negative support and sparse priors, BNDL provides robust uncertainty quantification and theoretical guarantees for disentangled representations. It incorporates a tailored variational inference scheme using Weibull variational distributions, yielding closed-form KL divergences and efficient end-to-end training. Empirical evaluations demonstrate improvements in both uncertainty calibration and local interpretability, together with competitive or superior classification accuracy on benchmark vision datasets (Hu et al., 28 May 2025).

1. Model Architecture and Probabilistic Formulation

BNDL replaces the conventional softmax linear layer in deep networks with a conditional Bayesian non-negative factor model. The generative process is:

  • For each observation $x_j$:
    • Features $h_j = f_\theta(x_j)$ are extracted by a deterministic (often deep) network.
    • Local latent factors $z_j \in \mathbb{R}_+^K$ are sampled as $p(z_j|x_j) = \prod_k \text{Gamma}(z_{jk}; \alpha_{jk} = f_\theta(x_j)_k, \beta = 1)$.
    • The global factor loading matrix $\Phi \in \mathbb{R}_+^{K \times C}$ is assigned an elementwise prior $p(\Phi) = \prod_{k,c} \text{Gamma}(\Phi_{kc}; 1, 1)$.
    • The categorical output is modeled as $p(y_j|z_j,\Phi) = \text{Category}(y_j; z_j \Phi)$, so the unnormalized logit for class $c$ is $\sum_k z_{jk}\Phi_{kc}$.
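The generative process above can be sketched with ancestral sampling in numpy. The sizes `K` and `C`, the feature stand-in `h`, and the seed are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K, C = 8, 3              # illustrative numbers of latent factors and classes
h = rng.random(K) + 0.1  # stand-in for extracted features f_theta(x_j)

# Local latent factors: z_jk ~ Gamma(alpha_jk = f_theta(x_j)_k, beta = 1)
z = rng.gamma(shape=h, scale=1.0)

# Global loading matrix: Phi_kc ~ Gamma(1, 1), elementwise
Phi = rng.gamma(shape=1.0, scale=1.0, size=(K, C))

# Unnormalized score for class c is sum_k z_jk * Phi_kc; normalize into a Category distribution
scores = z @ Phi
probs = scores / scores.sum()

print(probs)
```

Because both `z` and `Phi` are Gamma draws, every class score is a sum of nonnegative contributions, which is the additive structure BNDL exploits for interpretability.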

This yields a hierarchical model:

$$p(y, z, \Phi | x) = p(y|z, \Phi)\, p(z|x)\, p(\Phi).$$

The joint over the data is thus a conditional nonnegative matrix factorization: $Y \approx Z\Phi$ with $Z, \Phi \geq 0$. Practically, the standard deterministic feature extractor parameterized by $\theta$ is followed by a Bayesian factorization layer.

2. Variational Inference and Optimization

Exact Bayesian inference in this model is intractable. BNDL addresses this by using Weibull variational approximations for both local and global latent variables:

  • The variational posterior for $z_j$ is $q(z_j|x_j) = \prod_k \text{Weibull}(z_{jk}; k_j(x_j), \lambda_{j,k}(x_j))$, with the parameters $k_j$ and $\lambda_j$ produced by neural networks from the extracted feature $h_j$: $k_j = \text{Softplus}(W_k h_j)$ and $\lambda_j = \text{ReLU}(W_\lambda h_j)/\exp(1+1/\epsilon)$.
  • The variational posterior for the global factor $\Phi$ is $q(\Phi) = \prod_{k,c} \text{Weibull}(\Phi_{kc}; k_{\Phi,kc}, \lambda_{\Phi,kc})$, with $k_\Phi, \lambda_\Phi$ global free parameters optimized by SGD.
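The Weibull posteriors above can be sampled with the standard inverse-CDF reparameterization, $z = \lambda(-\ln(1-u))^{1/k}$ with $u \sim \text{Uniform}(0,1)$. A minimal sketch, with illustrative parameter values rather than learned ones:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(0)

def weibull_rsample(k, lam, u):
    """Reparameterized Weibull draw via the inverse CDF:
    z = lam * (-log(1 - u))**(1/k), u ~ Uniform(0, 1).
    All randomness lives in u, so gradients can flow through k and lam."""
    return lam * (-np.log(1.0 - u)) ** (1.0 / k)

k, lam = 2.0, 1.5                 # illustrative parameters, not learned values
u = rng.uniform(size=100_000)
z = weibull_rsample(k, lam, u)

# Sanity check: the Weibull mean is lam * Gamma(1 + 1/k)
print(z.mean(), lam * gamma(1.0 + 1.0 / k))
```

This is exactly what lets the ELBO be optimized by pathwise gradients instead of higher-variance score-function estimators.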

Weibull distributions are chosen because they satisfy the nonnegativity constraint and admit reparameterized gradients. The KL divergence between a Weibull and a Gamma distribution has a closed form:

$$\mathrm{KL}\bigl(\text{Weibull}(k,\lambda)\,\|\,\text{Gamma}(\alpha,\beta)\bigr) = \frac{\gamma\alpha}{k} - \alpha\ln\lambda + \ln k + \beta\lambda\,\Gamma(1+1/k) - \gamma - 1 - \alpha\ln\beta + \ln\Gamma(\alpha),$$

where $\gamma \approx 0.577$ is the Euler–Mascheroni constant. The evidence lower bound (ELBO) optimized during training is

$$\mathcal{L} = \sum_{j=1}^J \mathbb{E}_{q(z_j,\Phi)}\bigl[\ln p(y_j|z_j,\Phi)\bigr] - \sum_{j=1}^J \mathrm{KL}\bigl(q(z_j|x_j)\,\|\,p(z_j|x_j)\bigr) - \mathrm{KL}\bigl(q(\Phi)\,\|\,p(\Phi)\bigr).$$

All learnable parameters, including the feature extractor $\theta$, variational parameters $\{k_j, \lambda_j\}$, and global parameters $\{k_\Phi, \lambda_\Phi\}$, are optimized via stochastic gradient ascent with reparameterized sampling.
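The closed-form KL term translates directly into code. A small sketch (not the authors' implementation), verified on the case where both distributions reduce to $\text{Exponential}(1)$:

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def kl_weibull_gamma(k, lam, alpha, beta):
    """Closed-form KL(Weibull(k, lam) || Gamma(alpha, beta)), term by term
    as written in the formula above."""
    return (EULER_GAMMA * alpha / k
            - alpha * math.log(lam)
            + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))

# Weibull(1, 1) and Gamma(1, 1) are both Exponential(1), so the KL is zero.
print(kl_weibull_gamma(1.0, 1.0, 1.0, 1.0))
```

Having this quantity in closed form means the KL part of the ELBO needs no Monte Carlo estimate; only the likelihood term is sampled.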

3. Inductive Biases: Sparsity, Non-negativity, and Disentanglement

The non-negativity of both $z$ and $\Phi$ follows from their Gamma/Weibull priors. Sparsity in the factors is induced via small shape parameters $\alpha_{jk} = f_\theta(x_j)_k$ in the Gamma prior and reinforced by ReLU thresholding of the final-layer weights: $w' = \text{ReLU}(w - \alpha_0)$ with $\alpha_0$ a small hyperparameter.
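The thresholding step $w' = \text{ReLU}(w - \alpha_0)$ is a one-liner; the weights and the $\alpha_0$ value below are illustrative:

```python
import numpy as np

def sparsify(w, alpha0=0.01):
    """Shifted-ReLU thresholding w' = ReLU(w - alpha0): entries at or below the
    small hyperparameter alpha0 become exactly zero, enforcing sparsity."""
    return np.maximum(w - alpha0, 0.0)

w = np.array([0.50, 0.005, -0.20, 0.02])
print(sparsify(w))  # small and negative entries are zeroed exactly
```

Unlike L1 regularization, which only shrinks weights toward zero, this shift-and-clip produces exact zeros, which is what enables the extreme weight sparsity reported in Section 5.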

Disentanglement is theoretically supported by a partial identifiability guarantee from nonnegative matrix factorization (NMF):

  • If there exists a "selective window" (a row of $Z$ with only one strong activation for a factor) and a "sparsity constraint" (columns of $\Phi$ have at least $r-1$ zeros for rank $r$), then the $k$-th column of $\Phi$ is uniquely identifiable up to permutation and scaling [Gillis 2023]. In BNDL, the architecture and priors make these sparsity and selective-activation conditions likely to hold in practice, promoting disentangled class concepts in $\Phi$ (Hu et al., 28 May 2025).

4. Uncertainty Estimation and Model Interpretability

Aleatoric uncertainty is captured by sampling $z_j$ from $q(z_j|x_j)$ at test time, reflecting input-driven variability. Epistemic uncertainty is modeled by sampling $\Phi$ from $q(\Phi)$. A single forward pass produces stochastic logits for each draw. Aggregating $T$ such passes provides a predictive empirical distribution

$$\hat{p}(y|x) \approx \frac{1}{T} \sum_{t=1}^T \text{Category}\bigl(y; z^{(t)}\Phi^{(t)}\bigr).$$
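The Monte Carlo aggregation over $T$ stochastic passes can be sketched as follows; the Gamma draws, sizes, and feature stand-in are toy assumptions rather than the paper's trained posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, T = 8, 3, 100          # illustrative sizes and number of MC passes
h = rng.random(K) + 0.1      # stand-in for extracted features f_theta(x)

def stochastic_pass():
    """One stochastic forward pass: draw z and Phi, return class scores z @ Phi.
    Gamma draws stand in for the Weibull variational posteriors."""
    z = rng.gamma(shape=h, scale=1.0)
    Phi = rng.gamma(shape=1.0, scale=1.0, size=(K, C))
    return z @ Phi

def predictive(T):
    """Empirical predictive: average the normalized scores of T stochastic passes."""
    probs = np.zeros(C)
    for _ in range(T):
        s = stochastic_pass()
        probs += s / s.sum()  # normalize each draw into a Category distribution
    return probs / T

p_hat = predictive(T)
print(p_hat)
```

The spread of the per-pass distributions around `p_hat`, rather than `p_hat` itself, is what carries the uncertainty signal used below.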

Model uncertainty is quantified by performing a two-sample t-test on the top-two class scores over the ensemble, declaring the prediction uncertain if $p < 0.05$. The Patch-Accuracy vs. Patch-Uncertainty (PAvPU) metric quantifies uncertainty quality:

$$\text{PAvPU} = \frac{n_{ac} + n_{iu}}{n_{ac} + n_{au} + n_{ic} + n_{iu}},$$

where $n_{ac}$ counts accurate-and-certain predictions and $n_{iu}$ counts inaccurate-and-uncertain ones, so a high PAvPU rewards reliable classification and reliable uncertainty declarations alike.
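The four counts and the ratio can be computed directly from per-prediction accuracy and uncertainty flags; the example inputs below are made up for illustration:

```python
def pavpu(accurate, uncertain):
    """PAvPU = (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu): the fraction of
    predictions that are accurate-and-certain or inaccurate-and-uncertain."""
    n_ac = sum(a and not u for a, u in zip(accurate, uncertain))
    n_au = sum(a and u for a, u in zip(accurate, uncertain))
    n_ic = sum(not a and not u for a, u in zip(accurate, uncertain))
    n_iu = sum(not a and u for a, u in zip(accurate, uncertain))
    return (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu)

# 3 accurate-certain + 1 inaccurate-uncertain out of 5 predictions -> 0.8
accurate  = [True, True, True, False, True]
uncertain = [False, False, False, True, True]
print(pavpu(accurate, uncertain))  # → 0.8
```

Note that the accurate-but-uncertain and inaccurate-but-certain cases both land in the denominator only, so they drag the score down.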

Interpretability is enhanced by the non-negativity and sparsity of $\Phi$, so each class aligns with an additive "concept." Local explanations generate heatmaps by (i) selecting the latent factor $k$ with the largest contribution $z_k \Phi_{k,\hat{y}}$ for input $x$, and (ii) applying LIME to regress the superpixel map of $x$ onto $z_k$'s activation, thus isolating the pixels most responsible for each "concept." Non-negativity ensures feature maps do not cancel, directly mapping activations to human-interpretable image parts.

5. Empirical Evaluation

BNDL demonstrates robust performance on CIFAR-10, CIFAR-100, and ImageNet-1k using ResNet-18, ResNet-50, and ViT backbones. Comparative results (accuracy and PAvPU) are summarized below:

| Dataset | Model | ACC (%) | PAvPU (%) |
|---|---|---|---|
| CIFAR-10 | ResNet-BNDL | 95.54±0.08 | 95.58±0.20 |
| CIFAR-10 | MC Dropout | 94.54±0.03 | 78.83±0.12 |
| CIFAR-10 | BM | 94.07±0.07 | 93.98±0.30 |
| CIFAR-10 | CARD | 90.93±0.02 | 91.11±0.04 |
| CIFAR-100 | ResNet-BNDL | 79.82±0.13 | 81.10±0.21 |
| CIFAR-100 | MC Dropout | 78.12±0.06 | 64.41±0.22 |
| CIFAR-100 | BM | 75.81±0.34 | 77.13±0.67 |
| CIFAR-100 | CARD | 71.42±0.01 | 71.48±0.03 |
| ImageNet-1k | ResNet-BNDL | 77.01±0.14 | 77.66±0.03 |
| ImageNet-1k | MC Dropout | 75.98±0.08 | 76.50±0.02 |
| ImageNet-1k | CARD | 76.20±0.00 | 76.29±0.01 |

Disentanglement is quantified using SEPIN@k metrics. For ImageNet-1k with a ResNet-50 backbone, ResNet-BNDL achieves higher SEPIN scores at every $k$ than plain ResNet-50:

| Model | SEPIN@1 | SEPIN@10 | SEPIN@100 | SEPIN@1000 | SEPIN@all |
|---|---|---|---|---|---|
| ResNet-50 | 1.50±0.02 | 1.03±0.01 | 0.60±0.01 | 0.31±0.01 | 0.23±0.01 |
| ResNet-BNDL | 2.59±0.03 | 2.12±0.01 | 1.30±0.01 | 0.65±0.01 | 0.44±0.01 |

BNDL achieves $>99\%$ zero weights in the final layer with negligible loss in accuracy; on ImageNet, 75.7% accuracy is retained at 0.24% weight density. Qualitative LIME and Grad-CAM analyses demonstrate that BNDL's factors localize to semantically meaningful image parts more precisely than standard ResNets, reducing multifaceted, entangled activations (Hu et al., 28 May 2025).

6. Summary and Implications

BNDL provides an efficient Bayesian decision layer for deep networks with:

  • Well-calibrated single-pass uncertainty estimation via stochastic latent factors and closed-form KL divergences.
  • Strong sparsity and non-negativity imposed by Gamma/Weibull priors and explicit thresholding.
  • Theoretical guarantees for disentangled and interpretable decision representations under mild identifiability conditions.
  • Empirical gains in calibration, interpretability, and accuracy across large-scale vision benchmarks.

A plausible implication is that replacing the standard softmax layer with a Bayesian non-negative structure such as BNDL enables generic deep feature extractors to yield more explainable, robust, and trustworthy predictions without sacrificing accuracy (Hu et al., 28 May 2025).
