
Bayesian Non-negative Decision Layer (BNDL)

Updated 7 February 2026
  • BNDL is a variational Bayesian decision layer that uses nonnegative factor analysis to replace the conventional softmax layer.
  • It leverages Weibull variational approximations for efficient training and closed-form KL divergences, ensuring robust uncertainty estimation.
  • BNDL enhances model interpretability and disentanglement while achieving competitive or superior accuracy on benchmark vision datasets.

The Bayesian Non-negative Decision Layer (BNDL) is a variational Bayesian approach to the neural network decision layer, formulated as a conditional Bayesian non-negative factor analysis. BNDL addresses limitations in standard deep neural network uncertainty estimation and interpretability by introducing a fully Bayesian, nonnegative, and sparse final layer. Leveraging stochastic latent variables governed by non-negative support and sparse priors, BNDL provides robust uncertainty quantification and theoretical guarantees for disentangled representations. It incorporates a tailored variational inference scheme using Weibull variational distributions, yielding closed-form KL divergences and efficient end-to-end training. Empirical evaluations demonstrate improvements in both uncertainty calibration and local interpretability, together with competitive or superior classification accuracy on benchmark vision datasets (Hu et al., 28 May 2025).

1. Model Architecture and Probabilistic Formulation

BNDL replaces the conventional softmax linear layer in deep networks with a conditional Bayesian non-negative factor model. The generative process is:

  • For each observation $x_j$:
    • Features $h_j = f_\theta(x_j)$ are extracted by a deterministic (often deep) network.
    • Local latent factors $z_j \in \mathbb{R}_+^K$ are sampled as $p(z_j|x_j) = \prod_k \text{Gamma}(z_{jk}; \alpha_{jk} = f_\theta(x_j)_k, \beta = 1)$.
    • The global factor loading matrix $\Phi \in \mathbb{R}_+^{K \times C}$ is assigned an elementwise prior $p(\Phi) = \prod_{k,c} \text{Gamma}(\Phi_{kc}; 1, 1)$.
    • The categorical output is modeled as $p(y_j|z_j,\Phi) = \text{Category}(y_j; z_j \Phi)$, so the unnormalized logit for class $c$ is $\sum_k z_{jk}\Phi_{kc}$.
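The generative process above can be sketched with ancestral sampling in numpy. The sizes `K` and `C`, the feature stand-in `h`, and the seed are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

K, C = 8, 3              # illustrative numbers of latent factors and classes
h = rng.random(K) + 0.1  # stand-in for extracted features f_theta(x_j)

# Local latent factors: z_jk ~ Gamma(alpha_jk = f_theta(x_j)_k, beta = 1)
z = rng.gamma(shape=h, scale=1.0)

# Global loading matrix: Phi_kc ~ Gamma(1, 1), elementwise
Phi = rng.gamma(shape=1.0, scale=1.0, size=(K, C))

# Unnormalized score for class c is sum_k z_jk * Phi_kc; normalize into a Category distribution
scores = z @ Phi
probs = scores / scores.sum()

print(probs)
```

Because both `z` and `Phi` are Gamma draws, every class score is a sum of nonnegative contributions, which is the additive structure BNDL exploits for interpretability.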

This yields a hierarchical model:

$$p(y, z, \Phi | x) = p(y|z, \Phi)\, p(z|x)\, p(\Phi).$$

The joint over the data is thus a conditional nonnegative matrix factorization: $Y \approx Z\Phi$ with $Z, \Phi \geq 0$. Practically, the standard deterministic feature extractor parameterized by $\theta$ is followed by a Bayesian factorization layer.

2. Variational Inference and Optimization

Exact Bayesian inference in this model is intractable. BNDL addresses this by using Weibull variational approximations for both local and global latent variables:

  • The variational posterior for $z_j$ is $q(z_j|x_j) = \prod_k \text{Weibull}(z_{jk}; k_j(x_j), \lambda_{j,k}(x_j))$, with the parameters $k_j$ and $\lambda_j$ produced by neural networks from the extracted feature $h_j$: $k_j = \text{Softplus}(W_k h_j)$ and $\lambda_j = \text{ReLU}(W_\lambda h_j)/\exp(1+1/\epsilon)$.
  • The variational posterior for the global factor $\Phi$ is $q(\Phi) = \prod_{k,c} \text{Weibull}(\Phi_{kc}; k_{\Phi,kc}, \lambda_{\Phi,kc})$, with $k_\Phi, \lambda_\Phi$ global free parameters optimized by SGD.
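The Weibull posteriors above can be sampled with the standard inverse-CDF reparameterization, $z = \lambda(-\ln(1-u))^{1/k}$ with $u \sim \text{Uniform}(0,1)$. A minimal sketch, with illustrative parameter values rather than learned ones:

```python
import numpy as np
from math import gamma

rng = np.random.default_rng(0)

def weibull_rsample(k, lam, u):
    """Reparameterized Weibull draw via the inverse CDF:
    z = lam * (-log(1 - u))**(1/k), u ~ Uniform(0, 1).
    All randomness lives in u, so gradients can flow through k and lam."""
    return lam * (-np.log(1.0 - u)) ** (1.0 / k)

k, lam = 2.0, 1.5                 # illustrative parameters, not learned values
u = rng.uniform(size=100_000)
z = weibull_rsample(k, lam, u)

# Sanity check: the Weibull mean is lam * Gamma(1 + 1/k)
print(z.mean(), lam * gamma(1.0 + 1.0 / k))
```

This is exactly what lets the ELBO be optimized by pathwise gradients instead of higher-variance score-function estimators.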

Weibull distributions are chosen because they satisfy the nonnegativity constraint and admit reparameterized gradients. The KL divergence between a Weibull and a Gamma distribution has a closed form:

$$\mathrm{KL}\bigl(\text{Weibull}(k,\lambda)\,\|\,\text{Gamma}(\alpha,\beta)\bigr) = \frac{\gamma\alpha}{k} - \alpha\ln\lambda + \ln k + \beta\lambda\,\Gamma(1+1/k) - \gamma - 1 - \alpha\ln\beta + \ln\Gamma(\alpha),$$

where $\gamma \approx 0.577$ is the Euler–Mascheroni constant. The evidence lower bound (ELBO) optimized during training is

$$\mathcal{L} = \sum_{j=1}^J \mathbb{E}_{q(z_j,\Phi)}\bigl[\ln p(y_j|z_j,\Phi)\bigr] - \sum_{j=1}^J \mathrm{KL}\bigl(q(z_j|x_j)\,\|\,p(z_j|x_j)\bigr) - \mathrm{KL}\bigl(q(\Phi)\,\|\,p(\Phi)\bigr).$$

All learnable parameters, including the feature extractor $\theta$, variational parameters $\{k_j, \lambda_j\}$, and global parameters $\{k_\Phi, \lambda_\Phi\}$, are optimized via stochastic gradient ascent with reparameterized sampling.
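The closed-form KL term translates directly into code. A small sketch (not the authors' implementation), verified on the case where both distributions reduce to $\text{Exponential}(1)$:

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

def kl_weibull_gamma(k, lam, alpha, beta):
    """Closed-form KL(Weibull(k, lam) || Gamma(alpha, beta)), term by term
    as written in the formula above."""
    return (EULER_GAMMA * alpha / k
            - alpha * math.log(lam)
            + math.log(k)
            + beta * lam * math.gamma(1.0 + 1.0 / k)
            - EULER_GAMMA - 1.0
            - alpha * math.log(beta)
            + math.lgamma(alpha))

# Weibull(1, 1) and Gamma(1, 1) are both Exponential(1), so the KL is zero.
print(kl_weibull_gamma(1.0, 1.0, 1.0, 1.0))
```

Having this quantity in closed form means the KL part of the ELBO needs no Monte Carlo estimate; only the likelihood term is sampled.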

3. Inductive Biases: Sparsity, Non-negativity, and Disentanglement

The non-negativity of both $z$ and $\Phi$ follows from their Gamma/Weibull priors. Sparsity in the factors is induced via small shape parameters $\alpha_{jk} = f_\theta(x_j)_k$ in the Gamma prior and reinforced by ReLU thresholding of the final-layer weights: $w' = \text{ReLU}(w - \alpha_0)$ with $\alpha_0$ a small hyperparameter.
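The thresholding step $w' = \text{ReLU}(w - \alpha_0)$ is a one-liner; the weights and the $\alpha_0$ value below are illustrative:

```python
import numpy as np

def sparsify(w, alpha0=0.01):
    """Shifted-ReLU thresholding w' = ReLU(w - alpha0): entries at or below the
    small hyperparameter alpha0 become exactly zero, enforcing sparsity."""
    return np.maximum(w - alpha0, 0.0)

w = np.array([0.50, 0.005, -0.20, 0.02])
print(sparsify(w))  # small and negative entries are zeroed exactly
```

Unlike L1 regularization, which only shrinks weights toward zero, this shift-and-clip produces exact zeros, which is what enables the extreme weight sparsity reported in Section 5.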

Disentanglement is theoretically supported by a partial identifiability guarantee from nonnegative matrix factorization (NMF):

  • If there exists a "selective window" (a row of $Z$ with only one strong activation for a factor) and a "sparsity constraint" (columns of $\Phi$ have at least $r-1$ zeros for rank $r$), then the $k$-th column of $\Phi$ is uniquely identifiable up to permutation and scaling [Gillis 2023]. In BNDL, the architecture and priors make these sparsity and selective-activation conditions likely to hold in practice, promoting disentangled class concepts in $\Phi$ (Hu et al., 28 May 2025).

4. Uncertainty Estimation and Model Interpretability

Aleatoric uncertainty is captured by sampling $z_j$ from $q(z_j|x_j)$ at test time, reflecting input-driven variability. Epistemic uncertainty is modeled by sampling $\Phi$ from $q(\Phi)$. A single forward pass produces stochastic logits for each draw. Aggregating $T$ such passes provides a predictive empirical distribution

$$\hat{p}(y|x) \approx \frac{1}{T} \sum_{t=1}^T \text{Category}\bigl(y; z^{(t)}\Phi^{(t)}\bigr).$$
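The Monte Carlo aggregation over $T$ stochastic passes can be sketched as follows; the Gamma draws, sizes, and feature stand-in are toy assumptions rather than the paper's trained posteriors:

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, T = 8, 3, 100          # illustrative sizes and number of MC passes
h = rng.random(K) + 0.1      # stand-in for extracted features f_theta(x)

def stochastic_pass():
    """One stochastic forward pass: draw z and Phi, return class scores z @ Phi.
    Gamma draws stand in for the Weibull variational posteriors."""
    z = rng.gamma(shape=h, scale=1.0)
    Phi = rng.gamma(shape=1.0, scale=1.0, size=(K, C))
    return z @ Phi

def predictive(T):
    """Empirical predictive: average the normalized scores of T stochastic passes."""
    probs = np.zeros(C)
    for _ in range(T):
        s = stochastic_pass()
        probs += s / s.sum()  # normalize each draw into a Category distribution
    return probs / T

p_hat = predictive(T)
print(p_hat)
```

The spread of the per-pass distributions around `p_hat`, rather than `p_hat` itself, is what carries the uncertainty signal used below.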

Model uncertainty is quantified by performing a two-sample t-test on the top-two class scores over the ensemble, declaring the prediction uncertain if $p < 0.05$. The Patch-Accuracy vs. Patch-Uncertainty (PAvPU) metric quantifies uncertainty quality:

$$\text{PAvPU} = \frac{n_{ac} + n_{iu}}{n_{ac} + n_{au} + n_{ic} + n_{iu}},$$

where $n_{ac}$ counts accurate-and-certain predictions and $n_{iu}$ counts inaccurate-and-uncertain ones, so a high PAvPU rewards reliable classification and reliable uncertainty declarations alike.
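The four counts and the ratio can be computed directly from per-prediction accuracy and uncertainty flags; the example inputs below are made up for illustration:

```python
def pavpu(accurate, uncertain):
    """PAvPU = (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu): the fraction of
    predictions that are accurate-and-certain or inaccurate-and-uncertain."""
    n_ac = sum(a and not u for a, u in zip(accurate, uncertain))
    n_au = sum(a and u for a, u in zip(accurate, uncertain))
    n_ic = sum(not a and not u for a, u in zip(accurate, uncertain))
    n_iu = sum(not a and u for a, u in zip(accurate, uncertain))
    return (n_ac + n_iu) / (n_ac + n_au + n_ic + n_iu)

# 3 accurate-certain + 1 inaccurate-uncertain out of 5 predictions -> 0.8
accurate  = [True, True, True, False, True]
uncertain = [False, False, False, True, True]
print(pavpu(accurate, uncertain))  # → 0.8
```

Note that the accurate-but-uncertain and inaccurate-but-certain cases both land in the denominator only, so they drag the score down.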

Interpretability is enhanced by the non-negativity and sparsity of $\Phi$, so each class aligns with an additive "concept." Local explanations generate heatmaps by (i) selecting the latent factor $k$ with the largest contribution $z_k \Phi_{k,\hat{y}}$ for input $x$, and (ii) applying LIME to regress the superpixel map of $x$ onto $z_k$'s activation, thus isolating the pixels most responsible for each "concept." Non-negativity ensures feature maps do not cancel, directly mapping activations to human-interpretable image parts.

5. Empirical Evaluation

BNDL demonstrates robust performance on CIFAR-10, CIFAR-100, and ImageNet-1k using ResNet-18, ResNet-50, and ViT backbones. Comparative results (accuracy and PAvPU) are summarized below:

| Dataset | Model | ACC (%) | PAvPU (%) |
|---|---|---|---|
| CIFAR-10 | ResNet-BNDL | 95.54±0.08 | 95.58±0.20 |
| CIFAR-10 | MC Dropout | 94.54±0.03 | 78.83±0.12 |
| CIFAR-10 | BM | 94.07±0.07 | 93.98±0.30 |
| CIFAR-10 | CARD | 90.93±0.02 | 91.11±0.04 |
| CIFAR-100 | ResNet-BNDL | 79.82±0.13 | 81.10±0.21 |
| CIFAR-100 | MC Dropout | 78.12±0.06 | 64.41±0.22 |
| CIFAR-100 | BM | 75.81±0.34 | 77.13±0.67 |
| CIFAR-100 | CARD | 71.42±0.01 | 71.48±0.03 |
| ImageNet-1k | ResNet-BNDL | 77.01±0.14 | 77.66±0.03 |
| ImageNet-1k | MC Dropout | 75.98±0.08 | 76.50±0.02 |
| ImageNet-1k | CARD | 76.20±0.00 | 76.29±0.01 |

Disentanglement is quantified using SEPIN@k metrics. For ImageNet-1k with a ResNet-50 backbone, ResNet-BNDL achieves higher SEPIN scores at every $k$ than plain ResNet-50:

| Model | SEPIN@1 | SEPIN@10 | SEPIN@100 | SEPIN@1000 | SEPIN@all |
|---|---|---|---|---|---|
| ResNet-50 | 1.50±0.02 | 1.03±0.01 | 0.60±0.01 | 0.31±0.01 | 0.23±0.01 |
| ResNet-BNDL | 2.59±0.03 | 2.12±0.01 | 1.30±0.01 | 0.65±0.01 | 0.44±0.01 |

BNDL achieves $>99\%$ zero weights in the final layer with negligible loss in accuracy; on ImageNet, 75.7% accuracy is retained at 0.24% weight density. Qualitative LIME and Grad-CAM analyses demonstrate that BNDL's factors localize to semantically meaningful image parts more precisely than standard ResNets, reducing multifaceted, entangled activations (Hu et al., 28 May 2025).

6. Summary and Implications

BNDL provides an efficient Bayesian decision layer for deep networks with:

  • Well-calibrated single-pass uncertainty estimation via stochastic latent factors and closed-form KL divergences.
  • Strong sparsity and non-negativity imposed by Gamma/Weibull priors and explicit thresholding.
  • Theoretical guarantees for disentangled and interpretable decision representations under mild identifiability conditions.
  • Empirical gains in calibration, interpretability, and accuracy across large-scale vision benchmarks.

A plausible implication is that replacing the standard softmax layer with a Bayesian non-negative structure such as BNDL enables generic deep feature extractors to yield more explainable, robust, and trustworthy predictions without sacrificing accuracy (Hu et al., 28 May 2025).
