Bayes-Optimal Classification: Non-Uniform Priors

Updated 8 May 2026

Incorporating non-uniform class priors shifts the Bayes decision rule: the classification threshold is biased toward or against a class, which is how the framework handles class imbalance.
Priors can be learned from data via radial-gradient search and leave-one-out cross-validation, with reported relative gains in PPV@250 of up to 443% over uniform-prior baselines.
Adaptive techniques include posterior reweighting, which preserves Bayes-optimal classification without retraining when priors shift, and the BOLT loss for training neural networks under a given class prior.

Bayes-optimal classification under non-uniform priors refers to the design and analysis of classifiers that explicitly incorporate arbitrary class prior distributions, as opposed to uniform or “noninformative” priors, into their decision rules and learning procedures. This framework is central to both theoretical and applied Bayesian machine learning, affecting not only the form of the decision boundary but also the achievable classification performance, model selection, and the calibration of risk under various data imbalances and target metrics.

1. Fundamental Decision Rule: Impact of Non-Uniform Priors

In Bayesian classification, the decision for an observation $x$ is based on the posterior $P(C_k|x)$, which, by Bayes’ formula, depends directly on the class priors $\pi_k$: $P(C_k|x) = \frac{P(x|C_k)\,\pi_k}{\sum_{j=1}^K P(x|C_j)\,\pi_j},\quad \pi_k\ge0,\,\sum_k\pi_k=1.$ For binary classification ($K=2$) under 0-1 loss, the Bayes-optimal rule becomes: $\text{decide } C_+ \text{ iff } \frac{P(x|+)}{P(x|-)} > \frac{\pi_-}{\pi_+}.$ This shows that non-uniform priors $\pi_+\neq\pi_-$ move the likelihood-ratio threshold away from 1, directly biasing the classifier towards or against classes based on prior beliefs or observed base rates. Such tilting is essential in domains with class imbalance, where naïve uniform priors can yield suboptimal outcomes and misaligned decision regions (Kayaalp, 2021).
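As a minimal sketch of the binary rule, assume (purely for illustration) two univariate Gaussian class-conditionals with equal variance and with the positive-class mean above the negative-class mean. The function names, means, and variance are assumptions and do not come from the cited work. The sketch shows how lowering $\pi_+$ moves the decision boundary toward the positive class mean, so more evidence is needed to decide $C_+$.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_decide_positive(x, prior_pos, mean_pos=1.0, mean_neg=-1.0, std=1.0):
    """Return True iff P(x|+) / P(x|-) exceeds pi_- / pi_+."""
    likelihood_ratio = gaussian_pdf(x, mean_pos, std) / gaussian_pdf(x, mean_neg, std)
    threshold = (1.0 - prior_pos) / prior_pos
    return likelihood_ratio > threshold


def decision_boundary(prior_pos, mean_pos=1.0, mean_neg=-1.0, std=1.0):
    """Point where the likelihood ratio equals pi_- / pi_+ (equal-variance case)."""
    midpoint = 0.5 * (mean_pos + mean_neg)
    shift = std ** 2 * math.log((1.0 - prior_pos) / prior_pos) / (mean_pos - mean_neg)
    return midpoint + shift


if __name__ == "__main__":
    for prior in (0.5, 0.1):
        print(prior, decision_boundary(prior), bayes_decide_positive(0.5, prior))
```

With uniform priors the boundary sits at the midpoint of the two means; with $\pi_+ = 0.1$ it shifts by $\sigma^2 \ln 9 / (\mu_+ - \mu_-)$ toward the positive mean.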

2. Learning Priors to Optimize Target Metrics

The optimality of a classifier depends not just on the prior but on how well that prior aligns with specific evaluation metrics relevant to the application. In “Learning Optimal Bayesian Prior Probabilities from Data” (Kayaalp, 2021), it is demonstrated that “noninformative” uniform priors (e.g., Bayes-Laplace, $\alpha_+=\alpha_-=1$ ) frequently yield suboptimal outcomes for practical measures such as Positive Predictive Value (PPV).

To address this, priors $\pi$ (parameterized as Dirichlet hyperparameters $\alpha_+, \alpha_-$) are learned directly from data by maximizing the application-driven objective, e.g., PPV: $(\alpha_+^*, \alpha_-^*) = \arg\max_{\alpha_+,\,\alpha_-} \mathrm{PPV}(\alpha_+, \alpha_-).$ Optimization is performed via radial-gradient search over a grid in the Dirichlet simplex, evaluating PPV through leave-one-out cross-validation. This search identifies priors that, when used for classification, maximize the desired predictive metric, outperforming uniform-prior baselines by large margins (e.g., relative gains in PPV@250 up to 443% in text classification tasks) (Kayaalp, 2021).
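The idea of choosing a prior by a target metric can be sketched with a plain exhaustive grid search over a single positive-class prior $\pi_+$. This is a simplified stand-in, not the radial-gradient search over Dirichlet hyperparameters with leave-one-out cross-validation described in (Kayaalp, 2021). The held-out log-likelihood ratios, the grid, and the `min_predicted` guard (which prevents a trivially perfect PPV obtained from a single prediction) are all hypothetical.

```python
import math


def ppv_at_prior(log_lrs, labels, prior_pos):
    """PPV of the rule 'predict + iff log LR > log(pi_- / pi_+)'.

    Returns (ppv, number_of_predicted_positives); ppv is 0.0 if nothing is predicted +.
    """
    threshold = math.log((1.0 - prior_pos) / prior_pos)
    predicted_labels = [y for s, y in zip(log_lrs, labels) if s > threshold]
    if not predicted_labels:
        return 0.0, 0
    return sum(predicted_labels) / len(predicted_labels), len(predicted_labels)


def select_prior(log_lrs, labels, grid, min_predicted=1):
    """Return (best_prior, best_ppv) over the grid, or (None, -1.0) if no prior qualifies."""
    best_prior, best_ppv = None, -1.0
    for prior_pos in grid:
        ppv, n_predicted = ppv_at_prior(log_lrs, labels, prior_pos)
        if n_predicted >= min_predicted and ppv > best_ppv:
            best_prior, best_ppv = prior_pos, ppv
    return best_prior, best_ppv


if __name__ == "__main__":
    # Hypothetical held-out log-likelihood ratios and true labels (1 = positive).
    held_out_scores = [3.0, 2.0, 1.0, 0.5, -0.5, -1.0, -2.0]
    held_out_labels = [1, 1, 0, 1, 0, 0, 0]
    print(select_prior(held_out_scores, held_out_labels, [0.1, 0.3, 0.5, 0.7], min_predicted=2))
```

In practice the metric would be estimated by cross-validation rather than on one fixed held-out set, and the search would run over the hyperparameters $(\alpha_+, \alpha_-)$ rather than a single prior.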

3. Posterior Adaptation and Dynamic Prior Reweighting

In many real-world scenarios, test-time class priors differ from training priors, necessitating dynamic adaptation of classification thresholds and posteriors. Given a trained model’s posteriors $P_{\text{train}}(C_k|x)$ under training priors $\pi_k^{\text{train}}$, the class-conditional likelihoods $P(x|C_k)$ can be recovered (up to a scale) as: $P(x|C_k) \propto \frac{P_{\text{train}}(C_k|x)}{\pi_k^{\text{train}}}.$ For any new test-time priors $\pi_k^{\text{test}}$, the updated posteriors are computed via reweighting: $P_{\text{test}}(C_k|x) = \frac{P_{\text{train}}(C_k|x)\,\pi_k^{\text{test}}/\pi_k^{\text{train}}}{\sum_{j=1}^K P_{\text{train}}(C_j|x)\,\pi_j^{\text{test}}/\pi_j^{\text{train}}}.$ This method, known as posterior adaptation, is computationally efficient ($O(K)$ per observation) and requires only knowledge of the original priors and posteriors, under the condition that all $\pi_k^{\text{train}}$ are strictly positive and the posteriors are well-calibrated. It preserves Bayes-optimality under the new prior regime and obviates retraining when only the class frequencies change (Davis, 2020).
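The reweighting step can be sketched in a few lines: divide each posterior by its training prior, multiply by the test-time prior, and renormalize. The function name and argument layout are assumptions, and the sketch assumes the input posteriors are well-calibrated.

```python
def adapt_posteriors(train_posteriors, train_priors, test_priors):
    """Reweight posteriors from training priors to test-time priors.

    All three arguments are sequences of length K, ordered by class.
    """
    if any(prior <= 0.0 for prior in train_priors):
        raise ValueError("training priors must be strictly positive")
    # Posterior / training prior is proportional to the class-conditional likelihood.
    weighted = [
        posterior * new / old
        for posterior, old, new in zip(train_posteriors, train_priors, test_priors)
    ]
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("reweighted posteriors sum to zero")
    return [w / total for w in weighted]


if __name__ == "__main__":
    # A model trained on balanced data, deployed where class 0 is four times rarer.
    print(adapt_posteriors([0.8, 0.2], [0.5, 0.5], [0.2, 0.8]))
```

Each call costs one pass over the $K$ classes, and the underlying model is never touched.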

4. Generalized Bayes-Optimality Theorems Beyond Accuracy

Classical Bayes-optimality is typically framed around misclassification error (accuracy) as the objective, but in imbalanced or application-critical settings, alternate confusion-matrix-based metrics (such as PPV, recall, $F_\beta$, etc.) may be more meaningful. “Optimal Binary Classification Beyond Accuracy” (Singh et al., 2021) establishes that, for arbitrary confusion-matrix measures, the Bayes-optimal classifier is a regression-thresholding classifier (RTC): $\hat{Y}_{p,t}(x) = \begin{cases} 1 & \text{if } \eta(x) > t, \\ \text{Bernoulli}(p) & \text{if } \eta(x) = t, \\ 0 & \text{if } \eta(x) < t, \end{cases}$ where $\eta(x) = P(Y=1 \mid X=x)$ is the regression function, $t \in [0,1]$ is a threshold, and $p \in [0,1]$ is a randomization probability. Critically, under non-uniform priors and nonstandard metrics, deterministic thresholding may be suboptimal; stochastic classifiers are sometimes necessary to maximize the chosen metric due to the potentially atomic nature of the distribution of $\eta(X)$.
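A regression-thresholding classifier with randomization at the threshold can be sketched as follows. Here `eta_x` stands for the class-1 probability $P(Y=1 \mid X=x)$, `threshold` for the cut-off, and `tie_prob` for the probability of predicting 1 exactly at the cut-off; these names are assumptions for illustration. The sketch shows why the rule is stochastic only on the set where the class-1 probability equals the threshold, which matters when that set has positive probability.

```python
import random


def rtc_predict(eta_x, threshold, tie_prob, rng=random):
    """Predict 1 above the threshold, 0 below it, and Bernoulli(tie_prob) exactly at it."""
    if eta_x > threshold:
        return 1
    if eta_x < threshold:
        return 0
    # Tie: randomize. rng.random() lies in [0, 1), so tie_prob = 1.0 always yields 1.
    return 1 if rng.random() < tie_prob else 0


if __name__ == "__main__":
    rng = random.Random(0)  # seeded generator so that runs are reproducible
    draws = [rtc_predict(0.5, 0.5, 0.3, rng) for _ in range(10000)]
    print(sum(draws) / len(draws))  # empirical frequency of predicting 1 at the tie
```

Setting `tie_prob` to 0 or 1 recovers the two deterministic thresholding rules.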

Uniform Class Imbalance (UCI), where the rare class prior vanishes with sample size, alters the achievable finite-sample performance and entails scaling of sample size and hyperparameters proportional to the prior. This analytical framework provides the first finite-sample guarantees for $k$-NN classification under general, non-uniform priors and metrics (Singh et al., 2021).

5. Asymptotic Bayes Optimality with Shrinkage Priors in High-Dimension

In high-dimensional multiple testing and sparse estimation regimes, Bayes-optimality under non-uniform priors governs the achievable risk and sparsity adaptation. Two major frameworks have been developed:

Two-groups model: Each mean parameter $\theta_i$ follows a spike-and-slab prior,

$\theta_i \sim (1-p)\,\delta_0 + p\,N(0,\psi^2),$

with $1-p$ the prior null probability, typically non-uniform, and the signal proportion $p$ possibly vanishing as $n$ grows.

One-group global-local shrinkage: Each $\theta_i$ is modeled as $\theta_i \mid \lambda_i, \tau \sim N(0, \lambda_i^2\tau^2)$, with the local scales $\lambda_i$ random and the global scale $\tau$ encoding global sparsity. Given the observation $X_i$, the posterior mean shrinkage weight $1 - E(\kappa_i \mid X_i)$, where $\kappa_i = 1/(1+\lambda_i^2\tau^2)$, acts as a signal-inclusion probability.

Bayes-optimal classification is achieved by thresholding this signal probability. Under sparsity assumptions and appropriate scaling (e.g., a global scale $\tau$ of the same order as the signal proportion $p$, $\tau \asymp p$), both oracle-tuned and empirical-Bayes versions of one-group shrinkage priors attain asymptotic Bayes-optimality, with risk matching the two-groups oracle up to a factor $1+o(1)$ (Paul et al., 2024; Paul et al., 2022). The choice of prior directly influences the detection threshold and Bayes risk; misspecified or non-adaptive priors yield strictly suboptimal procedures.
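For the two-groups model, the thresholding rule can be sketched directly, since the posterior signal probability has a closed form. The sketch assumes unit-variance Gaussian noise, a slab $N(0, \psi^2)$ with known variance, a known signal proportion $p$, and equal losses for the two error types (hence the cut-off of 1/2). The symbols and function names are assumptions for illustration, and this is the oracle two-groups rule rather than a one-group shrinkage procedure.

```python
import math


def normal_pdf(x, variance):
    """Density of N(0, variance) evaluated at x."""
    return math.exp(-0.5 * x * x / variance) / math.sqrt(2.0 * math.pi * variance)


def signal_probability(x, p, psi2):
    """Posterior probability that the mean is non-null, given observation x.

    Marginally, x ~ (1 - p) * N(0, 1) + p * N(0, 1 + psi2).
    """
    null_part = (1.0 - p) * normal_pdf(x, 1.0)
    signal_part = p * normal_pdf(x, 1.0 + psi2)
    return signal_part / (null_part + signal_part)


def flag_signals(observations, p, psi2, cutoff=0.5):
    """Flag each observation whose posterior signal probability exceeds the cutoff."""
    return [signal_probability(x, p, psi2) > cutoff for x in observations]


if __name__ == "__main__":
    xs = [0.0, 1.0, -3.5, 4.0]
    print(flag_signals(xs, p=0.05, psi2=25.0))
    print(flag_signals(xs, p=0.001, psi2=25.0))  # sparser prior, stricter detection
```

Lowering `p` raises the magnitude an observation must reach before it is flagged, which is the sense in which the prior sets the detection threshold.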

6. Neural Network Training for Bayes-Optimality under Arbitrary Priors

Recent methodology extends Bayes-optimality principles to deep learning models via convex $f$-divergence-based upper bounds on the Bayes error. The Bayes-optimal learning threshold (BOLT) loss is constructed such that its minimization enforces convergence to the Bayes error bound under any given class prior. For priors $(\pi_1,\dots,\pi_K)$, the BOLT objective incorporates these priors explicitly. The BOLT loss is designed so that empirical risk minimization via gradient descent under either correct or reweighted sampling distributions achieves Bayes-optimal accuracy, even under severe class imbalance. Reported empirical evaluations indicate that BOLT-trained classifiers achieve error rates matching the theoretical Bayes bound and outperform standard cross-entropy when class priors are highly non-uniform (Naeini et al., 2025).

7. Theoretical and Practical Considerations

Non-uniform, data-driven priors function as a form of regularization and second-order feature selection, raising or lowering the barrier for evidence to influence posterior class probabilities. This mechanism generalizes across naive Bayes, Bayesian neural nets, and large-scale sparse estimation problems. Posterior-adaptation and prior-reweighting schemes enable efficient recalibration without retraining when priors shift at test time, provided the class-conditional densities are stable. However, care must be taken with extremely small class priors (for which ratio calculations can become numerically unstable), and proper calibration of posteriors is critical to avoid propagating errors through the model (Kayaalp, 2021; Davis, 2020).
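One common way to sidestep the instability of prior ratios for very small priors is to carry out the reweighting in log space and normalize with the log-sum-exp trick. The sketch below is an illustration of that general technique, not a procedure taken from the cited works; the function name is an assumption, and at least one input log-posterior is assumed to be finite.

```python
import math


def adapt_log_posteriors(log_posteriors, log_train_priors, log_test_priors):
    """Prior reweighting in log space; returns normalized log-posteriors."""
    scores = [
        log_post - log_old + log_new
        for log_post, log_old, log_new in zip(log_posteriors, log_train_priors, log_test_priors)
    ]
    # Log-sum-exp: subtract the maximum before exponentiating to avoid overflow
    # and to keep the dominant term exactly representable.
    peak = max(scores)
    log_normalizer = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_normalizer for s in scores]


if __name__ == "__main__":
    half = math.log(0.5)
    # A test-time prior of exp(-800) underflows to 0.0 as a float, but its log is fine.
    print(adapt_log_posteriors([half, half], [half, half], [0.0, -800.0]))
```

The rare class keeps a finite log-posterior here, whereas the same computation in linear space would collapse it to exactly zero.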

In summary, Bayes-optimal classification under non-uniform priors subsumes classical Bayesian decision theory, modern data-driven prior learning, confusion-matrix-consistent optimality, and scalable empirical and neural methodologies for learning and adapting classifiers in the presence of arbitrary class imbalance and high-dimensional uncertainty. These rigorously grounded results provide the mathematical foundations for principled, application-aware classifier design across contemporary machine learning domains.