Confusion Matrix–Based Loss Functions

Updated 14 February 2026

Confusion matrix–based loss functions are methods that incorporate the entire error distribution of a classifier to provide nuanced misclassification penalties and address class imbalance.
They utilize differentiable surrogates, such as sigmoidF1 and AnyLoss, to approximate non-decomposable metrics for effective gradient-based optimization.
These approaches are supported by theoretical guarantees, including operator-norm stability and PAC-Bayesian bounds, that cover multiclass and multilabel settings.

A confusion matrix–based loss function is any objective for supervised learning whose formal construction, theoretical foundations, or optimization algorithm directly incorporates the confusion matrix or its functionals, rather than only scalar aggregations such as accuracy, cross-entropy, or per-class errors. Central to this paradigm is the ability to encode nuanced misclassification penalties, address class imbalance, and align optimization with non-decomposable evaluation metrics. This approach generalizes standard cost-sensitive and margin-based learning frameworks, admits generalization theory via matrix-norm stability or PAC-Bayesian bounds, and supports algorithmic designs that admit fully differentiable surrogates for essentially any metric arising from the confusion matrix.

1. Formal Definitions and Motivations

In multiclass settings, let $h: X \to \{1, \ldots, K\}$ denote a classifier, $p_\ell = P[y = \ell]$ the class priors, and $A(h) = [a_{\ell j}]$ the true confusion matrix with $a_{\ell j} = P(h(x) = j \mid y = \ell)$ and $\sum_j a_{\ell j} = 1$ per row. An “error-only” confusion matrix $C(h)$ is defined by zeroing the diagonal, i.e., $c_{\ell j} = 0$ if $\ell = j$, and $c_{\ell j} = a_{\ell j}$ otherwise. On a finite sample $S = \{ (x_i, y_i) \}_{i=1}^m$, the empirical version is

$\hat{c}_{\ell j} = \begin{cases} 0 & \text{if } j = \ell \\ \frac{1}{m_\ell} \sum_{i: y_i = \ell} \mathbf{1}[h(x_i) = j] & \text{if } j \neq \ell \end{cases}$

with $m_\ell = |\{i: y_i = \ell\}|$.
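As a minimal sketch of the empirical definition above, the following pure-Python function builds the row-normalized, zero-diagonal matrix from label lists. The function name `error_confusion_matrix`, the toy labels, and the 0-based class indexing (the text uses $1, \ldots, K$) are assumptions made for illustration.

```python
def error_confusion_matrix(y_true, y_pred, num_classes):
    """Row-normalized confusion matrix with the diagonal set to zero.

    Classes are assumed to be labelled 0..num_classes-1.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for y, h in zip(y_true, y_pred):
        counts[y][h] += 1

    matrix = []
    for label, row in enumerate(counts):
        m_label = sum(row)  # m_l: number of examples whose true class is l
        matrix.append([
            0.0 if (pred == label or m_label == 0) else row[pred] / m_label
            for pred in range(num_classes)
        ])
    return matrix


# Toy sample: 4 examples of class 0, 2 of class 1, 4 of class 2.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 2, 1, 0, 2, 2, 2, 0]
C_hat = error_confusion_matrix(y_true, y_pred, num_classes=3)
for row in C_hat:
    print(row)
```

Each row sums to the error rate of its class, so the single mistake among the two class-1 examples weighs as much as two mistakes among the four class-0 examples.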

Minimizing a confusion matrix–based loss is fundamentally motivated by the limitations of accuracy and misclassification rate, especially in imbalanced multiclass or multilabel regimes. Scalar metrics may obscure severe confusions in minority classes; in contrast, by targeting matrix-valued properties or composed scalarizations (e.g., the operator norm, F1, MCC), these losses enforce more equitable per-class performance and allow explicit trade-offs among types of errors (Koço et al., 2013).

2. Operator Norm and Confusion Matrix–Norm Loss

Prominent in multiclass learning, the operator (spectral) norm $\|C(h)\|$ serves as a principled loss:

$\|C(h)\| = \max_{v \neq 0} \frac{\|C(h) v\|_2}{\|v\|_2} = \sqrt{ \lambda_{\max}(C(h)^\top C(h)) }$

This norm upper-bounds the risk. With $p = (p_1, \ldots, p_K)$ the row vector of class priors,

$P(h(x) \neq y) = \|p C(h)\|_1 \leq \sqrt{K} \|C(h)\|$

The inequality follows from $\|x\|_1 \leq \sqrt{K} \|x\|_2$ for $x \in \mathbb{R}^K$ together with $\|p\|_2 \leq 1$. Because each row of $C(h)$ is normalized by its class size, errors in minority and majority classes contribute equally. Optimizing $\|C(h)\|$ is thus especially advantageous in imbalanced settings, enforcing uniform error reduction and mitigating “masked” minority-class confusions (Koço et al., 2013).
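The norm and the risk bound can be checked numerically on a small example. The sketch below estimates the spectral norm by power iteration on $C^\top C$ so that it needs only the standard library; the helper name `spectral_norm`, the iteration count, and the toy matrix and priors are illustrative assumptions, and a numerical library's SVD routine would normally be used instead.

```python
import math


def spectral_norm(C, iters=500):
    """Largest singular value of C, estimated by power iteration on C^T C."""
    rows, cols = len(C), len(C[0])
    v = [1.0] * cols
    for _ in range(iters):
        Cv = [sum(C[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(C[i][j] * Cv[i] for i in range(rows)) for j in range(cols)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:  # C is the zero matrix (no errors at all)
            return 0.0
        v = [x / norm_w for x in w]
    Cv = [sum(C[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in Cv))


# Error-only confusion matrix (rows: true class) and class priors.
C = [[0.0, 0.25, 0.25],
     [0.5, 0.0, 0.0],
     [0.25, 0.0, 0.0]]
priors = [0.4, 0.2, 0.4]

# P(h(x) != y) = sum_l p_l * (row sum of C for class l)
error_rate = sum(p * sum(row) for p, row in zip(priors, C))
norm = spectral_norm(C)
bound = math.sqrt(len(C)) * norm
print(error_rate, norm, bound)
```

For this matrix, $C^\top C$ has largest eigenvalue $0.3125$, so the norm is $\sqrt{0.3125}$ and the bound $\sqrt{3}\,\|C\|$ sits well above the error rate of $0.4$.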

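Direct minimization is intractable; thus, surrogate upper bounds are employed, such as:

$\|C_S\|^2 = \lambda_{\max}(C_S^\top C_S) \leq \operatorname{Tr}(C_S^\top C_S)$

where $C_S$ is the empirical error-only confusion matrix on the sample $S$; the inequality holds because $C_S^\top C_S$ is positive semidefinite. The trace can in turn be bounded above by a per-example exponential form suitable for gradient-based or boosting-style minimization:

$\sum_{i=1}^m \frac{1}{m_{y_i}} \sum_{j \neq y_i} \exp(f_h(i, j) - f_h(i, y_i))$

where $f_h(i, j)$ is the score the model assigns to class $j$ on example $x_i$ (Koço et al., 2013, Machart et al., 2012).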
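The per-example exponential surrogate can be evaluated directly from a score table. This is a hedged sketch rather than the authors' implementation: `exp_surrogate`, the 0-based class indices, and the toy scores are assumptions introduced here.

```python
import math


def exp_surrogate(scores, y_true):
    """sum_i (1 / m_{y_i}) * sum_{j != y_i} exp(f(i, j) - f(i, y_i)).

    scores[i][j] plays the role of f_h(i, j); classes are 0-indexed.
    """
    class_sizes = {}
    for y in y_true:
        class_sizes[y] = class_sizes.get(y, 0) + 1

    total = 0.0
    for f_i, y in zip(scores, y_true):
        penalty = sum(
            math.exp(f_i[j] - f_i[y]) for j in range(len(f_i)) if j != y
        )
        total += penalty / class_sizes[y]  # class-rebalancing weight 1/m_y
    return total


scores = [[2.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
y_true = [0, 0, 1]
loss = exp_surrogate(scores, y_true)
print(loss)
```

Widening the margin $f_h(i, y_i) - f_h(i, j)$ on any example lowers the loss, and the $1/m_{y_i}$ factor means that the single class-1 example carries as much total weight as the two class-0 examples combined.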

3. Generic Frameworks: Differentiable Surrogates for Confusion-Matrix Metrics

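Many evaluation metrics (e.g., F1, precision, MCC) are functions of entries in the confusion matrix. Because those entries are hard counts of thresholded predictions, the metrics are piecewise constant in the model parameters and provide no useful gradient. Several frameworks introduce smooth surrogates: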

sigmoidF1 (Bénédict et al., 2021): Replaces hard counts with soft (sigmoid-based) surrogates:

$\tilde{tp}^{(j)} = \sum_{i} S(z_i^j; \beta, \eta) y_i^j, \quad \tilde{fp}^{(j)} = \sum_{i} S(z_i^j; \beta, \eta) (1 - y_i^j), \quad \tilde{fn}^{(j)} = \sum_{i} (1 - S(z_i^j; \beta, \eta)) y_i^j$

The F1 is replaced by

$\mathcal{L}_{\tilde{F_1}} = 1 - \frac{2\tilde{tp}}{2\tilde{tp} + \tilde{fp} + \tilde{fn}}$

This approach generalizes: any confusion-matrix–based $M(tp, fp, fn, tn)$ can be converted to a smooth, globally differentiable loss by using sigmoid-softened counts (Bénédict et al., 2021).
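A minimal forward-pass sketch of the soft-count F1 loss for a single label column is given below. It assumes the tunable sigmoid takes the form $S(z; \beta, \eta) = 1 / (1 + e^{-\beta (z + \eta)})$, which is not stated in the text above; the function names and toy inputs are likewise illustrative. In a multilabel model the same computation would be repeated per label $j$.

```python
import math


def tunable_sigmoid(z, beta=1.0, eta=0.0):
    # Assumed form: slope beta and offset eta applied to the raw logit.
    return 1.0 / (1.0 + math.exp(-beta * (z + eta)))


def sigmoid_f1_loss(logits, labels, beta=1.0, eta=0.0):
    """1 - soft F1 computed from sigmoid-softened tp, fp and fn."""
    s = [tunable_sigmoid(z, beta, eta) for z in logits]
    tp = sum(si * yi for si, yi in zip(s, labels))
    fp = sum(si * (1 - yi) for si, yi in zip(s, labels))
    fn = sum((1 - si) * yi for si, yi in zip(s, labels))
    # 2*tp + fp + fn = sum(s) + sum(labels) > 0 because every sigmoid is > 0.
    return 1.0 - 2.0 * tp / (2.0 * tp + fp + fn)


labels = [1, 0, 1, 0]
loss_uninformative = sigmoid_f1_loss([0.0, 0.0, 0.0, 0.0], labels)
loss_confident = sigmoid_f1_loss([10.0, -10.0, 10.0, -10.0], labels)
print(loss_uninformative, loss_confident)
```

With all logits at zero every soft prediction is $0.5$, giving soft counts $\tilde{tp} = \tilde{fp} = \tilde{fn} = 1$ and a loss of $0.5$; confident correct logits drive the loss toward zero.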

AnyLoss (Han et al., 2024): Constructs soft confusion matrix entries via amplified sigmoids:

$\tilde{TP} = \sum_i y_i A(p_i), \quad A(p) = \frac{1}{1 + \exp(-L(p-0.5))}$

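Here $p_i$ is the predicted probability for example $i$, and the remaining soft entries $\tilde{TN}$, $\tilde{FP}$, $\tilde{FN}$ are built analogously from $A(p_i)$ and $1 - A(p_i)$. For any user-supplied $M(\tilde{TP}, \tilde{TN}, \tilde{FP}, \tilde{FN})$, the loss is $\mathcal{L} = 1 - M(\ldots)$. The amplification scale $L$ tunes the sharpness and gradient properties. This enables fully differentiable optimization for arbitrary confusion-matrix–based metrics; the authors report better target-metric scores than standard baselines, especially on imbalanced data (Han et al., 2024).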
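The construction can be sketched in a few lines: amplify each probability, accumulate the four soft counts, and plug them into any metric. The names `amplify`, `soft_confusion`, `any_loss`, the two metric helpers, and the default `L=10.0` are assumptions chosen for this illustration, not values or APIs taken from the paper.

```python
import math


def amplify(p, L=10.0):
    """A(p) = 1 / (1 + exp(-L (p - 0.5))); L=10 is an arbitrary demo value."""
    return 1.0 / (1.0 + math.exp(-L * (p - 0.5)))


def soft_confusion(probs, labels, L=10.0):
    a = [amplify(p, L) for p in probs]
    tp = sum(y * ai for y, ai in zip(labels, a))
    fn = sum(y * (1 - ai) for y, ai in zip(labels, a))
    fp = sum((1 - y) * ai for y, ai in zip(labels, a))
    tn = sum((1 - y) * (1 - ai) for y, ai in zip(labels, a))
    return tp, tn, fp, fn  # same argument order as M(TP, TN, FP, FN)


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def f1(tp, tn, fp, fn):
    return 2 * tp / (2 * tp + fp + fn)


def any_loss(metric, probs, labels, L=10.0):
    return 1.0 - metric(*soft_confusion(probs, labels, L))


probs = [0.9, 0.8, 0.3, 0.2, 0.6]
labels = [1, 1, 0, 0, 0]
print(any_loss(accuracy, probs, labels), any_loss(f1, probs, labels))
```

Swapping the metric requires no change to the soft-count code, which is the property the framework relies on. Larger values of `L` push $A(p)$ closer to a hard threshold at $0.5$ at the cost of smaller gradients away from the boundary.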

Score-Oriented Loss (SOL) (Marchetti et al., 2021): Randomizes the decision threshold $\tau$ using a probability distribution with cumulative distribution function $F$, and computes the expected confusion matrix $\bar{CM}_F$. For any scalar skill score $s(CM)$:

$\bar{s}_F = s(\bar{CM}_F) \implies \ell_{SOL}(w) = -\bar{s}_F$

This probabilistic framework allows control over the effective decision boundary and aligns training objectives directly with test metrics (Marchetti et al., 2021).
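If the threshold $\tau$ is drawn from a distribution with CDF $F$, then $P(p_i > \tau) = F(p_i)$, so each expected count is a sum of $F(p_i)$ or $1 - F(p_i)$ terms. The sketch below uses that identity; the uniform CDF, the logistic stand-in for a concentrated threshold distribution, the TSS helper, and all function names are assumptions made for illustration.

```python
import math


def expected_confusion(probs, labels, cdf):
    """Expected (tp, tn, fp, fn) when the threshold tau is drawn from cdf."""
    f = [cdf(p) for p in probs]  # F(p_i) = P(example i is predicted positive)
    tp = sum(y * fi for y, fi in zip(labels, f))
    fn = sum(y * (1 - fi) for y, fi in zip(labels, f))
    fp = sum((1 - y) * fi for y, fi in zip(labels, f))
    tn = sum((1 - y) * (1 - fi) for y, fi in zip(labels, f))
    return tp, tn, fp, fn


def uniform_cdf(p):
    # tau ~ Uniform[0, 1]
    return min(max(p, 0.0), 1.0)


def logistic_cdf(mu=0.5, scale=0.05):
    # Illustrative stand-in for a distribution concentrated near mu.
    return lambda p: 1.0 / (1.0 + math.exp(-(p - mu) / scale))


def tss(tp, tn, fp, fn):
    # True skill statistic: sensitivity + specificity - 1
    return tp / (tp + fn) + tn / (tn + fp) - 1.0


def sol_loss(score, probs, labels, cdf):
    return -score(*expected_confusion(probs, labels, cdf))


probs = [0.9, 0.7, 0.4, 0.1]
labels = [1, 1, 0, 0]
loss_uniform = sol_loss(tss, probs, labels, uniform_cdf)
loss_sharp = sol_loss(tss, probs, labels, logistic_cdf(0.5, 0.05))
print(loss_uniform, loss_sharp)
```

Under the uniform choice $F(p) = p$, so the expected counts reduce to sums of raw probabilities. Concentrating $F$ around $0.5$ makes the expected matrix approach the hard-threshold matrix at that operating point.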

4. Algorithmic Realizations and Optimization Procedures

Several algorithmic strategies underlie confusion matrix–based optimization:

CoMBo (Confusion Matrix BOosting) (Koço et al., 2013): Adapts the AdaBoost.MM procedure to optimize a surrogate for $\|C_S\|$, using class-rebalanced exponential costs to focus learning on hard-to-classify and minority-class examples. At each boosting round, a weak learner is chosen with respect to a cost matrix encoding exponential penalties for off-diagonal (“error”) entries; an ensemble model’s additive scores are updated accordingly. Theoretical guarantees show exponential decay of the norm surrogate (Koço et al., 2013).
Differentiable Surrogates in Neural Networks: Smooth versions of confusion-matrix entries and metrics, as in sigmoidF1 and AnyLoss, are used directly as losses in neural network training via standard autodiff libraries. Implementation requires one additional sigmoid evaluation per sample plus the metric-specific algebra on the soft counts, whose gradients autodiff supplies; computational overhead is comparable to binary cross-entropy (Bénédict et al., 2021, Han et al., 2024).
SOL Integration: Uses batch-wise forward evaluation of CDF-transformed predictions, accumulation of expected confusion-matrix counts, and direct evaluation of target metric over these averages. Support for non-linear and non-decomposable metrics is provided via Taylor approximations and empirical averaging; practical implementations recommend tuning the threshold-distribution parameters to favor the intended deployment point (Marchetti et al., 2021).
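Why the soft counts matter for gradient-based training can be seen with a small numerical experiment. The sketch below compares central finite-difference gradients of hard-threshold F1 and sigmoid-softened F1 with respect to the logits; it is dependency-free for illustration only (a real training loop would use an autodiff library), and every name and value in it is an assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def f1_score(logits, labels, soft):
    if soft:
        preds = [sigmoid(z) for z in logits]
    else:
        preds = [1.0 if z > 0 else 0.0 for z in logits]  # hard threshold
    tp = sum(p * y for p, y in zip(preds, labels))
    fp = sum(p * (1 - y) for p, y in zip(preds, labels))
    fn = sum((1 - p) * y for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def finite_diff_grad(func, logits, h=1e-4):
    """Central finite-difference gradient of func with respect to each logit."""
    grads = []
    for k in range(len(logits)):
        up = list(logits)
        up[k] += h
        down = list(logits)
        down[k] -= h
        grads.append((func(up) - func(down)) / (2 * h))
    return grads


logits = [1.5, -0.3, 0.4, -2.0]
labels = [1, 1, 0, 0]
hard_grad = finite_diff_grad(lambda z: f1_score(z, labels, soft=False), logits)
soft_grad = finite_diff_grad(lambda z: f1_score(z, labels, soft=True), logits)
print(hard_grad)
print(soft_grad)
```

The hard metric does not change under small perturbations of any logit, so its gradient is zero everywhere it is defined. The soft version yields positive gradients on the logits of positive examples and negative gradients on those of negative examples, which is the signal an optimizer needs.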

5. Theoretical Guarantees and Generalization Bounds

Theoretical analyses for confusion matrix–based losses focus on uniform stability, concentration inequalities, and PAC-Bayesian risk bounds:

Uniform Stability in Operator Norm (Machart et al., 2012): If a learning algorithm is “confusion-stable” (the operator-norm difference in per-example loss matrices is at most $B/m_{y_i}$ when one sample is removed), one obtains, with high probability,

$\|\widehat{C}(h_S) - C(h_S)\| \leq 2B \sum_q \frac{1}{m_q} + Q \sqrt{8 \ln \frac{Q^2}{\delta} \left( 4 \sqrt{m^*} + M\sqrt{\frac{Q}{m^*}} \right)}$

for $m^* = \min_q m_q$, where $Q$ denotes the number of classes (written $K$ above). This bound highlights the critical role of per-class sample sizes and motivates resampling in imbalanced domains (Machart et al., 2012).

PAC-Bayesian Confusion Matrix Bounds (Morvant et al., 2012): For a fixed loss-weight matrix $L$, a prior $P$, and a posterior distribution $Q$ over classifiers defining the randomized (Gibbs) predictor $G_Q$, high-probability bounds on the operator-norm deviation of the Gibbs confusion matrix are given as:

$\| \widehat{C}^{G_Q} - C^{G_Q} \| \leq \sqrt{ \frac{8Q}{m_- - 8Q} \left( KL(Q \| P) + \ln \frac{m_-}{4\delta} \right)}$

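with $m_- = \min_p n_p$ the smallest per-class sample count. The $Q$ in the prefactor $8Q / (m_- - 8Q)$ is a count (the number of classes, written $K$ above) and is distinct from the posterior $Q$ in the KL term; the bound is meaningful only when $m_- > 8Q$. The bounds guide allocation of training samples and the trade-off in loss-weight selection (Morvant et al., 2012).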
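To get a feel for how the PAC-Bayesian bound scales with the smallest class size, it can simply be evaluated for a few values of $m_-$. The sketch reads the $Q$ in the prefactor as the number of classes; the function name, the KL value, and the confidence level are hypothetical inputs chosen only for illustration.

```python
import math


def pac_bayes_confusion_bound(num_classes, m_min, kl, delta):
    """sqrt( 8Q / (m_- - 8Q) * ( KL(Q||P) + ln(m_- / (4 delta)) ) ).

    num_classes stands for the Q in the prefactor; kl is KL(posterior || prior).
    """
    if m_min <= 8 * num_classes:
        raise ValueError("bound requires m_min > 8 * num_classes")
    prefactor = 8 * num_classes / (m_min - 8 * num_classes)
    return math.sqrt(prefactor * (kl + math.log(m_min / (4 * delta))))


# Hypothetical setting: 3 classes, KL = 5 nats, confidence 1 - delta = 0.95.
bounds = [
    pac_bayes_confusion_bound(num_classes=3, m_min=m, kl=5.0, delta=0.05)
    for m in (100, 1000, 10000)
]
print(bounds)
```

The values shrink roughly like $1/\sqrt{m_-}$ up to the logarithmic term, which is why the rarest class dominates the guarantee.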

Exponential Convergence of Surrogates: For boosting-style methods, if weak learners beat uniform baselines by margin $\gamma$ , the surrogate loss and thus the confusion-matrix norm decrease as $O(e^{-T\gamma^2/2})$ after $T$ boosting rounds (Koço et al., 2013).
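As a rough worked example of this rate, one can ask how many rounds are needed before $e^{-T\gamma^2/2}$ falls below a target $\varepsilon$. The sketch assumes the constant hidden in the $O(\cdot)$ is 1, which the source does not state, so the numbers indicate scaling only; `rounds_for_target` is a hypothetical helper.

```python
import math


def rounds_for_target(gamma, epsilon):
    """Smallest T with exp(-T * gamma**2 / 2) <= epsilon (constant assumed 1)."""
    return math.ceil(2.0 * math.log(1.0 / epsilon) / gamma ** 2)


gamma, epsilon = 0.1, 0.01
T = rounds_for_target(gamma, epsilon)
print(T, math.exp(-T * gamma ** 2 / 2))
```

Halving the edge $\gamma$ quadruples the required number of rounds, since $T$ scales as $1/\gamma^2$.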

6. Metric-Specific Loss Construction and Customization

All confusion-matrix–based metrics can be expressed as functions of the underlying counts (TP, TN, FP, FN or their multiclass generalizations). Using soft surrogates (sigmoid, amplifying sigmoids, CDF smoothing), one constructs globally differentiable approximation losses for:

F1, Precision, Recall, AUC: Each of these admits a $1 - \widetilde{M}$ loss, where $\widetilde{M}$ is the surrogate computed on soft counts (Bénédict et al., 2021, Han et al., 2024).
Matthews Correlation Coefficient (MCC): Explicit algebraic expressions in soft counts yield differentiable MCC-losses, supporting efficient learning in imbalanced binary or multiclass regimes (Han et al., 2024).
Cost-sensitive and Weighted Losses: By defining a loss-weight matrix $L$ (with $L_{pq}$ the cost of classifying class $p$ as class $q$), frameworks such as the PAC-Bayesian approach enable flexible penalization schemes (Morvant et al., 2012).
RIC-framework: Probabilistic SOL losses, by varying the CDF $F$ in expected confusion-matrix averaging, facilitate explicit control of the operating threshold and its variance (Marchetti et al., 2021).
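As one instance of the constructions listed above, an MCC loss can be written directly in terms of soft counts. This sketch uses raw predicted probabilities as the soft predictions (any of the smoothing schemes above could be substituted) and adds a small `eps` to the denominator for numerical safety; both choices, and all names, are assumptions of this illustration.

```python
import math


def soft_counts(probs, labels):
    tp = sum(p * y for p, y in zip(probs, labels))
    tn = sum((1 - p) * (1 - y) for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    return tp, tn, fp, fn


def mcc_loss(probs, labels, eps=1e-12):
    """1 - soft MCC; ranges over [0, 2] because MCC lies in [-1, 1]."""
    tp, tn, fp, fn = soft_counts(probs, labels)
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 1.0 - numerator / (denominator + eps)


labels = [1, 1, 0, 0]
loss_good = mcc_loss([0.99, 0.98, 0.02, 0.01], labels)
loss_flat = mcc_loss([0.5, 0.5, 0.5, 0.5], labels)
loss_bad = mcc_loss([0.01, 0.02, 0.98, 0.99], labels)
print(loss_good, loss_flat, loss_bad)
```

Unlike the F1-style losses, this one exceeds 1 when predictions are anti-correlated with the labels, and an uninformative constant prediction sits exactly at 1.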

7. Empirical Performance and Applications

Empirical studies demonstrate that loss functions directly optimizing confusion-matrix–based metrics yield substantial benefits in highly imbalanced or multilabel settings:

CoMBo Boosting: On UCI datasets with imbalance ratios up to 90:1, CoMBo achieves much lower confusion-matrix norms (e.g., $0.308$ vs. $0.670$) than non-class-balanced AdaBoost.MM, and produces a more balanced error profile across classes (Koço et al., 2013).
sigmoidF1: On multilabel tasks (MS-COCO, Pascal-VOC, arXiv2020, MoviePosters), sigmoidF1 outperforms BCE, focal, and sparse cross-entropy, boosting F1 by 1–2 percentage points and eliminating the need for post-hoc threshold-tuning (Bénédict et al., 2021).
AnyLoss: Across 102 UCI datasets, AnyLoss outperforms standard binary cross-entropy and mean square error on the intended metric in 60–95% of cases, especially on highly imbalanced datasets. Training overhead is minor and tuning of the amplification parameter $L$ is straightforward (Han et al., 2024).
SOL: On forecasting tasks, SOL losses based on F1, TSS, or CSI metrics yield higher domain-relevant scores and faster convergence, with the effective decision threshold concentrated near the desired deployment value (Marchetti et al., 2021).

Table: Representative Confusion-Matrix–Based Loss Frameworks

Framework	Metric Supported	Differentiability	Core Mechanism
CoMBo	Operator norm (multiclass)	Yes (surrogate)	Boosting on exponential costs (Koço et al., 2013)
sigmoidF1	F1, multilabel	Yes	Sigmoid-soft surrogates for counts (Bénédict et al., 2021)
AnyLoss	Any metric (binary)	Yes	Amplifying sigmoid + soft confusion (Han et al., 2024)
SOL	Any metric (binary)	Yes	Threshold-randomized expected confusion (Marchetti et al., 2021)

Additional theoretical frameworks (e.g., SVMs with confusion-norm regularization, PAC-Bayes risk bounds) support the general applicability and statistical grounding of confusion-matrix–based objectives (Machart et al., 2012, Morvant et al., 2012).

The body of research on confusion matrix–based loss functions demonstrates their flexibility, empirical effectiveness, and strong theoretical foundations. These losses provide a framework for directly optimizing domain-specific metrics, handling class imbalance, and enabling robust multiclass and multilabel modeling, with well-defined differentiable surrogates and concentration-based generalization guarantees.