Balanced Classification Metrics

Updated 14 July 2025
  • Balanced classification metrics are performance measures that assess models by equally weighting each class to address imbalances and avoid misleading accuracy results.
  • They compute averages of per-class recalls and F1-scores to ensure fair representation of minority classes in evaluation.
  • These metrics support robust evaluation in applications like rare disease detection and fraud prevention by emphasizing balanced performance across all categories.

Balanced classification metrics are a category of performance measures designed to fairly evaluate classification models, particularly in scenarios characterized by class imbalance or disparate class distributions. These metrics address deficiencies inherent in conventional accuracy and related measures, which can obscure poor performance on rare classes or overemphasize majority-class outcomes. Balanced metrics have become central to both methodological research and practical evaluation, playing a vital role in shared tasks, benchmarking, and applications where robust, equitable model assessment is required across all classes.

1. Fundamental Concepts and Definitions

Balanced classification metrics arise from the need to treat each class equitably in evaluation, regardless of their prevalence in the dataset. Conventional accuracy is defined as the fraction of correct predictions out of all instances, but this metric is sensitive to the underlying class distribution: models can achieve high accuracy by favoring the majority class, thus masking poor minority class performance (Grandini et al., 2020).

Balanced accuracy resolves this by averaging per-class recalls (true positive rates): $\text{Balanced Accuracy} = \frac{1}{K} \sum_{k=1}^K \frac{TP_k}{TP_k + FN_k}$, where $TP_k$ and $FN_k$ denote true positives and false negatives for class $k$, and $K$ is the number of classes (Grandini et al., 2020).
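
As a concrete illustration, the following minimal NumPy sketch (toy confusion matrix and function names are illustrative, not taken from the cited papers) computes balanced accuracy as the mean of per-class recalls and contrasts it with plain accuracy on an imbalanced example.

```python
# Minimal sketch, assuming a counts-based confusion matrix; names and toy
# numbers are illustrative, not taken from the cited papers.
import numpy as np

def balanced_accuracy(conf: np.ndarray) -> float:
    """conf[i, j] = number of instances of true class i predicted as class j."""
    per_class_recall = np.diag(conf) / conf.sum(axis=1)  # TP_k / (TP_k + FN_k)
    return float(per_class_recall.mean())

# Three classes, heavily imbalanced toward class 0.
conf = np.array([[90, 5, 5],
                 [ 8, 1, 1],
                 [ 6, 2, 2]])
print(balanced_accuracy(conf))            # ~0.40: poor minority-class recall is exposed
print(np.diag(conf).sum() / conf.sum())   # ~0.78: plain accuracy looks deceptively good
```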

Balanced error rate (BalER) is its counterpart, defined as the mean of per-class error rates, so that balanced accuracy equals $1 - \text{BalER}$ (Ferrer, 2022). Metrics such as the Macro F1-score, which averages class-wise F1-scores, and the Matthews Correlation Coefficient (MCC), which offers a balanced summary in both binary and multiclass settings, are also commonly used (Grandini et al., 2020, Gösgens et al., 2022).

A key property of balanced metrics is prevalence invariance: the score depends only on per-class (class-conditional) performance, not on how frequently each class occurs, so that systems are not unduly favored or penalized on imbalanced datasets (Opitz, 25 Apr 2024).
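
This invariance can be checked directly: in the toy sketch below (illustrative numbers), scaling the minority class's row of the confusion matrix changes its prevalence but not its recall, so plain accuracy shifts while balanced accuracy does not.

```python
# Toy check of prevalence invariance (illustrative numbers): scaling one class's
# row of the confusion matrix changes its prevalence but not its recall, so plain
# accuracy moves while balanced accuracy stays put.
import numpy as np

def plain_and_balanced(conf):
    recalls = np.diag(conf) / conf.sum(axis=1)
    return np.diag(conf).sum() / conf.sum(), recalls.mean()

conf = np.array([[90.0, 10.0],    # majority class: recall 0.9
                 [ 4.0,  6.0]])   # minority class: recall 0.6
print(plain_and_balanced(conf))   # (~0.87, 0.75)

conf[1] *= 5                      # make the minority class five times more prevalent
print(plain_and_balanced(conf))   # (~0.80, 0.75): balanced accuracy is unchanged
```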

2. Theoretical Properties and Axiomatic Frameworks

Analysis of balanced metrics often proceeds via an axiomatic approach, specifying desirable properties such as:

  • Maximal and Minimal Agreement: The metric attains its bounds only for perfect agreement or maximal disagreement between predictions and ground truth.
  • Symmetry: Swapping the roles of true and predicted labels leaves the value unchanged.
  • Monotonicity: Improving predictions should not decrease the metric; introducing errors should not increase it.
  • Constant Baseline: A random classifier yields a known baseline value, independent of class sizes.
  • Distance Interpretability: The metric or its transformation behaves as a mathematical distance.

The Generalized Means (GM) family, introduced in (Gösgens et al., 2022), provides a rigorous class of measures satisfying all but the distance property: $\text{GM}_r = \frac{n\, c_{11} - a_1 b_1}{\sqrt[r]{\frac{1}{2}(a_1^r a_0^r + b_1^r b_0^r)}}$, where $n$ is the total number of instances, $c_{11}$ the true-positive count, $a_1$ and $a_0$ the numbers of actually positive and negative instances, and $b_1$ and $b_0$ the numbers of predicted positives and negatives. Notable special cases include:

  • As $r \rightarrow 0$, the Matthews Correlation Coefficient is recovered.
  • For $r = -1$, the Symmetric Balanced Accuracy (SBA) is obtained, defined as

$\text{SBA} = \text{BA}(\mathcal{C}) + \text{BA}(\mathcal{C}^\top) - 1$

where $\mathcal{C}$ is the confusion matrix and $\mathcal{C}^\top$ its transpose.
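
For intuition, the following small sketch computes SBA from a counts-based confusion matrix exactly as prescribed by the identity above (toy counts, illustrative names).

```python
# Sketch of SBA as defined above: the balanced accuracy of the confusion matrix
# plus the balanced accuracy of its transpose, minus one (toy counts).
import numpy as np

def balanced_accuracy(conf):
    return float((np.diag(conf) / conf.sum(axis=1)).mean())

def symmetric_balanced_accuracy(conf):
    # BA(C) averages per-class recalls; BA(C^T) averages per-class precisions,
    # which is what makes the measure symmetric in true vs. predicted labels.
    return balanced_accuracy(conf) + balanced_accuracy(conf.T) - 1.0

conf = np.array([[45, 5],
                 [10, 40]])
print(round(symmetric_balanced_accuracy(conf), 3))   # ~0.70
```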

An impossibility theorem shows that a metric cannot simultaneously satisfy monotonicity, constant baseline, and the properties of a mathematical distance, thus motivating practitioners to consider which of these aspects are most critical for their application (Gösgens et al., 2022).

3. Macro Metrics, Class Sensitivity, and Prevalence Correction

Macro metrics, especially macro-averaged precision, recall, and F1, play a central role in multi-class and imbalanced scenarios. Macro Recall, for instance, is defined as $\text{macR} = \frac{1}{n} \sum_{i=1}^n R_i$ with $R_i = \frac{\text{correct}(i)}{\text{prevalence}(i)}$, while Macro F1 (macF1) typically takes the arithmetic mean of per-class F1-scores: $\text{macF1} = \frac{1}{n} \sum_{i=1}^n \frac{2 \cdot P_i \cdot R_i}{P_i + R_i}$, where $P_i$ and $R_i$ are class-wise precision and recall and $n$ is the number of classes (Opitz, 25 Apr 2024).
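
A short sketch of both macro metrics from a confusion matrix follows (toy counts; guards against empty classes are omitted for brevity).

```python
# Short sketch of macro recall and macro F1 from a confusion matrix, mirroring
# the formulas above (toy counts; zero-division guards omitted for brevity).
import numpy as np

def macro_recall_and_f1(conf):
    tp = np.diag(conf).astype(float)
    recall = tp / conf.sum(axis=1)       # correct(i) / prevalence(i)
    precision = tp / conf.sum(axis=0)    # correct(i) / number of predictions of class i
    f1 = 2 * precision * recall / (precision + recall)
    return recall.mean(), f1.mean()

conf = np.array([[80, 10, 10],
                 [ 5, 10,  5],
                 [ 4,  2,  4]])
macR, macF1 = macro_recall_and_f1(conf)
print(round(macR, 3), round(macF1, 3))   # 0.567 0.533
```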

Prevalence invariance—a defining property for balanced metrics—can also be enforced by "prevalence calibration," adjusting the confusion matrix such that each class contributes equally in the evaluation, irrespective of its frequency in the evaluation data. It is shown that, after such calibration, standard accuracy and macro recall coincide (Opitz, 25 Apr 2024).
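
The coincidence is easy to verify in a toy sketch; the row-normalization used below is one natural, assumed way of giving every class equal mass in the evaluation.

```python
# Toy illustration of prevalence calibration; the row-normalization used here is
# an assumed (but natural) way to give every class equal mass in the evaluation.
import numpy as np

conf = np.array([[90.0, 10.0],
                 [ 4.0,  6.0]])

macro_recall = (np.diag(conf) / conf.sum(axis=1)).mean()

calibrated = conf / conf.sum(axis=1, keepdims=True)          # each true class now has mass 1
accuracy_after_calibration = np.diag(calibrated).sum() / calibrated.sum()

print(macro_recall, accuracy_after_calibration)              # both 0.75
```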

Weighted versions of metrics, such as weighted F1, incorporate class frequencies as weights and thus are more sensitive to the actual empirical distribution, sometimes rendering them closer to micro metrics (Grandini et al., 2020, Opitz, 25 Apr 2024).

4. Prevalence Dependence, Adjusted Metrics, and Ranking Instability

In practice, many widely used metrics (especially precision, F1, and area under the Precision-Recall curve) are prevalence-dependent, leading to non-trivial implications for model evaluation, particularly under real-world imbalances. The following adjusted precision formula illustrates this dependence (Brabec et al., 2020): $\text{Prec}(\eta) = \frac{\text{TPR} \cdot \eta}{\text{TPR} \cdot \eta + \text{FPR} \cdot (1-\eta)}$, where $\eta$ is the target positive-class prevalence, TPR is the true positive rate, and FPR the false positive rate.
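
Once a classifier's TPR and FPR are fixed, the formula can be evaluated at any target prevalence; the sketch below (hypothetical operating point) shows how quickly precision degrades as positives become rarer, which is what the P3 curves mentioned below trace over a continuous range of $\eta$.

```python
# Sketch of the adjusted precision formula above for a hypothetical operating
# point (TPR and FPR assumed fixed as the prevalence shifts).
def adjusted_precision(tpr: float, fpr: float, eta: float) -> float:
    return (tpr * eta) / (tpr * eta + fpr * (1.0 - eta))

tpr, fpr = 0.90, 0.05
for eta in (0.5, 0.1, 0.01):
    print(eta, round(adjusted_precision(tpr, fpr, eta), 3))
# 0.5  -> 0.947
# 0.1  -> 0.667
# 0.01 -> 0.154
```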

This adjustment enables practitioners to "recalculate" precision (and by extension, F1 and PR-AUC) for any desired prevalence, highlighting that both absolute metric values and even the ranking of classifiers may change drastically as class distribution varies (Brabec et al., 2020). Thus, relying exclusively on a test set "matched" to the expected deployment prevalence is discouraged; the paper demonstrates that reweighting or subsampling can inadvertently inflate estimation variance.

Reporting metrics as functions (e.g., positive-prevalence precision or "P3 curves") provides a richer, scenario-flexible characterization of model performance under varying imbalances.

5. Balanced Metrics in Multi-Class and Real-World Applications

Balanced metrics are particularly salient in multi-class tasks, where class imbalance is common and per-class performance must be monitored robustly (Grandini et al., 2020). The Macro F1, balanced accuracy, and their weighted variants are widely used, with balanced accuracy giving every class equal influence and making the metric insensitive to imbalanced label distributions.

For continuous outputs, metrics such as Area Under ROC Curve (AUC), Brier Score, and cross-entropy extend the evaluation to probabilistic outputs. Proper scoring rules (PSRs) are used to assess both discrimination and calibration, with metrics such as calibration loss providing an interpretable measure of how far system probabilities can be improved by calibration (Ferrer, 2022).
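
As a simple illustration (toy labels and probabilities, not from the cited work), the Brier score and cross-entropy for a set of probabilistic binary predictions can be computed directly:

```python
# Toy computation of two proper scoring rules for probabilistic binary
# predictions: the Brier score and cross-entropy (labels and scores are made up).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # predicted P(y = 1)

brier = np.mean((p_pred - y_true) ** 2)
cross_entropy = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))

print(round(brier, 3), round(cross_entropy, 3))   # 0.116 0.372
```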

In practice, balanced metrics inform tasks such as rare disease detection, fraud detection, and long-tail recognition, ensuring that deployment decisions are robust even in extreme imbalance scenarios (Galdran et al., 2021).

6. Limitations, Visualization, and Uncertainty Quantification

Balanced metrics, while possessing desirable theoretical properties, are not immune to practical limitations. For example, when minority classes are drastically underrepresented, individual misclassifications can disproportionately impact the average, creating instability or large fluctuations in the metric (Grandini et al., 2020, Gösgens et al., 2022). Visualization techniques have been developed to model and display the uncertainty inherent in confusion-matrix-derived metrics, especially in small sample or imbalanced settings (Lovell et al., 2022).

By representing empirical confusion matrices within ROC or compositional spaces, and visualizing the probability mass functions of metrics under binomial or beta-binomial uncertainty, practitioners can better assess the reliability of reported results and avoid overinterpreting nominal metric differences that are not statistically significant.
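
One lightweight way to express such uncertainty, sketched below under an assumed binomial model with a uniform prior (SciPy provides the Beta quantiles), is to attach a credible interval to each per-class recall before averaging or comparing systems.

```python
# Sketch of uncertainty on a single per-class recall under a binomial model with
# a uniform Beta(1, 1) prior; SciPy's Beta distribution provides the quantiles.
from scipy.stats import beta

k, n = 7, 10                        # 7 of 10 minority-class instances correct
posterior = beta(k + 1, n - k + 1)  # Beta posterior over the true recall
low, high = posterior.ppf([0.025, 0.975])
print(f"recall estimate {k/n:.2f}, 95% credible interval [{low:.2f}, {high:.2f}]")
```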

Moreover, claims regarding the supposed superiority of certain metrics, such as the Matthews Correlation Coefficient, should be interpreted in the context of uncertainty and application-specific properties, as no single metric satisfies all ideal properties (Lovell et al., 2022, Gösgens et al., 2022).

7. Guidelines and Best Practices in Metric Selection

Recent surveys and critical reflections highlight the lack of transparency and justification in metric selection across shared tasks and research literature (Opitz, 25 Apr 2024). Recommendations include:

  • Explicitly state and define all evaluation metrics, providing formulas to ensure accuracy in interpretation.
  • Justify the choice of metrics with respect to the domain and class distribution, especially if the prevalence is expected to shift in deployment.
  • Report multiple metrics—including per-class scores and uncertainty estimates—rather than relying on a single composite value. This offers a fuller perspective on system behavior.
  • Consider metric calibration and prevalence invariance, particularly when fairness across classes is required or when operating under possible class prior changes.
  • Interpret rankings with caution, as system orderings can change under different metrics or prevalence scenarios.

This comprehensive approach enables robust, nuanced, and context-sensitive evaluation of classification systems, aligning model assessment with the demands of both rigorous scientific inquiry and practical deployment.