Balanced Accuracy Metric
- Balanced accuracy is a metric defined as the average of class-wise recall, ensuring that each class is equally weighted regardless of its prevalence.
- It generalizes to multi-class scenarios by computing the macro-average of recalls and addresses the pitfalls of raw accuracy in imbalanced datasets.
- Extensions such as weighted balanced accuracy and group-aware threshold calibration enhance its applicability in fairness, medical diagnostics, and edge-device inference.
Balanced accuracy is a classification performance metric that quantifies the average per-class recall (sensitivity), weighting each class equally regardless of prevalence. It is specifically designed to address the pitfalls of raw accuracy in the presence of class imbalance, providing a robust, prevalence-independent measure for comparing classifiers across diverse application domains including imbalanced learning, medical diagnostics, calibrated decision systems, and edge-device inference. Balanced accuracy and closely related notions arise in both binary and multi-class settings, in threshold selection, and as the limiting case of certain ROC or cost-based analyses.
1. Mathematical Definition and Statistical Foundations
For a binary classifier with true positive rate (TPR or sensitivity) and true negative rate (TNR or specificity), balanced accuracy (BA) is formally given by

$$\text{BA} = \frac{\text{TPR} + \text{TNR}}{2} = \frac{r_{+} + r_{-}}{2},$$

where $r_{+} = \text{TPR}$ and $r_{-} = \text{TNR}$ are the recalls for the positive and negative classes, respectively (Collot et al., 8 Dec 2025, Ferrer, 2022, Gittlin, 29 Aug 2025, Du et al., 2020, Carrington et al., 2021, Cabitza et al., 2019).
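A minimal sketch of the binary formula in plain Python (the confusion-matrix counts are illustrative):

```python
def balanced_accuracy_binary(tp, fn, tn, fp):
    """Balanced accuracy from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity: recall on the positive class
    tnr = tn / (tn + fp)  # specificity: recall on the negative class
    return (tpr + tnr) / 2

# 90 true positives in total (80 caught), 910 negatives (500 correctly rejected)
ba = balanced_accuracy_binary(tp=80, fn=10, tn=500, fp=410)
print(round(ba, 4))  # 0.7192
```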
In the multi-class setting with $K$ classes, let $c_k$ be the number of correctly predicted examples in class $k$ and $n_k$ the total number of true examples in class $k$. The per-class accuracy (recall) is $r_k = c_k / n_k$, and the balanced accuracy generalizes to

$$\text{BA} = \frac{1}{K} \sum_{k=1}^{K} r_k = \frac{1}{K} \sum_{k=1}^{K} \frac{c_k}{n_k}.$$

This is a macro-average of class-wise recall and thus naturally extends to problems with $K = 2$ or $K > 2$ classes (Du et al., 2020, Ferrer, 2022, Cabitza et al., 2019).
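The macro-average can be computed directly from label lists; a toy sketch using only the standard library:

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Macro-average of per-class recall: BA = (1/K) * sum_k c_k / n_k."""
    n = Counter(y_true)                                         # n_k: true examples per class
    c = Counter(t for t, p in zip(y_true, y_pred) if t == p)    # c_k: correct per class
    return sum(c[k] / n[k] for k in n) / len(n)

y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
# per-class recalls: a = 3/4, b = 1/2, c = 1  ->  BA = (0.75 + 0.5 + 1) / 3
print(balanced_accuracy(y_true, y_pred))  # 0.75
```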
A related concept is the balanced error rate:

$$\text{BER} = 1 - \text{BA} = \frac{1}{K} \sum_{k=1}^{K} (1 - r_k).$$

This symmetry ensures that a trivial majority classifier achieves a lower bound of $\text{BA} = 1/2$ in the binary case.
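A toy check of this chance-level bound, assuming a 95/5 class split and a classifier that always emits the majority label:

```python
# Majority-class classifier: raw accuracy equals the majority prevalence,
# but balanced accuracy collapses to the chance level of 1/2.
y_true = [1] * 950 + [0] * 50   # 95% positive prevalence
y_pred = [1] * 1000             # always predict the majority class

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tpr = 1.0                       # every positive is "caught"
tnr = 0.0                       # no negative is ever predicted
ba = (tpr + tnr) / 2

print(acc, ba)  # 0.95 0.5
```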
2. Connections to Youden’s J Statistic and Expected Cost
Balanced accuracy is closely linked to Youden's $J$ statistic:

$$J = \text{TPR} + \text{TNR} - 1 = 2 \times \text{BA} - 1,$$

with

$$\text{BA} = \frac{J + 1}{2}.$$

This monotonic relationship means optimizing $J$ or BA yields identical classifier rankings. Theoretical arguments show that BA (or $J$) is the correct metric for tasks where the aim is to preserve prevalence differences between models, independent of class ratios. BA is also the slope by which differences in underlying class prevalence are preserved in output prevalence estimates, a property critical for fair evaluation in prevalence estimation and LLM judge selection (Collot et al., 8 Dec 2025).
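The ranking equivalence follows from $J = 2\,\text{BA} - 1$ being strictly increasing in BA; a sketch with made-up (TPR, TNR) pairs:

```python
# Three hypothetical classifiers with illustrative (TPR, TNR) operating points.
models = {"m1": (0.90, 0.60), "m2": (0.70, 0.85), "m3": (0.95, 0.50)}

ba = {name: (tpr + tnr) / 2 for name, (tpr, tnr) in models.items()}
j  = {name: tpr + tnr - 1   for name, (tpr, tnr) in models.items()}

# Sorting by BA or by Youden's J produces the same classifier ranking.
rank_ba = sorted(ba, key=ba.get, reverse=True)
rank_j  = sorted(j,  key=j.get,  reverse=True)
print(rank_ba == rank_j)  # True
```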
Furthermore, balanced accuracy can be viewed as a special case of the expected cost (EC) for classification when the cost structure assigns $c_{ij} = \frac{1}{K\,p_i}$ for $i \neq j$ and zero on the diagonal (where $p_i$ is the prevalence of class $i$), thus penalizing errors on all classes equally regardless of frequency (Ferrer, 2022).
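A numerical sanity check of this equivalence on toy labels (the data are illustrative, not from the cited work):

```python
from collections import Counter

# With off-diagonal costs c_ij = 1/(K * p_i) and zero on the diagonal,
# the expected cost equals the balanced error rate 1 - BA.
y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a", "a", "a", "a", "b", "c", "b", "b", "a", "a"]

classes = sorted(set(y_true))
K, N = len(classes), len(y_true)
n = Counter(y_true)                          # n_k: true counts per class
prev = {k: n[k] / N for k in classes}        # p_k: class prevalences

# Expected cost: average cost over samples, zero for correct predictions.
ec = sum(0.0 if t == q else 1 / (K * prev[t])
         for t, q in zip(y_true, y_pred)) / N

c = Counter(t for t, q in zip(y_true, y_pred) if t == q)  # correct per class
ba = sum(c[k] / n[k] for k in classes) / K

print(abs(ec - (1 - ba)) < 1e-12)  # True
```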
3. Comparative Behavior and Limitations
Unlike standard (unbalanced) accuracy,

$$A = \frac{\sum_{k=1}^{K} c_k}{\sum_{k=1}^{K} n_k},$$
which heavily weights common classes, balanced accuracy is prevalence-independent and gives equal influence to each class during macro-averaging. This circumvents the pathologies where trivial or majority-class classifiers appear performant. Balanced accuracy is also label-symmetric and avoids the necessity of designating a positive class, unlike F1 or precision.
However, BA presumes that error rates across all classes are of equal utility or cost, a potential mismatch for many real-world domains (e.g., critical diseases, high-risk fraud detection) (Du et al., 2020, Ferrer, 2022). It also considers only class recall; thus, overpredicting a class may artificially inflate BA in the multi-class case even when the decision process is poorly calibrated or imbalanced with regard to false positives.
To address these issues, Weighted Balanced Accuracy (WBA) introduces class-specific weights $w_k$ (with $\sum_k w_k = 1$), yielding:

$$\text{WBA} = \sum_{k=1}^{K} w_k\, r_k.$$
Weighting schemes can reflect rarity, cost, or any multi-criteria composite deemed relevant (Du et al., 2020, Cabitza et al., 2019).
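A sketch of WBA with illustrative weights (the class names, recalls, and weights are hypothetical):

```python
# Class-specific weights encode rarity or misclassification cost; here the
# rare class is up-weighted, so its poor recall dominates the weighted score.
recalls = {"common": 0.95, "rare_disease": 0.60}
weights = {"common": 0.20, "rare_disease": 0.80}   # must sum to 1

wba = sum(weights[k] * recalls[k] for k in recalls)
ba = sum(recalls.values()) / len(recalls)          # unweighted BA for contrast

print(round(ba, 3), round(wba, 3))  # 0.775 0.67
```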
4. Role in Imbalanced Learning, Edge Inference, and Fairness
Balanced accuracy is the de facto standard for characterizing classifier quality on imbalanced datasets, as demonstrated empirically across domains such as log parsing, sentiment analysis, and URL filtering (Du et al., 2020). Empirical findings show that in highly imbalanced data, BA sharply contrasts with overall accuracy and aligns with improved performance on rare classes.
In recent task-oriented edge-device inference, balanced accuracy has inspired architectural innovations such as maximizing the minimum pair-wise discriminant gain rather than simply macro-averaging class-pair separations. In the AirComp setting, this has led to direct optimization schemes (e.g., SCA-based power allocation) that explicitly maximize the worst-case Mahalanobis-type separation between every pair of classes at the feature-aggregation stage (Jiao et al., 2024). This approach yields a substantial increase in both overall and balanced accuracy, especially boosting the per-class performance for the least-separable classes, as evidenced by up to 10 percentage point improvements in worst-class accuracy and 5–7 points in four-class BA on human motion benchmarks.
Fairness-motivated generalizations of balanced accuracy operate at the group level. Rather than only macro-averaging class recall across the full population, one can define “worst-group balanced accuracy” and tune classifiers (via group-aware or Pareto-optimal threshold calibration) to directly optimize both overall BA and worst-group BA. This approach outperforms synthetic augmentation (e.g., SMOTE, CT-GAN) and yields sharper trade-offs between global and subgroup-level robustness, as confirmed in financial and census datasets with protected attributes (Gittlin, 29 Aug 2025).
5. Computational Strategies and Extensions
Algorithmic strategies to optimize or calibrate balanced accuracy include:
- Group-aware threshold calibration: Per-group or per-class thresholding to maximize BA or worst-group BA across demographic or clinically relevant subgroups (Gittlin, 29 Aug 2025).
- Min-pairwise separation maximization: Using SCA (successive convex approximation) to maximize the minimum Mahalanobis distance between all class pairs at the feature aggregation stage, preventing class collapse and ensuring balanced inference accuracy (Jiao et al., 2024).
- Weighted or prioritized BA: Setting domain- or risk-based weights via importance scoring, rarity, or user-specified priorities for weighted balanced accuracy (Du et al., 2020, Cabitza et al., 2019).
- Harmony ensemble construction: Explicitly routing “weak” classes to complementary high-capacity models within an ensemble architecture to minimize inter-class accuracy deviation and improve BA (Kim et al., 2019).
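As a sketch of the first strategy, per-group threshold search on synthetic scores (the data generation and function names are hypothetical, not the procedure of the cited works):

```python
import random

def ba_binary(y, scores, thr):
    """Balanced accuracy of scores thresholded at thr (predict 1 if score >= thr)."""
    tp = sum(1 for t, s in zip(y, scores) if t == 1 and s >= thr)
    fn = sum(1 for t, s in zip(y, scores) if t == 1 and s < thr)
    tn = sum(1 for t, s in zip(y, scores) if t == 0 and s < thr)
    fp = sum(1 for t, s in zip(y, scores) if t == 0 and s >= thr)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2

def calibrate_per_group(groups):
    """Pick, for each group, the threshold that maximizes that group's BA."""
    best = {}
    for g, (y, s) in groups.items():
        best[g] = max(sorted(set(s)), key=lambda t: ba_binary(y, s, t))
    return best

# Synthetic groups: group "B" has weaker score separation than group "A",
# so a single global threshold would serve it poorly.
random.seed(0)
def make_group(n, shift):
    y = [int(random.random() < 0.3) for _ in range(n)]
    s = [random.gauss(shift if t else 0.0, 1.0) for t in y]
    return y, s

groups = {"A": make_group(500, 2.0), "B": make_group(500, 1.0)}
thresholds = calibrate_per_group(groups)
print(thresholds)
```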
Table: Metric Formulations and Extensions
| Metric | Multi-class Formula | Cost-weighted/Weighted Extension |
|---|---|---|
| Balanced Accuracy | $\text{BA} = \frac{1}{K}\sum_{k=1}^{K} r_k$ | $\text{WBA} = \sum_{k=1}^{K} w_k r_k$ |
| Balanced Error Rate (BER) | $1 - \text{BA}$ | $1 - \text{WBA}$ |
| Youden's $J$ | $J = 2 \times \text{BA} - 1$ | $J = 2 \times \text{WBA} - 1$ |

6. Relationship to ROC Analysis and AUC
Balanced accuracy is tightly connected to ROC analysis. The area under the ROC curve can be expressed as

$$\text{AUC} = \frac{1}{2}\biggl[\int_0^1 \text{TPR}(x)\,dx + \int_0^1 \text{TNR}(y)\,dy\biggr],$$

where $x$ ranges over the false positive rate and $y$ over the false negative rate, so that AUC can be read as an average of balanced accuracy across operating points (Carrington et al., 2021). Maximizing balanced accuracy (equivalently, Youden's $J$) is recommended for ROC operating point selection, particularly in fairness-sensitive or imbalanced applications (Collot et al., 8 Dec 2025, Gittlin, 29 Aug 2025).
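The identity $\text{AUC} = \frac{1}{2}\bigl[\int_0^1 \text{TPR}(x)\,dx + \int_0^1 \text{TNR}(y)\,dy\bigr]$ can be verified numerically; a toy sketch with made-up scores and trapezoidal integration:

```python
# Toy scores for the positive and negative classes (all values made up).
pos = [0.9, 0.8, 0.7, 0.55, 0.3]
neg = [0.6, 0.5, 0.4, 0.35, 0.2, 0.1]

# Rank-based AUC: P(score_pos > score_neg), ties counting 1/2.
auc_rank = sum((p > n) + 0.5 * (p == n)
               for p in pos for n in neg) / (len(pos) * len(neg))

# Trace the ROC curve by sweeping thresholds over the observed scores.
tpr, fpr = [0.0], [0.0]
for t in sorted(set(pos + neg), reverse=True):
    tpr.append(sum(p >= t for p in pos) / len(pos))
    fpr.append(sum(n >= t for n in neg) / len(neg))

def trapz(ys, xs):
    """Trapezoidal integral of y over x (signed)."""
    return sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

int_tpr_dfpr = trapz(tpr, fpr)                  # ∫ TPR d(FPR)
fnr = [1 - t for t in tpr]
tnr = [1 - f for f in fpr]
int_tnr_dfnr = trapz(tnr[::-1], fnr[::-1])      # ∫ TNR d(FNR)

print(abs(auc_rank - 0.5 * (int_tpr_dfpr + int_tnr_dfnr)) < 1e-12)  # True
```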
Partial and groupwise BA can be defined by restricting attention to specific subregions of the ROC space, yielding normalized (concordant) partial AUCs, which correspond to average BA within bands of TPR/FPR or within risk-defined subpopulations.

7. Generalizations and Domain-Specific Metrics
Balanced accuracy is a foundational metric but may be insufficient when utility or complexity varies between classes or cases. H-accuracy ($Ha$) is a generalization incorporating user-defined class priorities, sample-level difficulty (complexity), and model confidence thresholds. $Ha$ subsumes BA as a special case (uniform weights, trivial complexity, zero confidence penalty) and interpolates between accuracy-oriented and utility-oriented evaluation (Cabitza et al., 2019). In edge-device co-inference and over-the-air computation, BA inspires the design of surrogate metrics such as minimum pair-wise discriminant gain, allowing optimization of feature-aggregation processes that ensure no class pair becomes confounded, in contrast to average-case metrics that can obscure performance on minority or boundary classes (Jiao et al., 2024).