Balanced Accuracy Metric
- Balanced accuracy is a metric defined as the average of class-wise recall, ensuring that each class is equally weighted regardless of its prevalence.
- It generalizes to multi-class scenarios by computing the macro-average of recalls and addresses the pitfalls of raw accuracy in imbalanced datasets.
- Extensions such as weighted balanced accuracy and group-aware threshold calibration enhance its applicability in fairness, medical diagnostics, and edge-device inference.
Balanced accuracy is a classification performance metric that quantifies the average per-class recall (sensitivity), weighting each class equally regardless of prevalence. It is specifically designed to address the pitfalls of raw accuracy in the presence of class imbalance, providing a robust, prevalence-independent measure for comparing classifiers across diverse application domains including imbalanced learning, medical diagnostics, calibrated decision systems, and edge-device inference. Balanced accuracy and closely related notions arise in both binary and multi-class settings, in threshold selection, and as the limiting case of certain ROC or cost-based analyses.
1. Mathematical Definition and Statistical Foundations
For a binary classifier with true positive rate (TPR or sensitivity) and true negative rate (TNR or specificity), balanced accuracy (BA) is formally given by

$$\text{BA} = \frac{\text{TPR} + \text{TNR}}{2} = \frac{r_{+} + r_{-}}{2},$$

where $r_{+} = \text{TPR}$ and $r_{-} = \text{TNR}$ are the recalls for the positive and negative classes, respectively (Collot et al., 8 Dec 2025, Ferrer, 2022, Gittlin, 29 Aug 2025, Du et al., 2020, Carrington et al., 2021, Cabitza et al., 2019).
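A minimal sketch of the binary formula in plain Python (the confusion-matrix counts are illustrative):

```python
def balanced_accuracy_binary(tp, fn, tn, fp):
    """Balanced accuracy from binary confusion-matrix counts."""
    tpr = tp / (tp + fn)  # sensitivity: recall on the positive class
    tnr = tn / (tn + fp)  # specificity: recall on the negative class
    return (tpr + tnr) / 2

# 90 true positives in total (80 caught), 910 negatives (500 correctly rejected)
ba = balanced_accuracy_binary(tp=80, fn=10, tn=500, fp=410)
print(round(ba, 4))  # 0.7192
```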
In the multi-class setting with $K$ classes, let $c_k$ be the number of correctly predicted examples in class $k$ and $n_k$ the total number of true examples in class $k$. The per-class accuracy (recall) is $r_k = c_k / n_k$, and the balanced accuracy generalizes to

$$\text{BA} = \frac{1}{K} \sum_{k=1}^{K} r_k = \frac{1}{K} \sum_{k=1}^{K} \frac{c_k}{n_k}.$$

This is a macro-average of class-wise recall and thus naturally extends to problems with $K = 2$ or $K > 2$ classes (Du et al., 2020, Ferrer, 2022, Cabitza et al., 2019).
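The macro-average can be computed directly from label lists; a toy sketch using only the standard library:

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Macro-average of per-class recall: BA = (1/K) * sum_k c_k / n_k."""
    n = Counter(y_true)                                         # n_k: true examples per class
    c = Counter(t for t, p in zip(y_true, y_pred) if t == p)    # c_k: correct per class
    return sum(c[k] / n[k] for k in n) / len(n)

y_true = ["a", "a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "c", "c"]
# per-class recalls: a = 3/4, b = 1/2, c = 1  ->  BA = (0.75 + 0.5 + 1) / 3
print(balanced_accuracy(y_true, y_pred))  # 0.75
```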
A related concept is the balanced error rate:

$$\text{BER} = 1 - \text{BA} = \frac{1}{K} \sum_{k=1}^{K} (1 - r_k).$$

This symmetry ensures that a trivial majority classifier achieves a lower bound of $\text{BA} = 1/2$ in the binary case.
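A toy check of this chance-level bound, assuming a 95/5 class split and a classifier that always emits the majority label:

```python
# Majority-class classifier: raw accuracy equals the majority prevalence,
# but balanced accuracy collapses to the chance level of 1/2.
y_true = [1] * 950 + [0] * 50   # 95% positive prevalence
y_pred = [1] * 1000             # always predict the majority class

acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tpr = 1.0                       # every positive is "caught"
tnr = 0.0                       # no negative is ever predicted
ba = (tpr + tnr) / 2

print(acc, ba)  # 0.95 0.5
```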
2. Connections to Youden’s J Statistic and Expected Cost
Balanced accuracy is closely linked to Youden's $J$ statistic:

$$J = \text{TPR} + \text{TNR} - 1 = 2 \times \text{BA} - 1,$$

with

$$\text{BA} = \frac{J + 1}{2}.$$

This monotonic relationship means optimizing $J$ or BA yields identical classifier rankings. Theoretical arguments show that BA (or $J$) is the correct metric for tasks where the aim is to preserve prevalence differences between models, independent of class ratios. BA is also the slope by which differences in underlying class prevalence are preserved in output prevalence estimates, a property critical for fair evaluation in prevalence estimation and LLM judge selection (Collot et al., 8 Dec 2025).
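The ranking equivalence follows from $J = 2\,\text{BA} - 1$ being strictly increasing in BA; a sketch with made-up (TPR, TNR) pairs:

```python
# Three hypothetical classifiers with illustrative (TPR, TNR) operating points.
models = {"m1": (0.90, 0.60), "m2": (0.70, 0.85), "m3": (0.95, 0.50)}

ba = {name: (tpr + tnr) / 2 for name, (tpr, tnr) in models.items()}
j  = {name: tpr + tnr - 1   for name, (tpr, tnr) in models.items()}

# Sorting by BA or by Youden's J produces the same classifier ranking.
rank_ba = sorted(ba, key=ba.get, reverse=True)
rank_j  = sorted(j,  key=j.get,  reverse=True)
print(rank_ba == rank_j)  # True
```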
Furthermore, balanced accuracy can be viewed as a special case of the expected cost (EC) for classification when the cost structure assigns $c_{ij} = \frac{1}{K\,p_i}$ for $i \neq j$ and zero on the diagonal (where $p_i$ is the prevalence of class $i$), thus penalizing errors on all classes equally regardless of frequency (Ferrer, 2022).
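A numerical sanity check of this equivalence on toy labels (the data are illustrative, not from the cited work):

```python
from collections import Counter

# With off-diagonal costs c_ij = 1/(K * p_i) and zero on the diagonal,
# the expected cost equals the balanced error rate 1 - BA.
y_true = ["a"] * 6 + ["b"] * 3 + ["c"]
y_pred = ["a", "a", "a", "a", "b", "c", "b", "b", "a", "a"]

classes = sorted(set(y_true))
K, N = len(classes), len(y_true)
n = Counter(y_true)                          # n_k: true counts per class
prev = {k: n[k] / N for k in classes}        # p_k: class prevalences

# Expected cost: average cost over samples, zero for correct predictions.
ec = sum(0.0 if t == q else 1 / (K * prev[t])
         for t, q in zip(y_true, y_pred)) / N

c = Counter(t for t, q in zip(y_true, y_pred) if t == q)  # correct per class
ba = sum(c[k] / n[k] for k in classes) / K

print(abs(ec - (1 - ba)) < 1e-12)  # True
```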
3. Comparative Behavior and Limitations
Unlike standard (unbalanced) accuracy,

$$A = \frac{\sum_{k=1}^{K} c_k}{\sum_{k=1}^{K} n_k},$$
which heavily weights common classes, balanced accuracy is prevalence-independent and gives equal influence to each class during macro-averaging. This circumvents the pathologies where trivial or majority-class classifiers appear performant. Balanced accuracy is also label-symmetric and avoids the necessity of designating a positive class, unlike F1 or precision.
However, BA presumes that error rates across all classes are of equal utility or cost, a potential mismatch for many real-world domains (e.g., critical diseases, high-risk fraud detection) (Du et al., 2020, Ferrer, 2022). It also considers only class recall; thus, overpredicting a class may artificially inflate BA in the multi-class case even when the decision process is poorly calibrated or imbalanced with regard to false positives.
To address these issues, Weighted Balanced Accuracy (WBA) introduces class-specific weights $w_k$ (with $\sum_k w_k = 1$), yielding:

$$\text{WBA} = \sum_{k=1}^{K} w_k\, r_k.$$
Weighting schemes can reflect rarity, cost, or any multi-criteria composite deemed relevant (Du et al., 2020, Cabitza et al., 2019).
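A sketch of WBA with illustrative weights (the class names, recalls, and weights are hypothetical):

```python
# Class-specific weights encode rarity or misclassification cost; here the
# rare class is up-weighted, so its poor recall dominates the weighted score.
recalls = {"common": 0.95, "rare_disease": 0.60}
weights = {"common": 0.20, "rare_disease": 0.80}   # must sum to 1

wba = sum(weights[k] * recalls[k] for k in recalls)
ba = sum(recalls.values()) / len(recalls)          # unweighted BA for contrast

print(round(ba, 3), round(wba, 3))  # 0.775 0.67
```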
4. Role in Imbalanced Learning, Edge Inference, and Fairness
Balanced accuracy is the de facto standard for characterizing classifier quality on imbalanced datasets, as demonstrated empirically across domains such as log parsing, sentiment analysis, and URL filtering (Du et al., 2020). Empirical findings show that in highly imbalanced data, BA sharply contrasts with overall accuracy and aligns with improved performance on rare classes.
In recent task-oriented edge-device inference, balanced accuracy has inspired architectural innovations such as maximizing the minimum pair-wise discriminant gain rather than simply macro-averaging class-pair separations. In the AirComp setting, this has led to direct optimization schemes (e.g., SCA-based power allocation) that explicitly maximize the worst-case Mahalanobis-type separation between every pair of classes at the feature-aggregation stage (Jiao et al., 2024). This approach yields a substantial increase in both overall and balanced accuracy, especially boosting the per-class performance for the least-separable classes, as evidenced by up to 10 percentage point improvements in worst-class accuracy and 5–7 points in four-class BA on human motion benchmarks.
Fairness-motivated generalizations of balanced accuracy operate at the group level. Rather than only macro-averaging class recall across the full population, one can define “worst-group balanced accuracy” and tune classifiers (via group-aware or Pareto-optimal threshold calibration) to directly optimize both overall BA and worst-group BA. This approach outperforms synthetic augmentation (e.g., SMOTE, CT-GAN) and yields sharper trade-offs between global and subgroup-level robustness, as confirmed in financial and census datasets with protected attributes (Gittlin, 29 Aug 2025).
5. Computational Strategies and Extensions
Algorithmic strategies to optimize or calibrate balanced accuracy include:
- Group-aware threshold calibration: Per-group or per-class thresholding to maximize BA or worst-group BA across demographic or clinically relevant subgroups (Gittlin, 29 Aug 2025).
- Min-pairwise separation maximization: Using SCA (successive convex approximation) to maximize the minimum Mahalanobis distance between all class pairs at the feature aggregation stage, preventing class collapse and ensuring balanced inference accuracy (Jiao et al., 2024).
- Weighted or prioritized BA: Setting domain- or risk-based weights via importance scoring, rarity, or user-specified priorities for weighted balanced accuracy (Du et al., 2020, Cabitza et al., 2019).
- Harmony ensemble construction: Explicitly routing “weak” classes to complementary high-capacity models within an ensemble architecture to minimize inter-class accuracy deviation and improve BA (Kim et al., 2019).
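As a sketch of the first strategy, per-group threshold search on synthetic scores (the data generation and function names are hypothetical, not the procedure of the cited works):

```python
import random

def ba_binary(y, scores, thr):
    """Balanced accuracy of scores thresholded at thr (predict 1 if score >= thr)."""
    tp = sum(1 for t, s in zip(y, scores) if t == 1 and s >= thr)
    fn = sum(1 for t, s in zip(y, scores) if t == 1 and s < thr)
    tn = sum(1 for t, s in zip(y, scores) if t == 0 and s < thr)
    fp = sum(1 for t, s in zip(y, scores) if t == 0 and s >= thr)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return (tpr + tnr) / 2

def calibrate_per_group(groups):
    """Pick, for each group, the threshold that maximizes that group's BA."""
    best = {}
    for g, (y, s) in groups.items():
        best[g] = max(sorted(set(s)), key=lambda t: ba_binary(y, s, t))
    return best

# Synthetic groups: group "B" has weaker score separation than group "A",
# so a single global threshold would serve it poorly.
random.seed(0)
def make_group(n, shift):
    y = [int(random.random() < 0.3) for _ in range(n)]
    s = [random.gauss(shift if t else 0.0, 1.0) for t in y]
    return y, s

groups = {"A": make_group(500, 2.0), "B": make_group(500, 1.0)}
thresholds = calibrate_per_group(groups)
print(thresholds)
```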
Table: Metric Formulations and Extensions
| Metric | Multi-class Formula | Cost-weighted/Weighted Extension |
|---|---|---|
| Balanced Accuracy | $\text{BA} = \frac{1}{K}\sum_{k=1}^{K} r_k$ | $\text{WBA} = \sum_{k=1}^{K} w_k r_k$ |
| Balanced Error Rate (BER) | $1 - \text{BA}$ | $1 - \text{WBA}$ |
| Youden's $J$ | $J = 2 \times \text{BA} - 1$ | $J = 2 \times \text{WBA} - 1$ |

6. Relationship to ROC Analysis and AUC
Balanced accuracy is tightly connected to ROC analysis. The area under the ROC curve can be expressed as

$$\text{AUC} = \frac{1}{2}\biggl[\int_0^1 \text{TPR}(x)\,dx + \int_0^1 \text{TNR}(y)\,dy\biggr],$$

where $x$ ranges over the false positive rate and $y$ over the false negative rate, so that AUC can be read as an average of balanced accuracy across operating points (Carrington et al., 2021). Maximizing balanced accuracy (equivalently, Youden's $J$) is recommended for ROC operating point selection, particularly in fairness-sensitive or imbalanced applications (Collot et al., 8 Dec 2025, Gittlin, 29 Aug 2025).
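The identity $\text{AUC} = \frac{1}{2}\bigl[\int_0^1 \text{TPR}(x)\,dx + \int_0^1 \text{TNR}(y)\,dy\bigr]$ can be verified numerically; a toy sketch with made-up scores and trapezoidal integration:

```python
# Toy scores for the positive and negative classes (all values made up).
pos = [0.9, 0.8, 0.7, 0.55, 0.3]
neg = [0.6, 0.5, 0.4, 0.35, 0.2, 0.1]

# Rank-based AUC: P(score_pos > score_neg), ties counting 1/2.
auc_rank = sum((p > n) + 0.5 * (p == n)
               for p in pos for n in neg) / (len(pos) * len(neg))

# Trace the ROC curve by sweeping thresholds over the observed scores.
tpr, fpr = [0.0], [0.0]
for t in sorted(set(pos + neg), reverse=True):
    tpr.append(sum(p >= t for p in pos) / len(pos))
    fpr.append(sum(n >= t for n in neg) / len(neg))

def trapz(ys, xs):
    """Trapezoidal integral of y over x (signed)."""
    return sum((ys[i] + ys[i + 1]) / 2 * (xs[i + 1] - xs[i])
               for i in range(len(xs) - 1))

int_tpr_dfpr = trapz(tpr, fpr)                  # ∫ TPR d(FPR)
fnr = [1 - t for t in tpr]
tnr = [1 - f for f in fpr]
int_tnr_dfnr = trapz(tnr[::-1], fnr[::-1])      # ∫ TNR d(FNR)

print(abs(auc_rank - 0.5 * (int_tpr_dfpr + int_tnr_dfnr)) < 1e-12)  # True
```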
Partial and groupwise BA can be defined by restricting attention to specific subregions of the ROC space, yielding normalized (concordant) partial AUCs, which correspond to average BA within bands of TPR/FPR or within risk-defined subpopulations.

7. Generalizations and Domain-Specific Metrics
Balanced accuracy is a foundational metric but may be insufficient when utility or complexity varies between classes or cases. H-accuracy ($Ha$) is a generalization incorporating user-defined class priorities, sample-level difficulty (complexity), and model confidence thresholds. $Ha$ subsumes BA as a special case (uniform weights, trivial complexity, zero confidence penalty) and interpolates between accuracy-oriented and utility-oriented evaluation (Cabitza et al., 2019). In edge-device co-inference and over-the-air computation, BA inspires the design of surrogate metrics such as minimum pair-wise discriminant gain, allowing optimization of feature-aggregation processes that ensure no class pair becomes confounded, in contrast to average-case metrics that can obscure performance on minority or boundary classes (Jiao et al., 2024).