Worst Accuracy (WAcc) Metrics
- Worst Accuracy is a metric that defines the minimal accuracy across subpopulations, exposing vulnerabilities masked by high aggregate performance.
- Bounds on it can be derived via permutation-based worst likely assignments and robust scoring functions, yielding statistically meaningful error guarantees even in adversarial or low-data regimes.
- The concept is critical for fairness, class imbalance correction, and safety-critical evaluations, with recent work exploring spectral regularization and distributionally robust optimization.
Worst Accuracy (WAcc) quantifies the minimal or most adverse accuracy achieved by a classifier or predictive system over some set of subpopulations, tasks, classes, or under worst-case scenarios. As a performance metric, WAcc is specifically designed to expose systematic vulnerabilities and hidden weaknesses that can be masked by high average or aggregate accuracy. In recent literature, WAcc has gained prominence as a critical measure for fairness, safety, certified robustness, small-sample error control, adversarial resilience, and risk-sensitive applications.
1. Mathematical Definition and Problem Formulation
Worst Accuracy (WAcc) is generally defined on the basis of subgroups: the worst-case performance across classes, groups, or adverse sample assignments. Given a partition of the dataset (e.g., classes in multi-class classification, demographic groups, or synthesized subpopulations), let $\mathrm{Acc}_g$ denote the accuracy on group $g$. Then:

$$\mathrm{WAcc} \;=\; \min_{g \in \mathcal{G}} \mathrm{Acc}_g,$$

where $\mathcal{G}$ is the set of all groups, classes, or tasks. In robust generalization frameworks and permutation-based validation, the definition may instead relate to the worst-case accuracy across "likely" assignments or adversarially defined input distributions.
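As a minimal illustration of this definition, the sketch below computes worst-group accuracy from NumPy arrays of true labels, predictions, and group ids (the function and variable names are illustrative):

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Minimum per-group accuracy (WAcc) over the groups present in `groups`."""
    accs = []
    for g in np.unique(groups):
        mask = groups == g
        accs.append(np.mean(y_pred[mask] == y_true[mask]))
    return min(accs)

# Aggregate accuracy is 0.9, but WAcc exposes the weaker group.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])
groups = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print(worst_group_accuracy(y_true, y_pred, groups))  # 0.75 (group 0)
```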
In the setting of worst likely assignments, the maximum error rate (hence, minimal accuracy) is computed over all plausible assignments of working-example labels that pass permutation tests, yielding a statistical guarantee:

$$\mathrm{WAcc} \;=\; 1 - \max_{a \in \mathcal{A}} \mathrm{err}(a),$$

where $\mathcal{A}$ is the set of likely assignments defined via permutation testing, and $\mathrm{err}(a)$ is the error rate for assignment $a$ (Bax, 2015).
2. Worst Likely Assignments and Permutation-Based Error Bounds
Worst likely assignment (WLA) error bounds are a principled approach from permutation-test theory to control worst-case accuracy for classifiers, especially with limited data (Bax, 2015). The method constructs a set $\mathcal{A}$ of likely labelings for the unknown labels (the "working examples"), determined by a scoring function and a permutation-based ranking. Assignments are retained in $\mathcal{A}$ if their score is sufficiently typical (i.e., not outlying under permutations). The key error guarantee is that, with high probability, the true labeling belongs to $\mathcal{A}$, so the classifier's true error rate on the working examples is at most $\max_{a \in \mathcal{A}} \mathrm{err}(a)$.
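A schematic sketch of this retain-then-maximize structure follows; the function and argument names are hypothetical, and the permutation test is abstracted into an `is_likely` predicate that keeps an assignment when its score is typical under permutations:

```python
import numpy as np
from itertools import product

def wla_error_bound(classifier, X_work, n_classes, is_likely):
    """Worst-likely-assignment bound: maximum error rate of the classifier over
    all label assignments retained by the permutation test `is_likely`.
    Exhaustive enumeration, so only feasible for very small working sets."""
    preds = classifier(X_work)
    worst_err = 0.0
    for assignment in product(range(n_classes), repeat=len(X_work)):
        a = np.array(assignment)
        if is_likely(X_work, a):               # score not outlying under permutations
            worst_err = max(worst_err, float(np.mean(preds != a)))
    return worst_err                            # so WAcc >= 1 - worst_err at the test's confidence level
```

In practice the enumeration is restricted or sampled; the sketch only illustrates the structure of retaining likely assignments and maximizing error over them.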
A near neighbor scoring function (NNSF) is introduced to robustify the test: it aggregates discrepancies over the $k$ nearest training neighbors of each working example, with geometrically decaying weight $\gamma^{\,j-1}$ ($0 < \gamma < 1$) on the $j$-th nearest neighbor:

$$ s(a) \;=\; \sum_{x \in W} \sum_{j=1}^{k} \gamma^{\,j-1}\, \mathbb{1}\!\left[\, a(x) \neq y_{(j)}(x) \,\right], $$

where $W$ is the set of working examples and $y_{(j)}(x)$ is the label of the $j$-th nearest training example to $x$.
The NNSF tightens error bounds compared to traditional 1-nearest-neighbor scoring (ESF), especially when the classifier is highly accurate and local class consistency is strong. Empirical evidence shows that for accurate classifiers and structured data, worst-case error bounds can closely track the actual error, even with only 100 examples.
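A minimal sketch of a scoring function of this flavor is shown below; the decay factor `gamma`, the Euclidean neighbor search, and the 0/1 discrepancy are illustrative assumptions rather than the exact construction of Bax (2015):

```python
import numpy as np

def nn_score(X_train, y_train, X_work, assignment, k=5, gamma=0.5):
    """Score an assignment by its disagreement with the k nearest training
    neighbors of each working example, weighting the j-th neighbor by gamma**(j-1)."""
    score = 0.0
    for x, label in zip(X_work, assignment):
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        for j, idx in enumerate(nearest):                  # j = 0 is the closest neighbor
            score += (gamma ** j) * float(y_train[idx] != label)
    return score
```

With `k=1` this reduces to the single-nearest-neighbor (ESF) score that the NNSF is designed to tighten.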
3. Data Structure, Imbalance, and the Geometry of Worst Accuracy
Worst-case accuracy is critically affected by data geometry and imbalance. For instance, in imbalanced classification, the worst-group error (one minus worst accuracy) often increases with the dominance of majority groups due to heavy tails in the data distribution. Extreme value theory reveals that maxima from large groups "outrun" those of minority groups, distorting margin-based classifiers and increasing bias (Chaudhuri et al., 2022).
Subsampling the majority group to enforce group balance restores geometric symmetry and reduces worst-group error (thus increasing WAcc). For linearly separable settings, the worst-group error for ERM admits a lower bound expressed through an extreme-value tail function, the data CDF, and the majority and minority group sizes; the bound's slack term becomes small when the groups are balanced (Chaudhuri et al., 2022).
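A minimal sketch of the balancing step, assuming a NumPy array of group labels (names are illustrative):

```python
import numpy as np

def group_balanced_subsample(groups, seed=0):
    """Return indices that subsample every group down to the smallest group's
    size, enforcing the group balance discussed above."""
    rng = np.random.default_rng(seed)
    unique, counts = np.unique(groups, return_counts=True)
    n_min = counts.min()
    idx = [rng.choice(np.where(groups == g)[0], size=n_min, replace=False) for g in unique]
    return np.concatenate(idx)
```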
4. Robustness, Fairness, and Group DRO
Distributionally robust optimization (DRO) frameworks elevate WAcc to a primary performance metric, particularly for fairness and resistance to spurious correlations. In group-DRO, the training objective is

$$ \hat{\theta}_{\mathrm{DRO}} \;=\; \arg\min_{\theta} \; \max_{g \in \mathcal{G}} \; \hat{L}_g(\theta), $$

where $\hat{L}_g$ is the group-specific (empirical) loss (Sagawa et al., 2019). Strong regularization (e.g., large penalties or early stopping) is necessary to prevent overparameterized models from driving all group losses to zero on the training set and losing groupwise discrimination at test time. With correct regularization, group DRO can yield 10–40 percentage point improvements in worst-group accuracy over ERM, particularly in settings with spurious correlations where minority groups suffer excessive error (Sagawa et al., 2019).
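A schematic PyTorch training step for this objective, using exponentiated-gradient group weights as a soft stand-in for the inner max (the hyperparameters and names are illustrative; the strong regularization is assumed to come from weight decay or early stopping configured outside this function):

```python
import torch

def group_dro_step(model, loss_fn, optimizer, x, y, g, q, n_groups, eta_q=0.01):
    """One group-DRO update: exponentiated-gradient weights q softly realize the
    inner max over groups, and the q-weighted loss is minimized."""
    per_sample = loss_fn(model(x), y)                       # loss_fn with reduction='none'
    group_losses = torch.stack([
        per_sample[g == grp].mean() if (g == grp).any() else torch.tensor(0.0)
        for grp in range(n_groups)
    ])
    q = q * torch.exp(eta_q * group_losses.detach())        # up-weight the currently worst groups
    q = q / q.sum()
    robust_loss = (q * group_losses).sum()                  # soft version of max_g L_g(theta)
    optimizer.zero_grad()
    robust_loss.backward()
    optimizer.step()
    return q                                                # carry q to the next step
```

The weight vector `q` is initialized uniformly and carried from step to step, so groups with persistently high loss accumulate weight.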
Extensions for weak group supervision (limited group annotations) have been devised, such as the SSA pseudo-attribute approach, which generates pseudo group labels for unlabeled data and uses group-DRO with adaptive thresholds to recover nearly the same WAcc as full group annotation while using group labels for only 0.6–1.5% of the data (Nam et al., 2022).
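The general recipe can be sketched as follows; this is a generic confidence-thresholded pseudo-labeling step (a scikit-learn-style `predict_proba` is assumed), not the exact SSA algorithm or its adaptive thresholding rule:

```python
import numpy as np

def pseudo_group_labels(group_clf, X_unlabeled, threshold=0.9):
    """Assign pseudo group labels to unlabeled examples when the group
    classifier (trained on the small group-annotated split) is confident;
    low-confidence examples are marked -1 and left out of group-DRO."""
    proba = group_clf.predict_proba(X_unlabeled)
    conf = proba.max(axis=1)
    pseudo = proba.argmax(axis=1)
    pseudo[conf < threshold] = -1
    return pseudo
```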
5. Worst-Class Certification and Spectral Regularization
In certified robustness, worst-class accuracy (the minimal certified robust accuracy across classes) is directly tied to the spectral norm (largest eigenvalue) of the confusion matrix of smoothed classifiers. A PAC-Bayesian analysis upper-bounds the worst-class certified error in terms of $\lambda_{\max}(\hat{C})$, the largest eigenvalue of the smoothed empirical confusion matrix $\hat{C}$, plus terms that encapsulate network capacity (Jin et al., 21 Mar 2025). Principal eigenvalue regularization, which minimizes $\lambda_{\max}(\hat{C})$ through a differentiable surrogate, improves the uniformity of certified robustness and raises WAcc (worst-class certified accuracy) without sacrificing overall robustness.
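As a rough illustration, the penalty below builds a soft (differentiable) confusion matrix from softmax outputs and approximates its principal eigenvalue by power iteration; this is a generic surrogate sketch, not the estimator or certification machinery of Jin et al.:

```python
import torch
import torch.nn.functional as F

def principal_eigenvalue_penalty(logits, targets, n_classes, coef=0.1, iters=20):
    """Differentiable surrogate for the largest eigenvalue of a soft confusion
    matrix, computed by power iteration on its symmetrization."""
    probs = F.softmax(logits, dim=1)                         # (batch, n_classes)
    onehot = F.one_hot(targets, n_classes).float()
    counts = onehot.sum(dim=0).clamp(min=1.0)
    confusion = (onehot.t() @ probs) / counts.unsqueeze(1)   # soft, row-normalized confusion matrix
    sym = 0.5 * (confusion + confusion.t())                  # symmetrize for power iteration
    v = torch.ones(n_classes) / n_classes ** 0.5
    for _ in range(iters):
        v = sym @ v
        v = v / (v.norm() + 1e-12)
    lam_max = v @ sym @ v                                    # Rayleigh quotient ~ largest eigenvalue
    return coef * lam_max
```

The returned penalty is added to the training loss, discouraging confusion mass from concentrating on a few classes.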
6. Robustness to Worst-Case Noise, Adversaries, and Device Variations
WAcc is central in safety-critical applications under adversarial perturbations or hardware variability. In adversarial training methods for deep neural networks, min–max objectives (minimizing the maximal class-wise robust risk) are solved with no-regret dynamics or distributionally robust algorithms (Li et al., 2023, Pethick et al., 2023). Empirical findings confirm that the worst-class robust accuracy can be explicitly boosted at only a minor cost in average accuracy, a trade-off quantified by new metrics such as ρ, which tracks the gain in worst-class robustness against the drop in overall accuracy (Li et al., 2023).
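The min–max structure can be sketched in the same exponentiated-gradient style as the group-DRO step above; this is a generic class-reweighted adversarial loss (the `attack` callable, e.g. a PGD routine, is an assumed input), not the exact algorithms of Li et al. or Pethick et al.:

```python
import torch

def worst_class_adv_loss(model, attack, loss_fn, x, y, p, n_classes, eta=0.05):
    """Class-weighted robust loss: adversarial examples are generated first,
    then no-regret (exponentiated-gradient) weights p emphasize the classes
    with the highest robust risk."""
    x_adv = attack(model, x, y)                           # adversarial perturbation of the batch
    per_sample = loss_fn(model(x_adv), y)                 # loss_fn with reduction='none'
    class_losses = torch.stack([
        per_sample[y == c].mean() if (y == c).any() else torch.tensor(0.0)
        for c in range(n_classes)
    ])
    p = p * torch.exp(eta * class_losses.detach())        # shift weight toward the worst classes
    p = p / p.sum()
    return (p * class_losses).sum(), p
```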
In neural network accelerators based on compute-in-memory, worst-case accuracy is evaluated as the minimal accuracy under the strongest admissible hardware perturbations (bounded weight deviations). Standard approaches focused on average robustness fail to address rare, catastrophic cases. Adversarial training with right-censored noise and rapid worst-case identification (A-TRICE) improves WAcc by up to 33%, supporting the critical need for explicit worst-case analysis to ensure safe deployment (Yan et al., 2023).
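A minimal sketch of worst-case evaluation under bounded weight deviations, using signed gradient ascent on additive weight perturbations via `torch.func.functional_call` (PyTorch 2.x); this is a generic worst-case search, not the A-TRICE training procedure:

```python
import torch
from torch.func import functional_call

def worst_case_weight_perturbation(model, x, y, eps=0.05, steps=10, lr=0.01):
    """Search for an admissible weight deviation (bounded by eps) that maximizes
    the loss, then report accuracy under that perturbation."""
    params = dict(model.named_parameters())
    deltas = {k: torch.zeros_like(v, requires_grad=True) for k, v in params.items()}
    for _ in range(steps):
        perturbed = {k: params[k] + deltas[k] for k in params}
        loss = torch.nn.functional.cross_entropy(functional_call(model, perturbed, (x,)), y)
        grads = torch.autograd.grad(loss, list(deltas.values()))
        with torch.no_grad():
            for d, g in zip(deltas.values(), grads):
                d += lr * g.sign()                      # ascend the loss
                d.clamp_(-eps, eps)                     # keep weight deviations bounded
    with torch.no_grad():
        perturbed = {k: params[k] + deltas[k] for k in params}
        preds = functional_call(model, perturbed, (x,)).argmax(dim=1)
    return (preds == y).float().mean().item()           # accuracy under the found worst case
```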
7. Practical Implications and Applications
- Small data regimes: Permutation-based worst likely assignment bounds enable non-asymptotic, certifiable WAcc with very limited data (Bax, 2015).
- Fairness and bias: WAcc is the operational metric for detecting and correcting fairness violations, ensuring no subgroup is systematically underserved (Idrissi et al., 2021, Du et al., 2023).
- Long-tailed learning and class imbalance: A focus on worst-class recall (worst accuracy) leads to geometric/harmonic mean loss objectives and plug-in ensemble methods that raise performance for the most underrepresented categories (Du et al., 2023).
- Multitask learning: Worst-case-aware regularization and lookahead optimization address failure modes of naïve DRO, stabilizing multitask generalization across tasks (Michel et al., 2021).
- Distribution shift and downstream decision loss: Hierarchical modeling and submodular optimization frameworks reveal that worst-case risks measured with respect to downstream decisions can diverge from those identified via standard accuracy, underscoring the context-dependent relevance of WAcc (Ren et al., 4 Jul 2024).
8. Limitations and Open Directions
WAcc is a strong performance criterion but can present several tradeoffs:
- Maximizing WAcc may degrade average-case accuracy in certain adversarial or tail-sensitive optimization regimes (Li et al., 2023).
- The practical computation of WAcc can be challenging when group or class structure is latent, annotations are incomplete, or the worst-case scenario is combinatorial (see resource allocation under distribution shift (Ren et al., 4 Jul 2024)).
- Optimizing WAcc in deep or highly nonlinear models remains an area of ongoing methodological research, especially for settings where geometry, distribution tails, or adversarial shifts critically affect minimal performance (Chaudhuri et al., 2022).
Ongoing work explores further integration of spectral regularization, structured data modeling, and efficient submodular optimization to make worst-case accuracy both a central and tractable goal in robust machine learning and decision-theoretic applications.