Worst-Group Optimization in ML

Updated 9 February 2026

Worst-group optimization is a method that explicitly minimizes the maximum loss across groups, ensuring robust performance for minority or under-represented populations.
Methodologies include data balancing strategies like subsampling (SUBG) and reweighting (RWG) as well as group-DRO to counteract spurious correlations.
Empirical studies show that well-tuned group balancing techniques can substantially raise worst-group accuracy while remaining computationally cheap and, in the case of subsampling, relatively insensitive to hyperparameters.

Worst-group optimization refers to the family of machine learning and optimization methodologies that explicitly target the performance of the worst-performing subpopulation or group in a given data distribution. This approach is motivated by the observation that conventional empirical risk minimization (ERM), which optimizes average accuracy, can yield models that systematically underperform on minority groups when the data exhibits distribution shift or severe group imbalance. Worst-group optimization aims to protect such weakest links in the data, producing models with strong robustness properties across heterogeneous populations.

1. Mathematical Formulation and Problem Scope

Let the training set consist of $n$ labeled examples $(x_i, y_i, a_i)$, where $x_i \in \mathcal X$ is the input, $y_i \in \mathcal Y$ is the class label, and $a_i \in \mathcal A$ is an attribute, often associated with a spurious or confounding factor. Define groups $G = \mathcal Y \times \mathcal A$; each group $g = (y, a)$ aggregates all examples of class $y$ and attribute $a$. Denote the set of group-specific empirical distributions as $\{P_g\}_{g \in G}$.

For a model $f_\theta : \mathcal X \to \mathcal Y$ with parameters $\theta$, the worst-group (WG) loss is

$L_{\text{WG}}(\theta) = \max_{g \in G} \mathbb{E}_{(x, y) \sim P_g} \ell(f_\theta(x), y)$

with $\ell$ a per-sample loss (e.g., cross-entropy). The primary objective is to minimize $L_{\text{WG}}(\theta)$. In classification tasks, performance is often reported as worst-group accuracy: $\text{Acc}_{\text{WG}}(\theta) = \min_{g\in G}\;\mathbb{P}_{(x, y) \sim P_g}\big[f_\theta(x) = y\big]$.
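The two quantities defined above can be computed directly from per-sample results. The following is a minimal NumPy sketch; the helper names (`group_ids`, `worst_group_loss`, `worst_group_accuracy`) and the toy data are illustrative assumptions, not part of any cited implementation. The toy predictor simply outputs the attribute $a$, mimicking a model that relies on a spurious feature.

```python
import numpy as np

def group_ids(y, a, n_attrs):
    """Map each (class, attribute) pair to one integer group index."""
    return np.asarray(y) * n_attrs + np.asarray(a)

def worst_group_loss(losses, groups):
    """Maximum over groups of the mean per-sample loss within the group."""
    losses = np.asarray(losses, dtype=float)
    groups = np.asarray(groups)
    return max(losses[groups == g].mean() for g in np.unique(groups))

def worst_group_accuracy(y_pred, y_true, groups):
    """Minimum over groups of the within-group accuracy."""
    correct = (np.asarray(y_pred) == np.asarray(y_true)).astype(float)
    groups = np.asarray(groups)
    return min(correct[groups == g].mean() for g in np.unique(groups))

# Toy data (assumed): 2 classes x 2 attributes; the attribute agrees with
# the label on 6 of 8 examples, and the predictor outputs the attribute.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
a      = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = a.copy()

groups = group_ids(y_true, a, n_attrs=2)
avg_acc = float(np.mean(y_pred == y_true))
wg_acc = worst_group_accuracy(y_pred, y_true, groups)
wg_loss = worst_group_loss(1 - (y_pred == y_true), groups)  # 0-1 loss

# Average accuracy is 0.75, yet worst-group accuracy is 0.0: the two
# minority groups (label and attribute disagree) are always misclassified.
print(avg_acc, wg_acc, wg_loss)
```

The gap between the average and the worst-group figure is exactly what average-risk minimization hides.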

This problem setting extends beyond standard average-risk minimization and includes subpopulation shift, fairness, and adversarial robustness contexts.

2. Core Methodological Approaches

2.1 Data Balancing Strategies

Simple data balancing, either by subsampling or by reweighting groups, often achieves performance approaching that of more complicated worst-group optimization procedures (Idrissi et al., 2021). The two main variants are:

SUBG (group-wise subsampling): Subsample each group to the size of the smallest group, discarding surplus samples without replacement. Standard ERM is then trained on the reduced data.
RWG (group-wise reweighting): Assign each example a weight $w_i \propto 1/n_{g(i)}$ (inverse group frequency, where $n_{g(i)}$ is the size of the group containing example $i$), so that each group has equal expected influence in optimization.

Implementation involves standard minibatch SGD, with weighted loss or sample probabilities per group. SUBG exhibits hyperparameter robustness and computational efficiency.
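Both balancing schemes reduce to a few lines of index and weight bookkeeping. The sketch below uses NumPy only; the function names `subg_indices` and `rwg_weights` and the group sizes are assumptions chosen for illustration.

```python
import numpy as np

def subg_indices(groups, rng):
    """SUBG: keep n_min examples per group, drawn without replacement."""
    groups = np.asarray(groups)
    ids, counts = np.unique(groups, return_counts=True)
    n_min = counts.min()
    keep = [
        rng.choice(np.flatnonzero(groups == g), size=n_min, replace=False)
        for g in ids
    ]
    return np.sort(np.concatenate(keep))

def rwg_weights(groups):
    """RWG: weight each example by 1 / n_g, normalized to sum to 1."""
    groups = np.asarray(groups)
    _, inverse, counts = np.unique(
        groups, return_inverse=True, return_counts=True
    )
    w = 1.0 / counts[inverse]
    return w / w.sum()

rng = np.random.default_rng(0)
# Assumed toy group sizes: 80, 12, 5 and 3 examples.
groups = np.array([0] * 80 + [1] * 12 + [2] * 5 + [3] * 3)

idx = subg_indices(groups, rng)   # indices of the balanced subset
w = rwg_weights(groups)           # per-example weights

# RWG can be applied either as a weighted loss, sum_i w[i] * loss[i],
# or as sampling probabilities when drawing minibatches:
batch = rng.choice(len(groups), size=32, p=w)

print(np.bincount(groups[idx]))          # every group has 3 examples
print(np.bincount(groups, weights=w))    # every group has total weight 0.25
```

Note that SUBG here keeps only 12 of 100 examples, which is the source of both its speed and its cost in discarded majority data.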

2.2 Worst-Group Distributionally Robust Optimization (Group-DRO)

Group-DRO sets up a minimax objective,

$\min_{\theta}\; \max_{g\in G}\; \hat{\mathcal{L}}_g(\theta),$

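where $\hat{\mathcal{L}}_g(\theta)$ is the empirical loss on group $g$ (Sagawa et al., 2019). Optimization alternates SGD steps on $\theta$ with mirror ascent over group weights $q$ on the probability simplex, typically using an exponentiated-gradient update: $q_g^{(t)} \propto q_g^{(t-1)} \exp\{\eta_q\, \hat{\mathcal{L}}_g(\theta^{(t-1)})\}$, with $q$ renormalized to the simplex. The parameters $\theta$ are then updated on the $q$-weighted loss $\sum_{g \in G} q_g\, \hat{\mathcal{L}}_g(\theta)$, so groups with persistently high loss receive progressively more weight; this emphasis on the worst-off groups underlies the method's theoretical and practical robustness improvements.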
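The alternating update can be sketched for a linear logistic model in plain NumPy. This is an illustrative sketch under stated assumptions (full-batch gradients, a binary label in {0, 1}, synthetic data, and hypothetical names `group_losses_and_grads` and `group_dro_step`), not the reference implementation of Sagawa et al.

```python
import numpy as np

def group_losses_and_grads(theta, X, y, groups, n_groups):
    """Per-group mean logistic loss and its gradient w.r.t. theta."""
    z = X @ theta
    per_sample = np.logaddexp(0.0, z) - y * z   # log(1 + e^z) - y*z
    resid = 0.5 * (1.0 + np.tanh(0.5 * z)) - y  # sigmoid(z) - y
    losses = np.zeros(n_groups)
    grads = np.zeros((n_groups, theta.size))
    for g in range(n_groups):
        mask = groups == g
        if mask.any():
            losses[g] = per_sample[mask].mean()
            grads[g] = X[mask].T @ resid[mask] / mask.sum()
    return losses, grads

def group_dro_step(theta, q, X, y, groups, lr=0.1, eta_q=0.01):
    """One iteration: mirror ascent on q, then descent on theta."""
    losses, grads = group_losses_and_grads(theta, X, y, groups, q.size)
    q = q * np.exp(eta_q * losses)       # exponentiated-gradient ascent
    q = q / q.sum()                      # renormalize onto the simplex
    theta = theta - lr * (q @ grads)     # gradient of sum_g q_g * L_g
    return theta, q

# Synthetic data (assumed): 4 imbalanced groups sharing one linear signal.
rng = np.random.default_rng(0)
n, d, n_groups = 400, 5, 4
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)
groups = rng.choice(n_groups, size=n, p=[0.6, 0.2, 0.12, 0.08])

theta = np.zeros(d)
q = np.full(n_groups, 1.0 / n_groups)
for _ in range(200):
    theta, q = group_dro_step(theta, q, X, y, groups)

final_losses, _ = group_losses_and_grads(theta, X, y, groups, n_groups)
print(q, final_losses.max())
```

The step sizes `lr` and `eta_q` are arbitrary here; in practice both are tuned, and a larger `eta_q` makes the weights concentrate more sharply on the highest-loss group.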

2.3 Regularization and Model Selection

Effective worst-group generalization in overparameterized models depends crucially on model selection using validation-set group labels and on regularization (e.g., a strong $\ell_2$ penalty or early stopping). Without such constraints, Group-DRO and ERM often both reach zero worst-group training loss but diverge dramatically in test performance (Sagawa et al., 2019, Idrissi et al., 2021).

Removing attribute labels at validation time eliminates the worst-group performance gains (a loss of 10–20 percentage points), while omitting early stopping or weight decay destabilizes group reweighting schemes (Idrissi et al., 2021).
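Checkpoint selection with validation group labels amounts to scoring each saved epoch by worst-group rather than average validation accuracy. The following sketch uses hand-made validation predictions for three hypothetical epochs; the function names and all data are assumptions for illustration.

```python
import numpy as np

def worst_group_accuracy(y_pred, y_true, groups):
    """Minimum over groups of the within-group accuracy."""
    correct = np.asarray(y_pred) == np.asarray(y_true)
    groups = np.asarray(groups)
    return min(correct[groups == g].mean() for g in np.unique(groups))

def select_checkpoint(val_preds_per_epoch, y_val, g_val):
    """Return (best_epoch, scores) under worst-group validation accuracy."""
    scores = [
        worst_group_accuracy(p, y_val, g_val) for p in val_preds_per_epoch
    ]
    return int(np.argmax(scores)), scores

# Assumed validation set: a majority group of 6 and a minority group of 2.
y_val = np.array([0, 0, 0, 0, 0, 0, 1, 1])
g_val = np.array([0, 0, 0, 0, 0, 0, 1, 1])

# Assumed predictions saved at three epochs.
val_preds = [
    np.array([0, 0, 0, 0, 0, 0, 0, 0]),  # ignores the minority group
    np.array([0, 0, 0, 0, 1, 1, 1, 1]),  # trades majority for minority
    np.array([0, 0, 0, 0, 0, 0, 1, 0]),  # best on average
]

best_epoch, wg_scores = select_checkpoint(val_preds, y_val, g_val)
avg_scores = [float(np.mean(p == y_val)) for p in val_preds]
avg_epoch = int(np.argmax(avg_scores))

# Average accuracy selects epoch 2; worst-group accuracy selects epoch 1.
print(avg_epoch, best_epoch)
```

The same scoring function can be used as an early-stopping criterion or to rank hyperparameter configurations.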

3. Empirical Insights and Benchmark Results

The effectiveness of worst-group optimization is supported by empirical results on datasets exhibiting severe group imbalance and spurious feature correlations, such as Waterbirds, CelebA, MultiNLI, and CivilComments. Reported test worst-group accuracies are:

Method	Empirical Test Worst-Group Acc [%]
ERM	73.5 (±1.4)
JTT	74.1
RWY	76.2
SUBY	69.6
RWG	78.4
SUBG	78.8
gDRO	80.5

Group-aware balancing baselines (SUBG, RWG) match or nearly match state-of-the-art methods like gDRO, and are statistically indistinguishable from it except under weak spurious correlation (MultiNLI), where gDRO offers an advantage (~8 points) (Idrissi et al., 2021). Their class-only counterparts, RWY and SUBY, balance by the label $y$ alone; JTT (Just Train Twice) is a two-stage method that does not use group labels during training. SUBG is approximately seven times faster per epoch than gDRO, owing to the smaller training set and simpler computation.

Ablation studies confirm that attribute labels are essential at validation time, specifically for checkpoint selection. Furthermore, subsampling is robust to hyperparameters, while reweighting requires careful early stopping and weight decay.

4. Theoretical Developments

Recent theory clarifies why balancing improves worst-group error. Linear models, trained under heavy-tailed group-conditional features, learn separators whose orientation is biased by the prevalence and extremity of majority group samples. When tails are long, subsampling restores geometric symmetry, centering the decision boundary and yielding lower worst-group error (Chaudhuri et al., 2022).

For features with Gumbel-type tails (including Gaussian mixtures), worst-group error after ERM on the full data scales as $1/\sqrt{pm}$, where $p$ is the majority-group size and $m \ll p$ the minority-group size, whereas after subsampling it scales as $1/p$. If tails are thin (Weibull or uniform), subsampling yields no improvement.

These results provide a geometric perspective: balanced data mitigates spurious correlation-induced tilting of decision boundaries, and helps distributionally robust objectives align with invariants that generalize across groups.

5. Practical Guidelines and Best Practices

Practitioner strategies informed by the evidence base include (Idrissi et al., 2021):

Examine data for class–attribute imbalances, since under-representation of minority groups can promote spurious shortcut learning.
Default to group-wise subsampling (SUBG) for strong worst-group performance with reduced implementation and tuning overhead.
If budget allows, employ group reweighting (RWG) with attentive hyperparameter tuning.
Dedicate group annotation efforts primarily to validation/selection splits rather than for full training supervision.
In settings with only mild spurious correlation, prefer group distributionally robust optimization (gDRO) to avoid discarding useful majority data outright.

These recommendations reinforce the surprising strength of simple data balancing protocols and clarify where more complex methods, such as full group-DRO, deliver incremental gains.
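The first guideline above, examining class–attribute imbalance, can be sketched as a simple contingency table. The function name `group_count_table`, the ratio used as a summary, and the toy counts are assumptions for illustration.

```python
import numpy as np

def group_count_table(y, a, n_classes, n_attrs):
    """table[c, k] = number of examples with class c and attribute k."""
    table = np.zeros((n_classes, n_attrs), dtype=int)
    np.add.at(table, (np.asarray(y), np.asarray(a)), 1)
    return table

# Assumed toy labels: the attribute agrees with the class on most examples.
y = np.array([0] * 95 + [1] * 105)
a = np.array([0] * 90 + [1] * 5 + [0] * 5 + [1] * 100)

counts = group_count_table(y, a, n_classes=2, n_attrs=2)
imbalance_ratio = counts.max() / max(counts.min(), 1)

print(counts)            # rows: classes, columns: attributes
print(imbalance_ratio)   # largest group size / smallest group size
```

A large ratio between the biggest and smallest cell, as in this toy example, is a cue that the attribute may act as a shortcut for the label and that group balancing is worth trying.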

6. Limitations, Open Problems, and Future Directions

Significant open questions remain:

Identifying when and how to automate group-aware balancing or group discovery in the absence of explicit attributes/labels.
Quantitatively characterizing conditions (e.g., tail behavior, group overlap) under which balancing by subsampling or reweighting closes the group-robustness gap.
Developing statistical guarantees for non-convex, overparameterized objectives with limited group annotation.
Extending worst-group optimization to structured, overlapping, or continuous groups, and integrating with privacy, fairness, and interpretability constraints.

Recent theoretical and empirical studies suggest that worst-group optimization is not only tractable but highly effective in practice, often requiring little more than mindful balancing of groups at both training and selection stages to realize robust model behavior across heterogeneous, imbalanced populations (Idrissi et al., 2021, Chaudhuri et al., 2022).


References (3)

Idrissi et al. (2021). Simple data balancing achieves competitive worst-group-accuracy.

Sagawa et al. (2019). Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization.

Chaudhuri et al. (2022). Why does Throwing Away Data Improve Worst-Group Error?
