Worst-Group Optimization in ML

Updated 9 February 2026
  • Worst-group optimization is a method that explicitly minimizes the maximum loss across groups, ensuring robust performance for minority or under-represented populations.
  • Methodologies include data balancing strategies like subsampling (SUBG) and reweighting (RWG) as well as group-DRO to counteract spurious correlations.
  • Empirical studies show that well-tuned group balancing techniques can substantially boost worst-group accuracy, with trade-offs between computational efficiency and hyperparameter sensitivity.

Worst-group optimization refers to the family of machine learning and optimization methodologies that explicitly target the performance of the worst-performing subpopulation or group in a given data distribution. This approach is motivated by the observation that conventional empirical risk minimization (ERM), which optimizes average accuracy, can yield models that systematically underperform on minority groups when the data exhibits distribution shift or severe group imbalance. Worst-group optimization aims to protect such weakest links in the data, producing models with strong robustness properties across heterogeneous populations.

1. Mathematical Formulation and Problem Scope

Let the training set consist of $n$ labeled examples $(x_i, y_i, a_i)$, where $x_i \in \mathcal X$ is the input, $y_i \in \mathcal Y$ is the class label, and $a_i \in \mathcal A$ is an attribute, often associated with a spurious or confounding factor. Define groups $G = \mathcal Y \times \mathcal A$; each group $g = (y, a)$ aggregates all examples of class $y$ and attribute $a$. Denote the set of group-specific empirical distributions as $\{P_g\}$.

For model parameters $\theta$ (e.g., in $f_\theta : \mathcal X \to \mathcal Y$), the worst-group (WG) loss is

$$L_{\text{WG}}(\theta) = \max_{g \in G} \mathbb{E}_{(x, y) \sim P_g}\, \ell(f_\theta(x), y)$$

with $\ell$ a per-sample loss (e.g., cross-entropy). The primary objective is to minimize $L_{\text{WG}}(\theta)$. In classification tasks, performance is often reported as worst-group accuracy: $\text{Acc}_{\text{WG}}(\theta) = \min_{g\in G}\;\mathbb{P}_{(x, y) \sim P_g}\big[f_\theta(x) = y\big]$.
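As a minimal sketch of these two definitions, the following pure-Python snippet estimates the worst-group loss and worst-group accuracy from per-example values. The function name `worst_group_metrics` and the toy numbers are illustrative assumptions, not taken from the cited papers.

```python
from collections import defaultdict

def worst_group_metrics(losses, correct, groups):
    """Return (worst-group loss, worst-group accuracy).

    losses  : per-example loss values
    correct : per-example 0/1 indicators of f_theta(x) == y
    groups  : per-example group ids, e.g. tuples (y, a)
    """
    loss_by_group = defaultdict(list)
    hits_by_group = defaultdict(list)
    for loss, hit, g in zip(losses, correct, groups):
        loss_by_group[g].append(loss)
        hits_by_group[g].append(hit)

    def mean(values):
        return sum(values) / len(values)

    # Max over groups of the mean loss; min over groups of the mean accuracy.
    worst_loss = max(mean(v) for v in loss_by_group.values())
    worst_acc = min(mean(v) for v in hits_by_group.values())
    return worst_loss, worst_acc

# Hypothetical toy data: four majority examples in group (0, 0),
# two minority examples in group (0, 1).
losses = [0.1, 0.2, 0.1, 0.2, 1.0, 2.0]
correct = [1, 1, 1, 1, 1, 0]
groups = [(0, 0)] * 4 + [(0, 1)] * 2

wg_loss, wg_acc = worst_group_metrics(losses, correct, groups)
avg_acc = sum(correct) / len(correct)
```

In this toy case the average accuracy is dominated by the majority group, while the worst-group accuracy is set entirely by the two minority examples, which is the gap the objective targets.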

This problem setting extends beyond standard average-risk minimization and includes subpopulation shift, fairness, and adversarial robustness contexts.

2. Core Methodological Approaches

2.1 Data Balancing Strategies

Simple data balancing—either by subsampling or reweighting groups—often achieves performance approaching that of more complicated worst-group optimization procedures (Idrissi et al., 2021). Define:

  • SUBG (group-wise subsampling): Subsample each group to the size of the smallest group, discarding surplus samples without replacement. Standard ERM is then trained on the reduced data.
  • RWG (group-wise reweighting): Assign each example a weight $w_i \propto 1/n_{g(i)}$ (inverse group frequency), where $n_{g(i)}$ is the size of the group containing example $i$, thereby ensuring each group has equal expected influence in optimization.

Implementation involves standard minibatch SGD, with weighted loss or sample probabilities per group. SUBG exhibits hyperparameter robustness and computational efficiency.
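The two balancing schemes can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the reference implementation of Idrissi et al.: the helper names `subg_indices` and `rwg_weights`, the fixed seed, and the toy group labels are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def subg_indices(groups, seed=0):
    """SUBG: keep as many examples per group as the smallest group has,
    sampled without replacement. Returns the kept example indices."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    n_min = min(len(idx) for idx in by_group.values())
    kept = []
    for idx in by_group.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)

def rwg_weights(groups):
    """RWG: weights proportional to 1 / n_{g(i)}, normalized to sum to 1."""
    counts = Counter(groups)
    raw = [1.0 / counts[g] for g in groups]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical group labels: sizes 6, 3, and 1.
groups = ["a"] * 6 + ["b"] * 3 + ["c"] * 1

kept = subg_indices(groups)     # one index per group, since the smallest group has 1 example
weights = rwg_weights(groups)   # each group's weights sum to 1/3
```

The weights returned by `rwg_weights` can be used either to scale per-example losses or as sampling probabilities, for example with `random.choices(range(len(groups)), weights=weights, k=batch_size)`.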

2.2 Worst-Group Distributionally Robust Optimization (Group-DRO)

Group-DRO sets up a minimax objective,

$$\min_{\theta}\; \max_{g\in G}\; \hat{\mathcal{L}}_g(\theta),$$

where $\hat{\mathcal{L}}_g(\theta)$ is the empirical loss on group $g$ (Sagawa et al., 2019). Optimization alternates SGD steps on $\theta$ with mirror ascent over the group distribution weights $q$, typically using an exponentiated-gradient update: $q_g^{(t)} \propto q_g^{(t-1)} \exp\{\eta_q\, \mathcal{L}_g^{(t-1)}\}$, with $q$ renormalized to the simplex. Because this update shifts weight toward the groups with the highest current loss, the model step concentrates on the worst-off groups, which underlies the method's theoretical and practical robustness improvements.
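The weight update can be sketched in isolation as follows. This covers only the ascent step on $q$ and the weighted objective that the model step would descend on; the per-group losses are hypothetical and held fixed purely for illustration, whereas in real training they change after every SGD step on $\theta$. The function names are assumptions, not part of any library.

```python
import math

def update_group_weights(q, group_losses, eta_q):
    """Exponentiated-gradient step: q_g <- q_g * exp(eta_q * L_g),
    followed by renormalization onto the probability simplex."""
    unnorm = [q_g * math.exp(eta_q * loss) for q_g, loss in zip(q, group_losses)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def robust_loss(q, group_losses):
    """Weighted objective for the model step: sum_g q_g * L_g."""
    return sum(q_g * loss for q_g, loss in zip(q, group_losses))

q = [0.25, 0.25, 0.25, 0.25]          # start from uniform group weights
group_losses = [0.2, 0.3, 0.1, 1.5]   # hypothetical losses; group 3 is the worst

for _ in range(5):
    # In actual training, an SGD step on theta using robust_loss(q, ...)
    # would happen here and group_losses would be recomputed.
    q = update_group_weights(q, group_losses, eta_q=1.0)

objective = robust_loss(q, group_losses)
```

With the losses fixed, the weights concentrate on the highest-loss group and the weighted objective approaches the maximum group loss, which is the sense in which the update approximates the inner maximization.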

2.3 Regularization and Model Selection

Effective worst-group generalization in overparameterized models depends crucially on model selection using validation-set group labels and on regularization (e.g., a strong $\ell_2$ penalty or early stopping). Without such constraints, Group-DRO and ERM often both reach zero worst-group training loss but diverge dramatically in test performance (Sagawa et al., 2019, Idrissi et al., 2021).

Removing attribute labels at validation eliminates the worst-group performance gains (loss of 10–20 percentage points), while omitting early stopping or weight decay destabilizes group reweighting schemes (Idrissi et al., 2021).
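Checkpoint selection with group-labeled validation data can be sketched as below. The per-epoch correctness vectors are hypothetical toy values, and `select_checkpoint` is an illustrative helper, not an API from the cited work. Early stopping then amounts to keeping the checkpoint this procedure picks.

```python
def worst_group_accuracy(correct, groups):
    """Minimum over groups of the per-group accuracy."""
    totals, hits = {}, {}
    for c, g in zip(correct, groups):
        totals[g] = totals.get(g, 0) + 1
        hits[g] = hits.get(g, 0) + c
    return min(hits[g] / totals[g] for g in totals)

def select_checkpoint(val_correct_by_epoch, val_groups):
    """Return (best_epoch, score), scoring each epoch by worst-group
    validation accuracy; ties go to the earliest epoch."""
    scores = [worst_group_accuracy(c, val_groups) for c in val_correct_by_epoch]
    best_epoch = max(range(len(scores)), key=lambda e: scores[e])
    return best_epoch, scores[best_epoch]

# Hypothetical validation set: six majority examples (group 0), two minority (group 1).
val_groups = [0, 0, 0, 0, 0, 0, 1, 1]
val_correct_by_epoch = [
    [1, 1, 1, 1, 0, 0, 1, 1],  # epoch 0
    [1, 1, 1, 1, 1, 1, 1, 0],  # epoch 1
    [1, 1, 1, 1, 1, 1, 0, 0],  # epoch 2
]

best_epoch, best_score = select_checkpoint(val_correct_by_epoch, val_groups)

# For comparison: selection by average accuracy, which needs no group labels.
avg_scores = [sum(c) / len(c) for c in val_correct_by_epoch]
avg_best_epoch = max(range(len(avg_scores)), key=lambda e: avg_scores[e])
```

In this toy example the two criteria disagree: average accuracy favors a later epoch that has started to fail on the minority group, while the worst-group criterion keeps the earlier checkpoint.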

3. Empirical Insights and Benchmark Results

The effectiveness of worst-group optimization is underpinned by empirical results on datasets exhibiting severe group imbalance and spurious feature correlations. For instance, in Waterbirds, CelebA, MultiNLI, and CivilComments:

| Method | Test worst-group accuracy (%) |
|---|---|
| ERM | 73.5 (±1.4) |
| JTT | 74.1 |
| RWY | 76.2 |
| SUBY | 69.6 |
| RWG | 78.4 |
| SUBG | 78.8 |
| gDRO | 80.5 |

Group-aware balancing baselines (SUBG, RWG) match or nearly match state-of-the-art methods like gDRO, and are statistically indistinguishable except under weak spurious correlation (MultiNLI), where gDRO offers an advantage (~8 points) (Idrissi et al., 2021). SUBG is approximately seven times faster per epoch than gDRO, due to reduction in training set size and computational simplicity.

Ablation studies confirm that attribute usage—specifically, for checkpoint selection in validation—is essential. Furthermore, subsampling is robust to hyperparameters, while reweighting requires careful early stopping and weight decay.

4. Theoretical Developments

Recent theory clarifies why balancing improves worst-group error. Linear models, trained under heavy-tailed group-conditional features, learn separators whose orientation is biased by the prevalence and extremity of majority group samples. When tails are long, subsampling restores geometric symmetry, centering the decision boundary and yielding lower worst-group error (Chaudhuri et al., 2022).

For features with Gumbel-type tails (including Gaussian mixtures), worst-group error following full ERM scales as $1/\sqrt{pm}$ (with $p \gg m$, where $p$ is the majority size), whereas after subsampling it scales as $1/p$. If tails are thin (Weibull or uniform), no improvement is possible by subsampling.

These results provide a geometric perspective: balanced data mitigates spurious correlation-induced tilting of decision boundaries, and helps distributionally robust objectives align with invariants that generalize across groups.

5. Practical Guidelines and Best Practices

Practitioner strategies informed by the evidence base include (Idrissi et al., 2021):

  1. Examine data for class–attribute imbalances, since under-represented minority groups can promote spurious shortcut learning.
  2. Default to group-wise subsampling (SUBG) for strong worst-group performance with reduced implementation and tuning overhead.
  3. If budget allows, employ group reweighting (RWG) with attentive hyperparameter tuning.
  4. Dedicate group annotation efforts primarily to validation/selection splits rather than for full training supervision.
  5. In settings with only mild spurious correlation, prefer Group-DRO (gDRO), which reweights groups softly rather than discarding useful majority data outright.

These recommendations reinforce the surprising strength of simple data balancing protocols and clarify where more complex methods, such as full group-DRO, deliver incremental gains.
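The first guideline, auditing class–attribute imbalance, can be sketched as a simple group count. The labels, attributes, and helper names below are hypothetical and serve only to show the kind of check intended.

```python
from collections import Counter

def group_counts(labels, attributes):
    """Count examples in each (label, attribute) group."""
    return Counter(zip(labels, attributes))

def imbalance_ratio(counts):
    """Ratio of the largest group size to the smallest group size."""
    return max(counts.values()) / min(counts.values())

# Hypothetical dataset in which the attribute is strongly correlated with the label.
labels = [0] * 100 + [1] * 100
attributes = [0] * 90 + [1] * 10 + [0] * 5 + [1] * 95

counts = group_counts(labels, attributes)
ratio = imbalance_ratio(counts)
smallest_group = min(counts, key=counts.get)
```

A large ratio, as in this toy data, signals that the attribute is nearly predictive of the label on the training set, which is the situation where SUBG or RWG is expected to help most.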

6. Limitations, Open Problems, and Future Directions

Significant open questions remain:

  • Identifying when and how to automate group-aware balancing or group discovery in the absence of explicit attributes/labels.
  • Quantitatively characterizing conditions (e.g., tail behavior, group overlap) under which balancing by subsampling or reweighting closes the group-robustness gap.
  • Developing statistical guarantees for non-convex, overparameterized objectives with limited group annotation.
  • Extending worst-group optimization to structured, overlapping, or continuous groups, and integrating with privacy, fairness, and interpretability constraints.

Recent theoretical and empirical studies suggest that worst-group optimization is not only tractable but highly effective in practice, often requiring little more than mindful balancing of groups at both training and selection stages to realize robust model behavior across heterogeneous, imbalanced populations (Idrissi et al., 2021, Chaudhuri et al., 2022).
