
SharpDRO: Sharpness-Aware Robust Optimization

Updated 13 January 2026
  • SharpDRO is a robust optimization method that integrates sharpness-aware penalties to mitigate overfitting on rare, severely corrupted data.
  • It employs a min–max–min structure with per-example, SAM-style parameter perturbations to form a flat and reliable loss landscape for the worst-case examples.
  • Empirical results on CIFAR and ImageNet benchmarks show significant gains in robustness, particularly for high-severity corruptions, outperforming standard DRO.

SharpDRO is a robust optimization method—“Sharpness-aware Distributionally Robust Optimization”—designed to achieve robust generalization on data mixtures where rare, severely corrupted examples (notably photon-limited corruptions) are present. Unlike traditional Distributionally Robust Optimization (DRO), which minimizes the worst-case empirical risk and consequently may produce sharp loss landscapes with poor test generalization, SharpDRO augments the standard DRO formulation with an explicit sharpness minimization penalty concentrated on the hardest (worst-case) distributions or examples. This strategy promotes solutions that not only achieve low risk but are also robust to local perturbations in the parameter space for the most challenging subsets of data (Huang et al., 2023).

1. Formal Problem Statement and Objective

Let $\Theta$ denote the parameter space and $\mathcal{L}(\theta; (x, y))$ the per-example loss. Training data are drawn from a mixture of sub-distributions $P_0, \ldots, P_S$ reflecting corruption severities $s = 0, \dots, S$, with $s \sim \mathrm{Poi}(\lambda)$. The overall data distribution is $P = \sum_s P_s$.

Traditional DRO formulates robust risk minimization as
$$\min_{\theta \in \Theta} \max_{Q \in \mathcal{Q}} \, \mathbb{E}_{(x, y) \sim Q}[\mathcal{L}(\theta; (x, y))],$$
where $\mathcal{Q}$ is typically an $f$-divergence ball or a set of mixture distributions.

SharpDRO introduces a sharpness penalty, focusing the sharpness minimization on the worst-case empirical distribution $Q^*$. The objective becomes
$$\min_{\theta \in \Theta} \left\{ \mathbb{E}_{(x, y) \sim Q^*}[\mathcal{L}(\theta; (x, y))] + \mathbb{E}_{(x, y) \sim Q^*}[\mathcal{R}(\theta; (x, y))] \right\},$$
where $\mathcal{R}(\theta; (x, y))$ measures the sharpness of the loss landscape locally at $\theta$ for each sample. $Q^*$ can be parameterized by weights $\omega$ over sub-distributions (distribution-aware) or by per-example out-of-distribution (OOD) scores $\omega_i$ (distribution-agnostic), leading to
$$\min_{\theta} \max_{\omega \in \Delta} \left( \mathbb{E}_i [\omega_i \mathcal{L}(\theta; (x_i, y_i))] + \mathbb{E}_i [\omega_i \mathcal{R}(\theta; (x_i, y_i))] \right),$$
with $\Delta$ being the appropriate simplex.
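Because the inner maximization is linear in $\omega$ over the simplex $\Delta$, its exact solution places all weight on the component with the largest objective value; practical implementations typically use smoothed or incremental updates instead. The following is a minimal NumPy sketch of that inner step; the per-group objective values and the temperature-smoothed variant are illustrative assumptions, not details from the paper.

import numpy as np

def worst_case_weights(group_objectives, temperature=None):
    """Inner max over the simplex of a linear objective sum_s w_s * J_s."""
    j = np.asarray(group_objectives, dtype=float)
    if temperature is None:
        # Exact maximizer: one-hot on the largest per-group objective.
        w = np.zeros_like(j)
        w[np.argmax(j)] = 1.0
    else:
        # Softened alternative (illustrative assumption, not the paper's rule).
        w = np.exp(j / temperature)
        w /= w.sum()
    return w

# Per-severity objective values (e.g., loss, or loss plus sharpness); made up.
objectives = [0.4, 0.6, 0.9, 1.5, 2.3, 3.1]
print(worst_case_weights(objectives))        # hard worst-case weights
print(worst_case_weights(objectives, 0.5))   # smoothed weights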

2. Sharpness Definition and Computation

SharpDRO quantifies sharpness per example as the maximum increase in loss under an $\ell_2$-bounded parameter perturbation:
$$\mathcal{R}(\theta; (x, y)) = \max_{\|\epsilon\|_2 \leq \rho} \left\{ \mathcal{L}(\theta + \epsilon; (x, y)) - \mathcal{L}(\theta; (x, y)) \right\}.$$
For smooth $\mathcal{L}$ and small $\rho$, a first-order approximation yields

$$\epsilon^* = \rho \, \frac{\nabla_\theta \mathcal{L}(\theta; (x, y))}{\|\nabla_\theta \mathcal{L}(\theta; (x, y))\|_2}.$$

In practice, sharpness is computed with two forward/backward passes per batch: one evaluating the loss (and its gradient) at $\theta$, and one at the perturbed point $\theta + \epsilon^*$.
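A minimal PyTorch-style sketch of this two-pass computation is shown below; the model, loss function (assumed to return per-example losses), and batch tensors are placeholders, and the code illustrates the SAM-style perturbation rather than reproducing the authors' implementation.

import torch

def per_example_sharpness(model, loss_fn, x, y, rho=0.05):
    """Two-pass sharpness estimate: loss(theta + eps*) - loss(theta)."""
    # Pass 1: per-example loss and gradient at the current parameters theta.
    loss_clean = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss_clean.mean(), model.parameters())

    # First-order worst-case perturbation eps* = rho * g / ||g||_2.
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    eps = [rho * g / (grad_norm + 1e-12) for g in grads]

    # Pass 2: per-example loss at theta + eps*, then restore the parameters.
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.add_(e)
    loss_perturbed = loss_fn(model(x), y)
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            p.sub_(e)

    return loss_perturbed - loss_clean  # per-example sharpness R(theta; (x, y))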

3. Optimization Structure and Algorithm

Standard DRO is a min–max problem:
$$\min_\theta \max_{\omega \in \Delta} \mathbb{E}[\omega \, \mathcal{L}(\theta)].$$
SharpDRO extends this to a min–max–min (or min–max–max) structure by penalizing sharpness only for the worst-case distribution:
$$\min_\theta \left\{ \mathbb{E}_i [\omega^*_i(\theta) \, \mathcal{L}(\theta; (x_i, y_i))] + \mathbb{E}_i [\omega^*_i(\theta) \, \mathcal{R}(\theta; (x_i, y_i))] \right\}$$
subject to $\omega^*(\theta) = \arg\max_{\omega \in \Delta} \mathbb{E}[\omega \, \mathcal{L}(\theta)]$.
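By the definition of $\mathcal{R}$ in Section 2, the per-example risk-plus-sharpness term collapses to the loss evaluated at the perturbed parameters,
$$\mathcal{L}(\theta;(x,y)) + \mathcal{R}(\theta;(x,y)) = \max_{\|\epsilon\|_2 \leq \rho} \mathcal{L}(\theta+\epsilon;(x,y)) \approx \mathcal{L}(\theta+\epsilon^*;(x,y)),$$
so the min-step in the pseudocode below effectively only needs the gradient of the reweighted loss at $\theta + \epsilon^*$.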

The iterative algorithm follows these main steps:

  • Max-step (worst-case reweighting): Find or update $\omega$ to focus on the hardest (group or example) distributions.
  • Min-step (parameter update): Update $\theta$ with respect to risk and sharpness under $\omega$.

Training Pseudocode (abbreviated)

for t = 0 ... T-1:
    # Max-step: update ω
    if distribution-aware:
        ω_{t+1} ← argmax_{ω∈Δ} Σ_s ω_s E_{(x,y)∼P_s}[ℒ(θ_t;(x,y))]
    else:
        # OOD scoring: ω_{t+1,i} ∝ max f(θ_t;x_i) − max f(θ_t+ε*;x_i)
        normalize ω_{t+1} to the simplex

    # Min-step: update θ on risk + sharpness
    L₁ = E_{i in batch}[ω_{t+1,i} ℒ(θ_t;(x_i,y_i))]
    θ' = θ_t + ρ · ∇_θ L₁ / ‖∇_θ L₁‖₂
    L₂ = E_{i in batch}[ω_{t+1,i} ℒ(θ';(x_i,y_i))]
    θ_{t+1} = θ_t − η_θ [∇_θ L₁ + (∇_θ L₂ − ∇_θ L₁)]    # = θ_t − η_θ ∇_θ L₂
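For concreteness, here is a minimal PyTorch sketch of one distribution-aware SharpDRO step following the pseudocode above. The hard one-hot choice of ω, the group-index tensor, and the helper names are illustrative assumptions, not the paper's reference implementation.

import torch

def sharpdro_step(model, loss_fn, optimizer, x, y, group, num_groups, rho=0.05):
    """One SharpDRO step: worst-case reweighting, then a SAM-style update."""
    # Per-example losses at theta (loss_fn is assumed to use reduction='none').
    losses = loss_fn(model(x), y)

    # Max-step: place all weight on the group with the largest mean loss
    # (hard argmax over the simplex; smoothed updates are equally possible).
    group_loss = torch.stack([
        losses[group == s].mean() if (group == s).any() else losses.new_zeros(())
        for s in range(num_groups)
    ])
    weights = (group == group_loss.argmax()).float()
    weights = weights / weights.sum().clamp(min=1)

    # Min-step, pass 1: gradient of the reweighted loss L1 at theta.
    optimizer.zero_grad()
    (weights * losses).sum().backward()
    params = [p for p in model.parameters() if p.grad is not None]
    grads = [p.grad.detach().clone() for p in params]
    grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))

    # Perturb parameters by eps* = rho * g / ||g||_2.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(rho * g / (grad_norm + 1e-12))

    # Min-step, pass 2: gradient of the reweighted loss L2 at theta + eps*.
    optimizer.zero_grad()
    (weights * loss_fn(model(x), y)).sum().backward()

    # Undo the perturbation, then update theta using the perturbed gradient.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(rho * g / (grad_norm + 1e-12))
    optimizer.step()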

4. Theoretical Properties

Under the following conditions:

  • $\mathcal{L}(\theta, \omega; (x, y))$ is differentiable and $L$-smooth in $\theta$ and $\omega$.
  • In $\omega$, $\mathcal{L}$ satisfies a Polyak–Łojasiewicz (PL) condition with constant $\mu > 0$.
  • Stochastic gradients have bounded variance $\sigma^2$.

Defining $L^*(\theta) = \max_\omega \mathcal{L}(\theta, \omega)$, the SharpDRO training loop converges to an $\epsilon$-stationary point:
$$\frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E} \|\nabla L^*(\theta_t)\|^2 = \mathcal{O}\left(\frac{\kappa^2}{\sqrt{MT}}\right),$$
where $\kappa = L/\mu$ and $M$ is the batch size. Achieving $\mathbb{E}\|\nabla L^*\|^2 \leq \epsilon^2$ requires $T = \mathcal{O}(\kappa^4 / (\epsilon^4 M))$.
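The iteration complexity follows directly from this rate: requiring the right-hand side to be at most $\epsilon^2$ (up to constants) and solving for $T$ gives
$$\frac{\kappa^2}{\sqrt{MT}} \leq \epsilon^2 \;\Longleftrightarrow\; \sqrt{MT} \geq \frac{\kappa^2}{\epsilon^2} \;\Longleftrightarrow\; T \geq \frac{\kappa^4}{\epsilon^4 M}.$$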

The proof leverages Danskin’s theorem to show smoothness, and constructs a potential function to analyze joint suboptimality in $\omega$ and descent in $\theta$.

5. Experimental Methodology

Experiments are conducted using CIFAR-10, CIFAR-100, and ImageNet30. The backbone is Wide ResNet-28-2, trained with SGD (learning rate 0.03, momentum 0.9, weight decay $5 \times 10^{-4}$), 200 epochs, and batch size 128.

Corruptions: For each sample, severity $s$ is sampled from $\mathrm{Poi}(\lambda=1)$, i.e., $P(s=0,\dots,5) \approx \{0.37, 0.37, 0.18, 0.06, 0.015, 0.003\}$. Four corruption types (Gaussian noise, JPEG compression, Snow, Shot noise) are applied, following Hendrycks & Dietterich (2019). Clean images correspond to $s=0$.
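As a quick sanity check of the listed probabilities, the following snippet computes the $\mathrm{Poi}(\lambda=1)$ mass over severities 0 through 5 and samples per-example severities from it. This is an illustrative reconstruction of the corruption-sampling protocol, not the authors' data pipeline; in particular, truncating the tail to severity 5 is an assumption.

import math
import random

lam, S = 1.0, 5

# Poisson(lambda=1) probability mass for severities s = 0..5.
pmf = [math.exp(-lam) * lam**s / math.factorial(s) for s in range(S + 1)]
print([round(p, 3) for p in pmf])  # [0.368, 0.368, 0.184, 0.061, 0.015, 0.003]

# Sample a severity for each training example; severities above 5 are simply
# not generated here (tail truncation is an assumption).
severities = random.choices(range(S + 1), weights=pmf, k=10)
print(severities)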

Evaluation protocols: Test accuracy is reported per severity and averaged over all $s$.

Baselines:

  • Distribution-aware: ERM, IRM, REx, GroupDRO
  • Distribution-agnostic: Just‐Train‐Twice (JTT), EIIL

Hyperparameters: Perturbation radius $\rho = 0.05$ (as in SAM), with learning rates tuned on a small validation set.

6. Empirical Findings

  • Robustness across severities: On Gaussian-noise CIFAR-10, SharpDRO yields a $+4.2\%$ absolute improvement at $s=5$ over the best DRO baseline, and $+1.1\%$ even on clean data. Similar trends hold on CIFAR-100 and ImageNet30, and across all corruption types.
  • Distribution-agnostic performance: SharpDRO with OOD selection surpasses JTT/EIIL by up to $+5.1\%$ at $s=5$ on ImageNet30.
  • Ablation studies: Removing data selection (i.e., standard SAM on the full mixture) benefits clean accuracy but significantly reduces performance on highly corrupted data (underperforming GroupDRO). Disabling sharpness minimization (GroupDRO) fails to achieve flat worst-case loss surfaces, consistently underperforming across all corruption severities.
  • Sensitivity to hyperparameters: Increasing $\rho$ improves worst-case ($s=5$) accuracy with a slight reduction in clean accuracy, reflecting a trade-off between perturbation radius and flatness.
  • OOD scoring: The OOD score $\omega_i = \max f(\theta; x_i) - \max f(\theta+\epsilon^*; x_i)$ effectively isolates high-severity samples (see the sketch after this list).
  • Training stability: SharpDRO produces the smallest and most uniform gradient norms across all severity levels, evidencing balanced optimization dynamics.
  • Computational efficiency: The method adds negligible overhead per epoch compared to SAM, as sharpness is evaluated only for the focused hard subset rather than the entire mixture.
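A small PyTorch sketch of the OOD scoring rule from the findings above follows. Reading $f$ as the maximum softmax confidence and normalizing the scores over the batch are assumptions made for illustration, not details confirmed by the paper.

import torch
import torch.nn.functional as F

@torch.no_grad()
def ood_scores(logits_clean, logits_perturbed):
    """omega_i = max f(theta; x_i) - max f(theta + eps*; x_i)."""
    # f is taken to be the softmax confidence (an assumption).
    conf_clean = F.softmax(logits_clean, dim=1).max(dim=1).values
    conf_perturbed = F.softmax(logits_perturbed, dim=1).max(dim=1).values
    # Samples whose confidence drops most under the perturbation score highest.
    scores = (conf_clean - conf_perturbed).clamp(min=0)
    return scores / scores.sum().clamp(min=1e-12)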

7. Context and Significance

SharpDRO addresses a known limitation of standard DRO by mitigating overfitting to rare, severely corrupted subsets that are prone to sharp and poorly generalizing minima. By integrating sharpness penalization targeted at worst-case distributions or severe examples, SharpDRO enables robust generalization and consistent performance gains in challenging realistic settings involving photon-limited corruptions. The method maintains strong theoretical convergence properties and is empirically validated on multiple image-classification benchmarks, outperforming existing robust and OOD optimization baselines (Huang et al., 2023).
