
PGD Hyperparameter Insights

Updated 15 April 2026
  • The paper demonstrates that tuning hyperparameters, such as switching to the Adam optimizer and decaying the step-size, can significantly improve PGD attack performance and stability.
  • The analysis reveals that optimizer selection, step-size scheduling, and surrogate loss choices distinctly influence PGD's convergence and robustness.
  • A multi-targeted approach, which maximizes convex surrogates across classes, offers a theoretical basis for enhanced adversarial testing efficacy.

Projected Gradient Descent (PGD) is the canonical method for white-box adversarial testing under norm-bounded perturbations. The attack’s performance, convergence, and stability are tightly governed by the interplay of its hyperparameters: optimizer, step-size schedule, surrogate loss, and advanced constructions like MultiTargeted surrogates. Each hyperparameter modulates attack success and efficiency, and their configuration has led to state-of-the-art adversarial test results against robust models such as those of MadryLab and TRADES (Gowal et al., 2019).

1. Optimizer Choice: SGD‐sign, Momentum, Adam

Gradient-based updates in PGD can be computed using three principal optimizers, each of which shapes the attack trajectory differently:

  • SGD-sign (FGSM$^K$): Utilizes discrete sign updates:

$$\delta_{k+1} = \operatorname{Proj}_{S} \left( \delta_{k} + \alpha \cdot \text{sign}\left(\nabla_{x} \hat{L}(f(x+\delta_k), y)\right) \right)$$

This approach is computationally cheap and tuning is straightforward when network activations are unsaturated. However, it displays high sensitivity to the step-size $\alpha$ and can exhibit oscillatory or plateaued trajectories, especially in regions of the loss surface with flat gradients.

  • Momentum: Incorporates an exponential moving average of past gradients:

$$m_{k} = \beta\,m_{k-1} + (1-\beta)\nabla_{x} \hat{L}(f(x+\delta_k), y)$$

$$\delta_{k+1} = \operatorname{Proj}_{S} (\delta_{k} + \alpha \cdot \text{sign}(m_k))$$

This can aid in traversing shallow loss basins and produces smoother gradients but demands additional memory and remains sensitive to $\alpha$ scheduling.

  • Adam: Employs both first and second moment estimates:

With $g_k = \nabla_{x} \hat{L}(f(x+\delta_k), y)$ denoting the gradient at step $k$:

$$m_k = \beta_1 m_{k-1} + (1-\beta_1)g_k;\quad v_k = \beta_2 v_{k-1} + (1-\beta_2)g_k^2$$

$$\hat{m}_k = m_k / (1-\beta_1^{k});\quad \hat{v}_k = v_k / (1-\beta_2^{k})$$

$$\delta_{k+1} = \operatorname{Proj}_{S}\left(\delta_{k} + \alpha \cdot \frac{\hat{m}_k}{\sqrt{\hat{v}_k}+\epsilon}\right)$$

Adam is empirically the most stable across architectures and step-sizes, requiring minimal manual tuning. The modest computation overhead is amortized by a substantial reduction in restarts and attack iterations.

In summary, Adam yields the best out-of-the-box stability and attack strength. Momentum is marginally less effective but more robust than the sign-based variant, which is fast but demonstrably brittle (Gowal et al., 2019).

| Optimizer | Pros | Cons |
|---|---|---|
| SGD-sign | Cheap; simple tuning | Highly sensitive to $\alpha$ |
| Momentum | Smoother trajectory; escapes basins | Needs memory; $\alpha$-sensitive |
| Adam | Stable; less manual tuning | Slightly higher compute |
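The three update rules above can be compared side by side in a few lines of NumPy. This is a minimal sketch, not the implementation from Gowal et al. (2019): the function names `pgd` and `project_linf`, the choice of an L-infinity ball for $S$, and the default moment coefficients are assumptions made for illustration.

```python
import numpy as np

def project_linf(delta, eps):
    """Project onto the L-infinity ball of radius eps (an assumed choice of S)."""
    return np.clip(delta, -eps, eps)

def pgd(grad_fn, x, eps, alpha, steps, optimizer="adam",
        beta=0.9, beta1=0.9, beta2=0.999, adam_eps=1e-8):
    """Maximize a surrogate loss over perturbations delta with ||delta||_inf <= eps.

    grad_fn(x_adv) must return the gradient of the surrogate loss
    with respect to the input, evaluated at x_adv.
    """
    delta = np.zeros_like(x)
    m = np.zeros_like(x)  # first-moment estimate (momentum and Adam)
    v = np.zeros_like(x)  # second-moment estimate (Adam only)
    for k in range(1, steps + 1):
        g = grad_fn(x + delta)
        if optimizer == "sign":
            update = np.sign(g)
        elif optimizer == "momentum":
            m = beta * m + (1 - beta) * g
            update = np.sign(m)
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g ** 2
            m_hat = m / (1 - beta1 ** k)  # bias correction
            v_hat = v / (1 - beta2 ** k)
            update = m_hat / (np.sqrt(v_hat) + adam_eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
        # Ascent step followed by projection back onto the feasible set.
        delta = project_linf(delta + alpha * update, eps)
    return delta

# Toy usage: a linear loss w . x, whose gradient is the constant vector w.
w = np.array([0.5, -2.0, 1.0])
delta = pgd(lambda x_adv: w, np.zeros(3), eps=0.1, alpha=0.05, steps=20)
```

On this toy linear loss all three optimizers reach the same corner of the ball, `eps * sign(w)`; the differences described above only appear on non-linear loss surfaces.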

2. Step-Size ($\alpha$) Scheduling

Step-size $\alpha$ essentially controls the effective resolution of each gradient step. While a fixed $\alpha$ (as in canonical FGSM$^K$) can yield superficial success, it quickly saturates. The use of a scheduled decay for $\alpha$ significantly enhances performance, especially on datasets like CIFAR-10.

The recommended decay regime is as follows:

  • Initial $\alpha$: […]
  • Reduce by a factor of […] at 50% and again at 75% of $K$ (where $K$ is the total number of PGD steps)

Figure 3b of Gowal et al. (2019) illustrates the gain from decay, described as a doubling of the success rate relative to a fixed $\alpha$. In regions where the surrogate loss $\hat{L}$ is smooth (its gradient is Lipschitz with some local constant), an informal bound on $\alpha$ in terms of that constant ensures ascent; because the constant varies locally, decaying $\alpha$ empirically prevents overshoot and trapping on the loss surface.

A practical rule is to tune with a single restart, a fixed number of steps ([…]), and a fixed initial step-size ([…]), with decays at 50% and 75% of the total number of steps.
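A schedule with decays at 50% and 75% of the step budget can be written as a small helper. This is a sketch under stated assumptions: the name `step_size` is hypothetical, and the initial value and decay factor used in the example are placeholders, not values taken from the paper.

```python
def step_size(k, total_steps, alpha0, decay_factor, milestones=(0.5, 0.75)):
    """Piecewise-constant step size.

    alpha0 is divided by decay_factor once for every milestone
    (a fraction of total_steps) that step k has reached.
    """
    passed = sum(1 for frac in milestones if k >= frac * total_steps)
    return alpha0 / (decay_factor ** passed)

# Placeholder values chosen only to make the arithmetic easy to follow.
schedule = [step_size(k, total_steps=8, alpha0=1.0, decay_factor=2.0)
            for k in range(8)]
# Steps 0-3 use 1.0, steps 4-5 use 0.5, steps 6-7 use 0.25.
```

Inside a PGD loop, the fixed `alpha` would be replaced by `step_size(k, ...)` evaluated at each iteration.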

3. Surrogate Loss Selection

The surrogate loss defines what is maximized during attack iterations. Common surrogates and their properties are:

  • Cross-Entropy (CE):

$$\hat{L}_{\mathrm{CE}}(f(x+\delta), y) = -\log \frac{\exp(f_y(x+\delta))}{\sum_{j} \exp(f_j(x+\delta))}$$

Smooth, often easier to optimize in early stages.

  • Margin Loss:

$$\hat{L}_{\mathrm{margin}}(f(x+\delta), y) = \max_{j \neq y} f_j(x+\delta) - f_y(x+\delta)$$

Maximizes the leading runner-up logit relative to the logit of the correct label; produces sharper adversarial boundaries.

  • Carlini–Wagner (CW) loss:

[…]

Allows tradeoff between confidence gap (margin) and optimization step effort.

Empirically, margin and CE are comparable on most models. A tuned CW loss may, however, surpass both when its parameter is appropriately calibrated (Gowal et al., 2019).
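The surrogates can be compared numerically on a single logit vector. The sketch below uses the standard textbook forms of cross-entropy and margin loss; the `capped_margin` function is only an assumed CW-style variant (margin capped at a confidence level `kappa`) and may differ from the exact form used in the paper. All function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, y):
    """Negative log-softmax probability of the true class y."""
    z = logits - logits.max()  # shift for numerical stability
    return -(z[y] - np.log(np.exp(z).sum()))

def margin(logits, y):
    """Largest wrong-class logit minus the true-class logit.

    Positive exactly when the input is misclassified.
    """
    return np.delete(logits, y).max() - logits[y]

def capped_margin(logits, y, kappa):
    """Assumed CW-style surrogate: the margin, capped at kappa.

    Once the margin exceeds kappa the loss is flat, so the attack
    stops pushing an already confident misclassification further.
    """
    return min(margin(logits, y), kappa)

clean = np.array([2.0, 1.0, 0.0])     # true class 0 wins by 1.0
attacked = np.array([0.0, 3.0, 1.0])  # class 1 wins by 3.0
print(cross_entropy(clean, 0), margin(clean, 0), capped_margin(attacked, 0, 0.5))
```

An attack maximizes any of these quantities over the perturbation; they differ mainly in how the gradient behaves near and beyond the decision boundary.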

4. MultiTargeted Surrogate and Algorithm

The MultiTargeted procedure generalizes PGD by explicitly attacking individual target classes via their logit differences:

  • Definition: For each target class $t \neq y$, the surrogate is the logit difference between the target class and the true class:

$$\hat{L}_t(f(x+\delta), y) = f_t(x+\delta) - f_y(x+\delta)$$

  • Algorithm Sketch:
  1. Enumerate the target classes (all of them or, e.g., only the top few by unperturbed logit).
  2. For each target class, run a fixed number of Adam-PGD steps maximizing that class's logit-difference surrogate; retain the perturbation that yields the greatest misclassification.
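The two steps above can be sketched on a linear classifier, where the gradient of each logit difference is available in closed form. This is an illustrative simplification: it uses plain sign steps in place of Adam-PGD to stay short, assumes an L-infinity ball, and the function name `multitargeted` and its parameters are hypothetical.

```python
import numpy as np

def multitargeted(W, b, x, y, eps, alpha=0.1, steps=20, targets=None):
    """One attack run per target class on the linear model z = W @ x + b.

    Returns the perturbation with the largest final margin
    max_{j != y} z_j - z_y, together with that margin.
    """
    num_classes = W.shape[0]
    if targets is None:
        targets = [t for t in range(num_classes) if t != y]
    best_delta, best_margin = np.zeros_like(x), -np.inf
    for t in targets:
        delta = np.zeros_like(x)
        for _ in range(steps):
            # For a linear model, the gradient of z_t - z_y is W[t] - W[y].
            grad = W[t] - W[y]
            delta = np.clip(delta + alpha * np.sign(grad), -eps, eps)
        z = W @ (x + delta) + b
        m = np.delete(z, y).max() - z[y]
        if m > best_margin:
            best_delta, best_margin = delta, m
    return best_delta, best_margin

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
b = np.zeros(3)
x = np.array([1.0, 0.0])  # clean logits are [1, 0, -1], so the true class is 0
delta, best = multitargeted(W, b, x, y=0, eps=0.5)
```

Here the run targeting class 1 ends with a margin of 0 and the run targeting class 2 with a margin of -0.5, so the class-1 perturbation is retained.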

Theoretical guarantee (Theorem 3.2, Gowal et al., 2019): For any model $f$ that is locally linear on a convex set $S$, using one restart per alternative class (every class other than the true label $y$), MultiTargeted attains a global maximizer of convex surrogates within $S$. The proof observes that maximizing each target's logit difference is equivalent to exploring one half-space of the logit polytope, so iterating over all alternative classes covers the solution space.
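The intuition can be checked by hand in the simplest setting. The derivation below is an illustration, not the paper's proof: it assumes a fully linear model $f(x) = Wx + b$ with rows $w_j$, and takes $S$ to be the L-infinity ball of radius $\varepsilon$. The symbols $W$, $b$, $w_j$, and $\varepsilon$ are introduced here for the example only.

```latex
% Per-target problem: a linear objective over a box has a closed-form maximizer.
\max_{\|\delta\|_\infty \le \varepsilon}
  \bigl[f_t(x+\delta) - f_y(x+\delta)\bigr]
  = f_t(x) - f_y(x) + \varepsilon \,\|w_t - w_y\|_1,
\qquad
\delta_t^{\star} = \varepsilon \,\operatorname{sign}(w_t - w_y).

% The two maximizations commute, so the best per-target solution
% is a global maximizer of the margin loss.
\max_{\|\delta\|_\infty \le \varepsilon} \; \max_{t \ne y}
  \bigl[f_t(x+\delta) - f_y(x+\delta)\bigr]
  = \max_{t \ne y}
  \Bigl[f_t(x) - f_y(x) + \varepsilon \,\|w_t - w_y\|_1\Bigr].
```

Each inner problem is solved exactly by a single run, which is why one run per alternative class is enough in this setting. A single untargeted run, by contrast, follows only the currently leading class and can end at a corner that is not the global maximizer.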

In practice, across four WideResNet models on CIFAR-10, MultiTargeted ([…]) consistently lowers robust accuracy by […]–[…] points compared to standard PGD ([…]) for all […].

5. Empirical Results on MNIST and CIFAR-10

The effect of tuning PGD hyperparameters and adopting MultiTargeted testing is evident in benchmark results:

  • MNIST ([…]), MadryLab model:
    • PGD […] (tuned): […] accuracy under attack.
    • MultiTargeted […]: […].
    • PGD + MultiTargeted (combined): […] (lowest).
    • IntervalAttack (best prior): […].
  • CIFAR-10 ([…]), MadryLab model:
    • PGD […]: […].
    • MultiTargeted […]: […].
    • PGD + MultiTargeted (combined): […].
    • FABAttack (best prior): […].

Table 5 in Gowal et al. (2019) provides leaderboard comparisons; MultiTargeted obtained first rank for both datasets, and for the TRADES model ([…] accuracy at […]).

6. Practitioner Recommendations

Empirical findings in (Gowal et al., 2019) yield best-practice guidelines for configuring PGD and MultiTargeted adversarial testing:

  • Optimizer: Use Adam ([…], […], […]) for maximal stability.
  • Step-size: Start with […], decay by […] at 50% and 75% of the total steps; for pure sign methods use […] without decay.
  • Surrogate loss: Prefer margin loss or CE; margin can yield marginal improvements if computational budget allows.
  • Restarts vs Steps: First optimize the hyperparameters (step-size $\alpha$, optimizer, loss) with a single restart; then increase the number of restarts. On smooth losses (CIFAR-10 WideResNet), MultiTargeted suffices with fewer restarts; on non-smooth losses (MNIST), more restarts are advantageous.
  • MultiTargeted: For models with […] or those that are locally linear (adversarially trained), MultiTargeted with […]–[…] is optimal. Otherwise, use the full […] but reduce inner iterations to keep the total attack budget constant.
  • Baseline defaults: MNIST: PGD […] with […] decay ([…]). CIFAR-10: MultiTargeted (MT) […] ([…]).
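Keeping the total attack budget constant while varying the number of targets is a matter of simple arithmetic. The helper below is a sketch; the name `inner_steps` and the budget figures in the example are hypothetical and are not the paper's settings.

```python
def inner_steps(total_gradient_evals, num_targets, restarts_per_target=1):
    """Steps per run so that targets * restarts * steps fits a fixed budget."""
    runs = num_targets * restarts_per_target
    if runs <= 0:
        raise ValueError("need at least one run")
    return total_gradient_evals // runs

# With a budget of 1000 gradient evaluations per example:
steps_all_targets = inner_steps(1000, num_targets=9)  # 9 runs of 111 steps
steps_top_three = inner_steps(1000, num_targets=3)    # 3 runs of 333 steps
```

Fewer targets leave more steps per run, which is the trade-off the MultiTargeted recommendation above describes.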

In all experiments, the reported robust accuracy is only an upper bound on the model's true robustness. It should therefore be validated by combining PGD, MultiTargeted, and increased restarts, as no single method universally saturates robust error.
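Combining attacks amounts to counting an example as robust only if every attack failed on it. The following is a minimal sketch with made-up outcomes; the function name `combined_robust_accuracy` and the example data are hypothetical.

```python
import numpy as np

def combined_robust_accuracy(fooled_by_attack):
    """Robust accuracy under the union of several attacks.

    fooled_by_attack maps an attack name to a boolean array that is True
    where that attack produced a misclassification. An example counts as
    robust only if no attack fooled it.
    """
    stacked = np.stack([np.asarray(v, dtype=bool) for v in fooled_by_attack.values()])
    fooled_by_any = stacked.any(axis=0)
    return 1.0 - fooled_by_any.mean()

# Made-up outcomes on four examples.
outcomes = {
    "pgd": [True, False, False, False],
    "multitargeted": [False, True, False, False],
}
acc = combined_robust_accuracy(outcomes)  # each attack alone gives 0.75, combined 0.5
```

The combined figure is never higher than the figure from any single attack, which is why it is the tighter estimate.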

7. Significance and Theoretical Implications

The identification and rigorous benchmarking of PGD hyperparameter effects have materially advanced adversarial robustness evaluation. Adam optimizer and $\alpha$-decay scheduling, used in conjunction with convex surrogate losses and MultiTargeted logic, define the current empirical frontier in white-box attack design. The guarantee that MultiTargeted with one restart per alternative class globally maximizes convex surrogates under local linearity places the method on a firm theoretical foundation. This suggests that for modern adversarially trained models, strategic multiplicity in attack targets, optimizer adaptivity, and calibrated decay schedules are critical for accurate robustness estimation (Gowal et al., 2019).

References (1)
