PGD Hyperparameter Insights

Updated 15 April 2026

The paper demonstrates that tuning hyperparameters, for example using Adam and step-size decay, can significantly improve PGD attack performance and stability.
The analysis reveals that optimizer selection, step-size scheduling, and surrogate loss choices distinctly influence PGD's convergence and robustness.
A multi-targeted approach, which maximizes convex surrogates across classes, offers a theoretical basis for enhanced adversarial testing efficacy.

Projected Gradient Descent (PGD) is the canonical method for white-box adversarial testing under norm-bounded perturbations. The attack’s performance, convergence, and stability are tightly governed by the interplay of its hyperparameters: optimizer, step-size schedule, surrogate loss, and advanced constructions like MultiTargeted surrogates. Each hyperparameter modulates attack success and efficiency, and their configuration has led to state-of-the-art adversarial test results against robust models such as those of MadryLab and TRADES (Gowal et al., 2019).

1. Optimizer Choice: SGD‐sign, Momentum, Adam

Gradient updates in PGD can be computed using three principal optimizers, each imparting distinct characteristics to the attack trajectory:

SGD-sign (FGSM$^K$): Utilizes discrete sign updates:

$\delta_{k+1} = \operatorname{Proj}_{S} \left( \delta_{k} + \alpha \cdot \text{sign}\left(\nabla_{x} \hat{L}(f(x+\delta_k), y)\right) \right)$

This approach is computationally cheap and tuning is straightforward when network activations are unsaturated. However, it displays high sensitivity to the step-size $\alpha$ and can exhibit oscillatory or plateaued trajectories, especially in regions of the loss surface with flat gradients.
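The sign update above can be sketched in a few lines. This is a minimal illustration, assuming $S$ is an $\ell_\infty$ ball of radius `eps` around the input and ignoring any clipping to a valid pixel range; `grad_fn`, `project_linf`, and `pgd_sign` are hypothetical names, and the constant-gradient toy problem is chosen only so the result is easy to follow.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def project_linf(delta, eps):
    """Project a perturbation onto the l-infinity ball of radius eps (assumed S)."""
    return [max(-eps, min(eps, d)) for d in delta]


def pgd_sign(x, grad_fn, eps, alpha, steps):
    """Sign-based PGD: delta <- Proj_S(delta + alpha * sign(grad))."""
    delta = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        g = grad_fn(x_adv)  # gradient of the surrogate loss w.r.t. the input
        stepped = [d + alpha * sign(gi) for d, gi in zip(delta, g)]
        delta = project_linf(stepped, eps)
    return delta


# Toy surrogate with a constant gradient w (hypothetical, for illustration only).
w = [0.5, -2.0, 0.0]
delta = pgd_sign([0.1, 0.2, 0.3], lambda x_adv: w, eps=0.3, alpha=0.1, steps=10)
```

Every coordinate with a nonzero gradient moves by exactly `alpha` per step regardless of gradient magnitude, which is why the result is so sensitive to the choice of $\alpha$.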

Momentum: Incorporates an exponential moving average of past gradients:

$m_{k} = \beta\,m_{k-1} + (1-\beta)\nabla_{x} \hat{L}(f(x+\delta_k), y)$

$\delta_{k+1} = \operatorname{Proj}_{S} (\delta_{k} + \alpha \cdot \text{sign}(m_k))$

This can aid in traversing shallow loss basins and produces smoother update directions, but it demands additional memory and remains sensitive to $\alpha$ scheduling.
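A small sketch can show how the moving average smooths a noisy gradient. It assumes an $\ell_\infty$ constraint set and a momentum coefficient `beta=0.9`; both are illustrative choices, not values taken from the paper, and the alternating gradient sequence is a made-up toy.

```python
from itertools import cycle


def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def pgd_momentum_sign(x, grad_fn, eps, alpha, steps, beta=0.9):
    """PGD whose step direction is the sign of an exponential moving average m."""
    delta = [0.0] * len(x)
    m = [0.0] * len(x)
    for _ in range(steps):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        g = grad_fn(x_adv)
        # m_k = beta * m_{k-1} + (1 - beta) * g_k
        m = [beta * mi + (1.0 - beta) * gi for mi, gi in zip(m, g)]
        # Step along sign(m_k), then project onto the assumed l-infinity ball.
        delta = [max(-eps, min(eps, d + alpha * sign(mi)))
                 for d, mi in zip(delta, m)]
    return delta


# Toy gradient that flips sign every other step (hypothetical).
noisy = cycle([[1.0], [-0.5]])
delta = pgd_momentum_sign([0.0], lambda x_adv: next(noisy),
                          eps=0.3, alpha=0.05, steps=4)
```

On this sequence the moving average stays positive, so the perturbation advances by `alpha` on all four steps. A pure sign step on the same gradients would alternate +0.05 and -0.05 and end where it started.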

Adam: Employs both first and second moment estimates of the gradient $g_k = \nabla_{x} \hat{L}(f(x+\delta_k), y)$:

$m_k = \beta_1 m_{k-1} + (1-\beta_1)g_k;\quad v_k = \beta_2 v_{k-1} + (1-\beta_2)g_k^2$

$\hat{m}_k = m_k / (1-\beta_1^{k});\quad \hat{v}_k = v_k / (1-\beta_2^{k})$

$\delta_{k+1} = \operatorname{Proj}_{S}\left(\delta_{k} + \alpha \cdot \frac{\hat{m}_k}{\sqrt{\hat{v}_k}+\epsilon}\right)$

Adam is empirically the most stable across architectures and step-sizes, requiring minimal manual tuning. The modest computation overhead is amortized by a substantial reduction in restarts and attack iterations.
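One way to see why Adam needs less tuning is that its normalized step has roughly the same size for large and small gradients. The sketch below assumes an $\ell_\infty$ constraint set and uses commonly seen Adam constants (`beta1=0.9`, `beta2=0.999`, `tiny=1e-8`) purely as placeholders. Here `tiny` plays the role of Adam's stabilizing constant and is named differently to avoid confusion with the perturbation radius `eps`.

```python
import math


def pgd_adam(x, grad_fn, eps, alpha, steps, beta1=0.9, beta2=0.999, tiny=1e-8):
    """PGD with Adam-style moment estimates and bias correction."""
    n = len(x)
    delta, m, v = [0.0] * n, [0.0] * n, [0.0] * n
    for k in range(1, steps + 1):
        x_adv = [xi + di for xi, di in zip(x, delta)]
        g = grad_fn(x_adv)
        m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1.0 - beta1 ** k) for mi in m]  # bias-corrected first moment
        v_hat = [vi / (1.0 - beta2 ** k) for vi in v]  # bias-corrected second moment
        stepped = [d + alpha * mh / (math.sqrt(vh) + tiny)
                   for d, mh, vh in zip(delta, m_hat, v_hat)]
        # Projection onto the assumed l-infinity ball of radius eps.
        delta = [max(-eps, min(eps, s)) for s in stepped]
    return delta


# Two coordinates whose gradients differ by four orders of magnitude (toy values).
delta = pgd_adam([0.0, 0.0], lambda x_adv: [100.0, -0.01],
                 eps=0.3, alpha=0.01, steps=5)
```

Both coordinates move by about `alpha` per step despite the very different gradient scales, so a single step size serves both.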

In summary, Adam yields the best out-of-the-box stability and attack strength; momentum is marginally less effective but more robust than sign-based variants, which are fast but demonstrably brittle (Gowal et al., 2019).

Optimizer	Pros	Cons
SGD-sign	Cheap; simple tuning	Highly sensitive to $\alpha$
Momentum	Smoother trajectory; escapes basins	Needs memory; $\alpha$-sensitive
Adam	Stable; less manual tuning	Slightly higher compute

2. Step-Size ($\alpha$) Scheduling

Step-size $\alpha$ controls the effective resolution of each gradient step. While a fixed $\alpha$ (as in canonical FGSM$^K$) can yield superficial success, it quickly saturates. The use of a scheduled decay for $\alpha$ significantly enhances performance, especially on datasets like CIFAR-10.

The recommended decay regime is as follows:

Start from an initial step size $\alpha_0$
Reduce $\alpha$ by a fixed factor at $0.5K$ and again at $0.75K$ (where $K$ is the total number of PGD steps)
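A piecewise-constant schedule with two drops at fixed fractions of the step budget can be sketched as follows. The function name `step_size`, the 0-indexed step counter, and the default `factor=0.1` are assumptions made for illustration; the milestone fractions 0.5 and 0.75 follow the 50% and 75% decay points mentioned in this section.

```python
def step_size(k, total_steps, alpha0, factor=0.1, milestones=(0.5, 0.75)):
    """Step size at 0-indexed step k, multiplied by `factor` at each milestone."""
    alpha = alpha0
    for frac in milestones:
        if k >= frac * total_steps:
            alpha *= factor  # one multiplicative drop per milestone passed
    return alpha


# Schedule over a hypothetical budget of 100 steps.
schedule = [step_size(k, 100, alpha0=0.1) for k in range(100)]
```

The returned value can be passed as `alpha` to any of the update rules above, so the schedule is independent of the optimizer choice.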

This doubling of success rate versus fixed $\alpha$ is quantitatively illustrated in Figure 3b of (Gowal et al., 2019). In regions where the surrogate loss $\hat{L}$ is $\ell$-smooth, an informal bound shows that $\alpha \le 1/\ell$ ensures ascent, but due to local variability in $\ell$, decaying $\alpha$ empirically prevents overshoot and loss surface trapping.
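The ascent condition can be made concrete with the standard smoothness inequality. As an assumption for this illustration, write $\ell$ for a smoothness constant of the surrogate $\hat{L}$ as a function of $\delta$, and consider a plain unprojected gradient step with $g = \nabla_{\delta} \hat{L}(\delta)$ rather than the sign or Adam updates:

$\hat{L}(\delta + \alpha g) \ge \hat{L}(\delta) + \alpha \|g\|_2^2 - \frac{\ell \alpha^2}{2} \|g\|_2^2 = \hat{L}(\delta) + \alpha \left(1 - \frac{\ell \alpha}{2}\right) \|g\|_2^2$

The right-hand side exceeds $\hat{L}(\delta)$ whenever $0 < \alpha < 2/\ell$, and for $\alpha \le 1/\ell$ the gain is at least $\frac{\alpha}{2}\|g\|_2^2$. Since $\ell$ varies across the loss surface, a step size that is safe in one region may overshoot in another, which is consistent with the empirical benefit of decaying $\alpha$.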

A practical rule is to tune with a single restart, a fixed step budget $K$, and a fixed initial step size $\alpha_0$, with decays at 50% and 75% of $K$.

3. Surrogate Loss Selection

The surrogate loss defines what is maximized during attack iterations. Common surrogates and their properties are:

Cross-Entropy (CE):

$\hat{L}_{\mathrm{CE}}(z, y) = -\log \frac{e^{z_y}}{\sum_{j} e^{z_j}}, \quad z = f(x+\delta)$

Smooth, often easier to optimize in early stages.

Margin Loss:

$\hat{L}_{\mathrm{margin}}(z, y) = \max_{j \neq y} z_j - z_y, \quad z = f(x+\delta)$

Maximizes the leading runner-up logit relative to the correct label's logit; produces sharper adversarial boundaries.

Carlini–Wagner (CW $\kappa$-Loss):

$\hat{L}_{\mathrm{CW}}(z, y) = \min\left(\max_{j \neq y} z_j - z_y,\ \kappa\right), \quad z = f(x+\delta)$

Allows a tradeoff between the confidence gap (margin) and optimization effort.

Empirically, margin and CE are comparable on most models. Tuned CW (varying $\kappa$) may, however, surpass both when appropriately calibrated (Gowal et al., 2019).
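The three surrogates can be written directly on a logit vector. This is a sketch under stated assumptions: each function returns a quantity to be maximized by the attack, `z` is a list of logits, `y` is the true class index, and the CW variant is taken to be the margin capped at a confidence parameter `kappa` (one common formulation, assumed here rather than quoted from the paper).

```python
import math


def cross_entropy(z, y):
    """Cross-entropy of logits z against label y, via a stable log-sum-exp."""
    top = max(z)
    log_sum_exp = top + math.log(sum(math.exp(v - top) for v in z))
    return log_sum_exp - z[y]


def margin(z, y):
    """Largest wrong-class logit minus the true-class logit."""
    return max(v for j, v in enumerate(z) if j != y) - z[y]


def cw(z, y, kappa):
    """Margin capped at kappa (assumed CW-style surrogate)."""
    return min(margin(z, y), kappa)


correct = [2.0, 1.0, -1.0]  # class 0 wins: margin is negative
fooled = [0.0, 3.0, 1.0]    # class 1 wins: margin is positive
values = (margin(correct, 0), margin(fooled, 0), cw(fooled, 0, kappa=0.5))
```

A positive margin means the input is misclassified. The cap in `cw` stops rewarding the attack once the wrong class leads by `kappa`.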

4. MultiTargeted Surrogate and Algorithm

The MultiTargeted procedure generalizes PGD by explicitly attacking individual target classes via their logit differences:

Definition: For each target class $t \neq y$, with logits $z = f(x+\delta)$,

$\hat{L}_{t}(z, y) = z_t - z_y$

Algorithm Sketch:

Enumerate $T$ target classes (all or, e.g., top-$T$ by unperturbed logit).
For each target $t$, run $K$ Adam-PGD steps maximizing $\hat{L}_t$; retain the perturbation $\delta$ that yields the greatest misclassification.
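The loop over targets can be sketched on a toy linear classifier, where the gradient of a logit difference is available in closed form. Several simplifications are assumed: the model is $z = Wx + b$, the constraint set is an $\ell_\infty$ ball, the inner loop uses plain sign steps instead of Adam to keep the example short, and "greatest misclassification" is measured by the margin $\max_{j \neq y} z_j - z_y$. All names are hypothetical.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)


def logits(W, b, x):
    """Logits of a linear classifier z = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def multitargeted(W, b, x, y, eps, alpha, steps, targets=None):
    """Run one targeted attack per class t != y and keep the best perturbation."""
    if targets is None:
        targets = [t for t in range(len(W)) if t != y]
    best_delta, best_margin = [0.0] * len(x), float("-inf")
    for t in targets:
        # For a linear model, the input gradient of (z_t - z_y) is W[t] - W[y].
        g = [wt - wy for wt, wy in zip(W[t], W[y])]
        delta = [0.0] * len(x)
        for _ in range(steps):
            delta = [max(-eps, min(eps, d + alpha * sign(gi)))
                     for d, gi in zip(delta, g)]
        z = logits(W, b, [xi + di for xi, di in zip(x, delta)])
        m = max(v for j, v in enumerate(z) if j != y) - z[y]
        if m > best_margin:
            best_margin, best_delta = m, delta
    return best_delta, best_margin


W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
best_delta, best_margin = multitargeted(W, b, x=[1.0, 0.2], y=0,
                                        eps=0.5, alpha=0.1, steps=10)
```

In this toy case only the attack toward class 1 crosses the decision boundary; the attack toward class 2 ends with a negative margin. Passing a shorter `targets` list corresponds to restricting the attack to the top-ranked classes.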

Theoretical guarantee (Theorem 3.2, (Gowal et al., 2019)): For any locally linear $f$ on convex set $S$ with $C$ output logits, using $C-1$ restarts (one per $t \neq y$), MultiTargeted attains a global maximizer of convex surrogates within $S$. The proof observes that maximizing $z_t - z_y$ is equivalent to exploring each half-space of the logit polytope, so iterating over all $C-1$ alternatives covers the solution space.
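For intuition, consider the special case in which $f$ is exactly linear, $z = Wx + b$ with rows $w_j$, and $S$ is an $\ell_\infty$ ball of radius $\epsilon$. Both are assumptions made for this illustration, not the theorem's general setting. Each per-target problem then has a closed-form solution:

$\max_{\|\delta\|_\infty \le \epsilon} \left(z_t - z_y\right)(x+\delta) = (w_t - w_y)^\top x + (b_t - b_y) + \epsilon \|w_t - w_y\|_1, \quad \delta_t^\star = \epsilon \operatorname{sign}(w_t - w_y)$

Taking the maximum of these values over all $t \neq y$ gives the worst-case margin, so one targeted run per alternative class is enough to find the strongest perturbation in this setting.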

On four WideResNet models on CIFAR-10, MultiTargeted consistently lowers robust accuracy relative to standard PGD; the per-model gaps are reported in (Gowal et al., 2019).

5. Empirical Results on MNIST and CIFAR-10

The effect of tuning PGD hyperparameters and adopting MultiTargeted testing is evident in benchmark results:

MNIST, MadryLab model (methods compared by accuracy under attack):
- PGD (tuned)
- MultiTargeted
- PGD + MultiTargeted (combined): lowest accuracy under attack
- IntervalAttack (best prior)
CIFAR-10, MadryLab model (methods compared by accuracy under attack):
- PGD
- MultiTargeted
- PGD + MultiTargeted (combined)
- FABAttack (best prior)

Table 5 in (Gowal et al., 2019) provides leaderboard comparisons; MultiTargeted obtained first rank for both datasets, and for the TRADES model.

6. Practitioner Recommendations

Empirical findings in (Gowal et al., 2019) yield best-practice guidelines for configuring PGD and MultiTargeted adversarial testing:

Optimizer: Use Adam (with its $\beta_1$, $\beta_2$, and $\epsilon$ settings) for maximal stability.
Step-size: Start from an initial step size $\alpha_0$ and decay it by a fixed factor at $0.5K$ and $0.75K$; for pure sign methods use a fixed $\alpha$ without decay.
Surrogate loss: Prefer margin loss or CE; margin loss can yield slight improvements if computational budget allows.
Restarts vs Steps: First optimize hyperparameters ($\alpha$, optimizer, loss) with a single restart; then increase the number of restarts. On smooth losses (CIFAR-10 WideResNet), MultiTargeted suffices with fewer restarts; on non-smooth losses (MNIST), more restarts are advantageous.
MultiTargeted: Depending on the number of classes $C$ and on whether the model is locally linear (as adversarially trained models tend to be), either restrict the attack to a small number of targets $T$ or use the full $T = C - 1$; in the latter case, reduce inner iterations to keep the total attack budget constant.
Baseline defaults: MNIST: PGD with step-size decay. CIFAR-10: MultiTargeted (MT).
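Keeping the total attack budget constant while adding targets amounts to simple integer arithmetic. The helper below is a hypothetical sketch that counts budget in gradient evaluations per input; the numbers in the example are made up for illustration.

```python
def inner_steps(total_budget, num_targets, restarts=1):
    """Steps per targeted run so that targets * restarts * steps <= total_budget."""
    return max(1, total_budget // (num_targets * restarts))


# A 10-class model attacked on all 9 alternative classes with 1800 evaluations.
steps_full = inner_steps(1800, num_targets=9)
# The same budget restricted to the top 3 targets leaves more steps per run.
steps_top3 = inner_steps(1800, num_targets=3)
```

With a fixed budget, covering more targets leaves fewer steps for each run, which is the tradeoff the MultiTargeted recommendation above describes.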

In all experiments, the lower bound on robust error should be validated by combining PGD, MultiTargeted, and increased restarts, as no single method universally saturates robust error.

7. Significance and Theoretical Implications

The identification and rigorous benchmarking of PGD hyperparameter effects have materially advanced adversarial robustness evaluation. Adam optimizer and $\alpha$-decay scheduling, used in conjunction with convex surrogate losses and MultiTargeted logic, define the current empirical frontier in white-box attack design. The guarantee that MultiTargeted with $C-1$ restarts globally maximizes convex surrogates under local linearity places the method on a firm theoretical foundation. This suggests that for modern adversarially trained models, strategic multiplicity in attack targets, optimizer adaptivity, and calibrated decay schedules are critical for accurate robustness estimation (Gowal et al., 2019).


References (1)

Gowal et al. (2019). An Alternative Surrogate Loss for PGD-based Adversarial Testing.
