The paper demonstrates that tuning PGD hyperparameters, such as using Adam and decaying the step size, can significantly improve attack performance and stability.
The analysis reveals that optimizer selection, step-size scheduling, and surrogate loss choices distinctly influence PGD's convergence and robustness.
A multi-targeted approach, which maximizes convex surrogates across classes, offers a theoretical basis for enhanced adversarial testing efficacy.
Projected Gradient Descent (PGD) is the canonical method for white-box adversarial testing under norm-bounded perturbations. The attack’s performance, convergence, and stability are tightly governed by the interplay of its hyperparameters: optimizer, step-size schedule, surrogate loss, and advanced constructions like MultiTargeted surrogates. Each hyperparameter modulates attack success and efficiency, and their configuration has led to state-of-the-art adversarial test results against robust models such as those of MadryLab and TRADES (Gowal et al., 2019).
1. Optimizer Choice: SGD‐sign, Momentum, Adam
Gradient updates in PGD can be computed using three principal optimizers, each imparting distinct characteristics to the attack trajectory:
SGD-sign (FGSM^K): Utilizes discrete sign updates:
$$\delta_{k+1} = \mathrm{Proj}_S\big(\delta_k + \alpha \cdot \mathrm{sign}(\nabla_x \hat{L}(f(x+\delta_k), y))\big)$$
This approach is computationally cheap and tuning is straightforward when network activations are unsaturated. However, it displays high sensitivity to the step-size α and can exhibit oscillatory or plateaued trajectories, especially in regions of the loss surface with flat gradients.
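A minimal sketch of the sign update, assuming the feasible set S is an L∞ ball of radius `eps` (the text only says "norm-bounded") and that the gradient is supplied by the caller. The helper names `sign`, `project_linf`, and `sign_pgd_step` are hypothetical and chosen only for illustration.

```python
def sign(v):
    # Returns -1, 0, or 1 depending on the sign of v.
    return (v > 0) - (v < 0)

def project_linf(delta, eps):
    # Projection onto the L-infinity ball of radius eps (assumed choice of S).
    return [max(-eps, min(eps, d)) for d in delta]

def sign_pgd_step(delta, grad, alpha, eps):
    # One ascent step: move each coordinate by alpha in the gradient's sign direction,
    # then project back onto the feasible set.
    stepped = [d + alpha * sign(g) for d, g in zip(delta, grad)]
    return project_linf(stepped, eps)

# Toy usage with a fixed, made-up gradient.
delta = [0.0, 0.0, 0.0]
grad = [0.5, -2.0, 0.0]
for _ in range(5):
    delta = sign_pgd_step(delta, grad, alpha=0.1, eps=0.3)
print(delta)  # [0.3, -0.3, 0.0]
```

Because only the sign of the gradient is used, the magnitude of each move is set entirely by α, which is one way to see why the method is so sensitive to the step size.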
Momentum: Incorporates an exponential moving average of past gradients:
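$$m_k = \beta m_{k-1} + (1-\beta)\,\nabla_x \hat{L}(f(x+\delta_k), y)$$
$$\delta_{k+1} = \mathrm{Proj}_S\big(\delta_k + \alpha \cdot \mathrm{sign}(m_k)\big)$$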
This can aid in traversing shallow loss basins and produces smoother gradients but demands additional memory and remains sensitive to α scheduling.
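The smoothing effect can be seen in a small sketch, again assuming an L∞ ball for S. The function name `momentum_pgd_step` and the oscillating toy gradients are illustrative assumptions, not taken from the paper.

```python
def momentum_pgd_step(delta, m, grad, alpha, beta, eps):
    # Exponential moving average of gradients.
    m = [beta * mi + (1 - beta) * g for mi, g in zip(m, grad)]
    # Sign step on the averaged gradient, followed by L-infinity projection (assumed S).
    delta = [
        max(-eps, min(eps, d + alpha * ((mi > 0) - (mi < 0))))
        for d, mi in zip(delta, m)
    ]
    return delta, m

# Toy usage: the raw gradient flips sign every step, but its running average stays positive.
delta, m = [0.0], [0.0]
for g in [1.0, -0.5, 1.0, -0.5]:
    delta, m = momentum_pgd_step(delta, m, [g], alpha=0.05, beta=0.9, eps=0.3)
print(delta, m)
```

A plain sign update would have stepped back and forth on this sequence; with the moving average, all four steps go in the same direction.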
Adam: Employs both first and second moment estimates:
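$$m_k = \beta_1 m_{k-1} + (1-\beta_1)\,g_k; \quad v_k = \beta_2 v_{k-1} + (1-\beta_2)\,g_k^2$$
$$\hat{m}_k = m_k / (1-\beta_1^k); \quad \hat{v}_k = v_k / (1-\beta_2^k)$$
$$\delta_{k+1} = \mathrm{Proj}_S\Big(\delta_k + \alpha \cdot \frac{\hat{m}_k}{\sqrt{\hat{v}_k} + \epsilon}\Big)$$
where $g_k = \nabla_x \hat{L}(f(x+\delta_k), y)$ and $k$ counts iterations from 1.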
Adam is empirically the most stable across architectures and step-sizes, requiring minimal manual tuning. The modest computation overhead is amortized by a substantial reduction in restarts and attack iterations.
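The Adam update can be sketched as follows, assuming an L∞ ball for S. The default values `beta1=0.9`, `beta2=0.999`, and `eps_adam=1e-8` are the commonly used Adam defaults and are an assumption here, not settings confirmed by the text; `adam_pgd_step` is a hypothetical name.

```python
import math

def adam_pgd_step(delta, m, v, grad, t, alpha, eps_ball,
                  beta1=0.9, beta2=0.999, eps_adam=1e-8):
    # First and second moment estimates (t is the 1-based iteration index).
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias-corrected moments.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Ascent step scaled per coordinate, then L-infinity projection (assumed S).
    stepped = [
        d + alpha * mh / (math.sqrt(vh) + eps_adam)
        for d, mh, vh in zip(delta, m_hat, v_hat)
    ]
    delta = [max(-eps_ball, min(eps_ball, s)) for s in stepped]
    return delta, m, v

# Toy usage: gradients of very different magnitude produce steps of similar size.
delta, m, v = adam_pgd_step([0.0, 0.0], [0.0, 0.0], [0.0, 0.0],
                            grad=[4.0, -0.01], t=1, alpha=0.1, eps_ball=1.0)
print(delta)
```

The per-coordinate normalization is what makes the step roughly independent of gradient scale, which is a plausible reading of why Adam needs less manual tuning of α.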
Summarizing, Adam yields the best out-of-the-box stability and attack strength; momentum is marginally less effective but more robust than sign-based variants, which—while fast—are demonstrably brittle (Gowal et al., 2019).
| Optimizer | Pros | Cons |
| --- | --- | --- |
| SGD-sign | Cheap; simple tuning | Highly sensitive to α |
| Momentum | Smoother trajectory; escapes basins | Needs memory; α-sensitive |
| Adam | Stable; less manual tuning | Slightly higher compute |
2. Step-Size (α) Scheduling
The step size α controls the effective resolution of each gradient step. While a fixed α (as in canonical FGSM^K) can yield superficial success, it quickly saturates. Using a scheduled decay for α significantly enhances performance, especially on datasets like CIFAR-10.
The recommended decay regime is as follows:
Start from an initial step size α.
Reduce α by a fixed factor at 50% and again at 75% of the total number of PGD steps.
The resulting doubling of the success rate relative to a fixed α is illustrated in Figure 3b of (Gowal et al., 2019). In regions where the surrogate loss is smooth, an informal bound shows that a sufficiently small α, on the order of the inverse of the smoothness constant, ensures ascent. Because the local smoothness varies along the trajectory, decaying α empirically prevents overshoot and trapping on the loss surface.
A practical rule is to tune with a single restart, a fixed number of steps, and a fixed initial α, with decays at 50% and 75% of the total number of steps.
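A piecewise-constant schedule of this shape can be sketched as below. The breakpoints at 50% and 75% follow the text; the initial value `alpha0=0.1` and `decay_factor=10.0` are placeholder assumptions for illustration, not values taken from the paper, and `step_size` is a hypothetical helper.

```python
def step_size(k, total_steps, alpha0, decay_factor=10.0):
    # k is the 0-based iteration index; decay_factor is an assumed placeholder.
    alpha = alpha0
    if k >= total_steps // 2:          # first decay at 50% of the steps
        alpha /= decay_factor
    if k >= (3 * total_steps) // 4:    # second decay at 75% of the steps
        alpha /= decay_factor
    return alpha

# Toy usage: inspect the schedule at a few iterations of a 100-step attack.
for k in (0, 49, 50, 74, 75, 99):
    print(k, step_size(k, total_steps=100, alpha0=0.1))
```

The returned value would be passed as `alpha` to whichever update rule is in use, so the schedule stays independent of the optimizer choice.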
3. Surrogate Loss Selection
The surrogate loss defines what is maximized during attack iterations. Common surrogates and their properties are:
Cross-Entropy (CE):
$$\hat{L}_{\mathrm{CE}}(f(x+\delta), y) = -\log \frac{\exp(f(x+\delta)_y)}{\sum_j \exp(f(x+\delta)_j)}$$
Smooth, often easier to optimize in early stages.
Margin Loss:
$$\hat{L}_{\mathrm{margin}}(f(x+\delta), y) = \max_{j \neq y} f(x+\delta)_j - f(x+\delta)_y$$
Maximizes the leading runner-up logit relative to correct label; produces sharper adversarial boundaries.
Carlini-Wagner (CW): Allows a tradeoff between the confidence gap (margin) and optimization effort.
Empirically, margin and CE are comparable on most models. A tuned CW loss (varying its confidence parameter) may, however, surpass both when appropriately calibrated (Gowal et al., 2019).
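To make the two main surrogates concrete, here is a small sketch that evaluates them on a vector of logits. It uses the standard definitions of softmax cross-entropy and of the margin between the best wrong logit and the true logit; the function names `cross_entropy` and `margin` are illustrative.

```python
import math

def cross_entropy(logits, y):
    # Negative log-softmax of the true class, computed with the log-sum-exp trick.
    mx = max(logits)
    lse = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return lse - logits[y]

def margin(logits, y):
    # Highest logit among the wrong classes minus the true-class logit.
    # A positive value means the input is misclassified.
    runner_up = max(z for j, z in enumerate(logits) if j != y)
    return runner_up - logits[y]

# Toy usage on made-up logits with true class 0.
print(margin([2.0, 1.0, 0.5], 0))   # -1.0 (still correctly classified)
print(margin([1.0, 3.0, 0.5], 0))   # 2.0 (misclassified)
print(cross_entropy([0.0, 0.0], 0))
```

Both quantities increase as the attack succeeds, but only the margin has a sign that directly indicates misclassification.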
4. MultiTargeted Surrogate and Algorithm
The MultiTargeted procedure generalizes PGD by explicitly attacking individual target classes via their logit differences:
Definition: For each target class t ≠ y, the surrogate is the logit difference between the target class and the true class:
$$\hat{L}_t(f(x+\delta), y) = f(x+\delta)_t - f(x+\delta)_y$$
For each target class t, run a fixed number of Adam-PGD steps maximizing this logit difference; retain the perturbation δ that yields the greatest misclassification.
Theoretical guarantee (Theorem 3.2, (Gowal et al., 2019)): For any locally linear f on a convex set S, using one restart per target class, MultiTargeted attains a global maximizer of convex surrogates within S. The proof observes that maximizing the logit difference for a given target is equivalent to exploring one half-space of the logit polytope, so iterating over all alternative classes covers the solution space.
In practice, across four WideResNet models on CIFAR-10, MultiTargeted consistently yields lower robust accuracy than standard PGD (Gowal et al., 2019).
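The MultiTargeted loop can be sketched on a toy linear classifier, where the gradient of each logit difference is constant. This is only an illustration under stated assumptions: the inner optimizer is a plain sign-gradient step rather than Adam (swapped in to keep the example short), S is taken to be an L∞ ball, and the weights, input, and names `logits` and `multi_targeted` are all made up.

```python
def logits(W, b, x):
    # Linear classifier: one logit per row of W.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def multi_targeted(W, b, x, y, eps, steps=20, alpha=0.05):
    best_delta, best_gap, best_target = [0.0] * len(x), float("-inf"), None
    for t in range(len(W)):
        if t == y:
            continue
        # Gradient of (z_t - z_y) with respect to x; constant for a linear model.
        grad = [wt - wy for wt, wy in zip(W[t], W[y])]
        delta = [0.0] * len(x)
        for _ in range(steps):
            # Sign-gradient ascent step with L-infinity projection (assumed S).
            delta = [
                max(-eps, min(eps, d + alpha * ((g > 0) - (g < 0))))
                for d, g in zip(delta, grad)
            ]
        z = logits(W, b, [xi + di for xi, di in zip(x, delta)])
        gap = z[t] - z[y]
        if gap > best_gap:  # keep the target with the largest logit difference
            best_delta, best_gap, best_target = delta, gap, t
    return best_delta, best_gap, best_target

# Toy usage: three classes, two input features, true class 0.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
delta, gap, target = multi_targeted(W, b, x=[1.0, 0.2], y=0, eps=0.5)
print(delta, gap, target)
```

In this toy case the best target is class 1 and the final gap is positive, so the perturbed input is misclassified; targeting class 2 alone would not have succeeded within the same budget.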
5. Empirical Results on MNIST and CIFAR-10
The effect of tuning PGD hyperparameters and adopting MultiTargeted testing is evident in benchmark results:
Table 5 in (Gowal et al., 2019) provides leaderboard comparisons; MultiTargeted obtained first rank for both datasets and for the TRADES model.
6. Practitioner Recommendations
Empirical findings in (Gowal et al., 2019) yield best-practice guidelines for configuring PGD and MultiTargeted adversarial testing:
Optimizer: Use Adam for maximal stability.
Step-size: Start from the tuned initial α and decay it at 50% and 75% of the total number of steps; for pure sign methods use a fixed α without decay.
Surrogate loss: Prefer margin loss or CE; margin can yield marginal improvements if computational budget allows.
Restarts vs Steps: First optimize hyperparameters (step size α, optimizer, loss) with a single restart; then increase the number of restarts. On smooth losses (CIFAR-10 WResNet), MultiTargeted suffices with fewer restarts; on non-smooth losses (MNIST), more restarts are advantageous.
MultiTargeted: For locally linear (adversarially trained) models, MultiTargeted over a reduced set of target classes is effective. Otherwise, use the full set of target classes but reduce inner iterations to keep the total attack budget constant.
In all experiments, the robustness lower bound should be validated by combining PGD, MultiTargeted, and increased restarts as no single method universally saturates robust error.
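One way to read this recommendation is as a per-example worst case over attacks: an example counts as robust only if every attack fails on it. The sketch below assumes each attack has already been run and reports a boolean per example; the function name `combined_robust_accuracy` and the toy results are hypothetical.

```python
def combined_robust_accuracy(per_attack_success):
    # per_attack_success maps an attack name to a list of booleans,
    # where True means that attack found an adversarial example.
    runs = list(per_attack_success.values())
    n = len(runs[0])
    # An example is broken if at least one attack succeeded on it.
    broken = [any(run[i] for run in runs) for i in range(n)]
    return 1.0 - sum(broken) / n

# Toy usage with made-up outcomes on four examples.
results = {
    "pgd": [True, False, False, False],
    "multitargeted": [False, True, False, False],
}
print(combined_robust_accuracy(results))  # 0.5
```

Each attack alone would report 75% robust accuracy here, while the combination reports 50%, which illustrates why no single method should be trusted to saturate robust error.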
7. Significance and Theoretical Implications
The identification and rigorous benchmarking of PGD hyperparameter effects have materially advanced adversarial robustness evaluation. The Adam optimizer and step-size decay scheduling, used in conjunction with convex surrogate losses and MultiTargeted logic, define the current empirical frontier in white-box attack design. The guarantee that MultiTargeted with one restart per target class globally maximizes convex surrogates under local linearity places the method on a firm theoretical foundation. This suggests that for modern adversarially trained models, strategic multiplicity in attack targets, optimizer adaptivity, and calibrated decay schedules are critical for accurate robustness estimation (Gowal et al., 2019).