PGD and GCG Attacks Overview

Updated 5 November 2025
  • PGD and GCG attacks are iterative adversarial methods that use gradient-based updates to generate inputs, with PGD operating in continuous domains and GCG in discrete token spaces.
  • PGD attacks use gradient ascent combined with norm-bound projection for images, while GCG employs greedy, coordinate-wise updates for language models.
  • Both attack families feature advanced variants that improve efficiency, transferability, and robustness evaluation across vision and language applications.

Projected Gradient Descent (PGD) and Greedy Coordinate Gradient (GCG) attacks represent two of the most influential iterative optimization methodologies for constructing adversarial inputs to machine learning models, notably for deep image models and LLMs. Both attack families leverage first-order information but are tailored to fundamentally different domains, objective structures, and optimization constraints.

1. Definitional Distinctions and Core Mechanisms

PGD attacks are iterative white-box adversarial methods targeting continuous inputs (e.g., images). Each step performs gradient ascent in input space with respect to a loss surrogate, followed by projection onto a norm-constrained set, most classically the \ell_\infty or \ell_2 ball:

x^{t+1} = \Pi_{\mathcal{S}(x)}\left( x^t + \alpha \, \operatorname{sign}(\nabla_x \mathcal{L}(x^t, y)) \right)

where \mathcal{L} is a selected surrogate loss (e.g., cross-entropy, margin loss) and \Pi_{\mathcal{S}(x)} projects onto a ball around x.
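Below is a minimal PyTorch sketch of this update for an \ell_\infty threat model. It assumes a differentiable classifier model over images scaled to [0, 1]; the eps, alpha, and steps values are illustrative defaults rather than values prescribed by any of the cited papers.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L_inf PGD: signed gradient ascent with per-step projection onto the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)        # surrogate loss L
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # projection onto the eps-ball around x
            x_adv = x_adv.clamp(0.0, 1.0)              # stay inside the valid image range
    return x_adv.detach()
```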

GCG attacks (Greedy Coordinate Gradient) are discrete, token-space adversarial attacks, typically for text or LLM prompts. GCG iteratively updates a tokenized "suffix" (or adversarial input fragment) by greedily selecting, for each token position, the replacement that most increases a differentiable adversarial objective, typically via a proxy gradient in embedding space:

S^* = \arg\max_S P_\theta(\mathrm{affirmative\ response} \mid \mathrm{prompt} + S)

For LLMs, this is implemented as a sequence of coordinate-wise greedy updates, sampling or evaluating the top candidate tokens per position according to the gradient with respect to the output probability.
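A schematic sketch of a single GCG iteration follows. It assumes a hypothetical helper target_loss(one_hot) that embeds the one-hot suffix (e.g., one_hot @ embedding_matrix), appends it to the fixed prompt, and returns the negative log-likelihood of the target completion; the candidate count k and batch_size are illustrative, and the full algorithm includes details (batched candidate evaluation, multiple restarts) omitted here.

```python
import torch
import torch.nn.functional as F

def gcg_step(one_hot, target_loss, k=256, batch_size=128):
    """One greedy coordinate-gradient step over an adversarial suffix.

    one_hot:     (suffix_len, vocab_size) one-hot encoding of the current suffix.
    target_loss: callable returning the differentiable loss of the target completion.
    """
    one_hot = one_hot.clone().detach().requires_grad_(True)
    loss = target_loss(one_hot)
    grad = torch.autograd.grad(loss, one_hot)[0]             # (suffix_len, vocab_size)

    # Candidate replacements per position: the most negative gradient coordinates
    # give the largest first-order decrease in the loss.
    top_candidates = (-grad).topk(k, dim=1).indices          # (suffix_len, k)

    suffix = one_hot.argmax(dim=1)                           # current token ids
    best_suffix, best_loss = suffix, loss.detach()
    for _ in range(batch_size):
        pos = torch.randint(suffix.numel(), (1,)).item()     # random position to mutate
        tok = top_candidates[pos, torch.randint(k, (1,)).item()]
        cand = suffix.clone()
        cand[pos] = tok                                      # single-token swap
        cand_one_hot = F.one_hot(cand, one_hot.size(1)).float()
        with torch.no_grad():
            cand_loss = target_loss(cand_one_hot)
        if cand_loss < best_loss:
            best_suffix, best_loss = cand, cand_loss         # keep the best swap found
    return best_suffix, best_loss
```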

While PGD operates over continuous vector spaces and norm-bounded sets, GCG optimizes in discrete token spaces—necessitating coordinate-wise or heuristic relaxations for gradient-based optimization.

2. Algorithmic Families and Extensions

2.1 PGD Variants and Theoretical Developments

  • Standard PGD: Sign-based, constant step size, projection per step (Waghela et al., 20 Aug 2024).
  • Raw Gradient Descent (RGD): Uses the full gradient magnitude without the sign operator, optimizing a hidden (unconstrained) state; avoids per-step projection (only projecting final outputs), yielding stronger attacks and more transferable perturbations (Yang et al., 2023); a minimal sketch appears after this list.
  • Primal-Dual PGD (PDPGD): Optimizes both perturbation and Lagrangian multipliers for the original l_p norm minimization, supporting arbitrary l_p norms via proximal operators (Matyasko et al., 2021).
  • Low-Rank PGD (LoRa-PGD): Parameterizes the perturbation as \delta X = U \otimes_C V (explicit rank constraint), yielding attacks close to or outperforming full-rank PGD with substantially reduced memory and comparable computational cost, especially when using the nuclear norm as a budget (Savostianova et al., 16 Oct 2024).
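As referenced above, a minimal sketch of the RGD idea: the attack optimizes an unconstrained hidden state with the raw (full-magnitude) gradient and projects onto the \ell_\infty ball only once, at the end. The step size and loop structure here are illustrative; the cited paper's exact parameterization of the hidden state may differ.

```python
import torch
import torch.nn.functional as F

def rgd_attack(model, x, y, eps=8/255, lr=0.01, steps=10):
    """Raw Gradient Descent: unconstrained raw-gradient updates, single final projection."""
    w = x.clone().detach()                         # hidden (unconstrained) state
    for _ in range(steps):
        w.requires_grad_(True)
        loss = F.cross_entropy(model(w), y)
        grad = torch.autograd.grad(loss, w)[0]
        with torch.no_grad():
            w = w + lr * grad                      # raw gradient magnitude, no sign()
    with torch.no_grad():
        x_adv = x + (w - x).clamp(-eps, eps)       # project only the final output
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```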

Table: PGD Algorithmic Extensions

| Variant | Domain | Update Mechanism | Projection | Use-case |
| --- | --- | --- | --- | --- |
| Standard PGD | \mathbb{R}^n | Sign(\nabla) + projection | Per step | Robustness, adversarial training |
| RGD | \mathbb{R}^n | Raw \nabla (unconstrained state) | Output only | Strong, transferable attacks |
| PDPGD | \mathbb{R}^n | Primal (prox-grad), dual ascent | Varies | Norm-minimizing, l_0/l_1 threat |
| LoRa-PGD | \mathbb{R}^n | Low-rank param. gradients | Per step (in latent) | Memory-efficient, large images |

All variants outperform or supersede vanilla PGD for specific regimes (robustness evaluation, transferability, spectral trade-offs, optimization cost).

2.2 GCG and Discrete-Jailbreak Attack Advances

  • Standard GCG: Greedy per-token update using loss gradients in token or embedding space, batch-sampling of replacements, targeting fixed affirmative completions for initial feasibility (Li et al., 20 Oct 2024).
  • Faster-GCG: Introduces distance-regularized gradients (embedding proximity penalty), deterministic greedy sampling, deduplication of explored suffixes, and a Carlini-Wagner-style loss (more effective than the negative log-likelihood of a fixed prefix). This reduces computational cost by a factor of roughly 5–10 and improves attack success rates by up to 29 percentage points over baseline GCG (Li et al., 20 Oct 2024); a sketch of the CW-style loss appears after this list.
  • T-GCG (Annealing-Augmented): Incorporates stochastic annealing during search (temperature-based sampling) to escape local minima, marginally improving attack diversity at the cost of diminishing gains at large model scales (Tan et al., 30 Aug 2025).
  • CoT-GCG: Replaces affirmative targets with chain-of-thought prompts, triggering multi-step reasoning modes that defeat refusal heuristics and increase attack transferability, especially to high-alignment or guarded LLMs (Su, 29 Oct 2024).
  • REINFORCE-GCG/PGD: Replaces static objectives with a distributional, semantic reward (harmfulness as judged by an LLM), using REINFORCE policy gradients to maximize probability of any harmful output. This doubles attack success rates over static objectives, even against advanced alignment (circuit breaker) defenses (Geisler et al., 24 Feb 2025).
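As noted in the Faster-GCG entry above, a minimal sketch of a Carlini-Wagner-style token-level loss. This is an illustrative formulation (the margin kappa and the exact reduction may differ from the cited paper): minimizing it pushes each target token's logit above every competing token's logit.

```python
import torch

def cw_target_loss(logits, target_ids, kappa=0.0):
    """CW-style loss over a target completion.

    logits:     (target_len, vocab_size) model logits at the target positions.
    target_ids: (target_len,) token ids of the desired completion.
    """
    target_logit = logits.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, target_ids.unsqueeze(1), float("-inf"))   # exclude the target token
    best_other = masked.max(dim=1).values                        # strongest competitor per position
    return torch.clamp(best_other - target_logit + kappa, min=0.0).sum()
```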

Table: Recent GCG Attack Innovations

| Variant | Core Mechanism | Key Innovation | Efficiency | Transferability |
| --- | --- | --- | --- | --- |
| GCG | Greedy coordinate update | Cross-entropy loss | Moderate | Baseline |
| Faster-GCG | Distance-regularized gradient | CW loss, deduplication | High | Improved |
| T-GCG | Annealing (stochastic updates) | Diversity escape | Moderate | Slightly higher |
| CoT-GCG | CoT targets, not affirmatives | Reasoning trigger | Slightly higher | Much improved |
| REINFORCE-GCG | Policy gradient (semantic) | Model-adaptive, distributional | Lower | Robust to defenses |

3. Loss Objectives, Surrogate Selection, and Robustness Evaluation

For PGD, the choice of surrogate loss directly influences attack strength and the accuracy of robustness evaluation. The paper on alternating objectives demonstrates non-monotonicity and no universal optimality among Cross-Entropy (CE), Carlini-Wagner (CW), and Difference of Logits Ratio (DLR) losses (Antoniou et al., 2022). Alternating between objectives (e.g., CE \to CW \to DLR) yields consistently stronger attacks across architectures, minimizes overestimation of robust accuracy, and matches or outperforms all white-box AutoAttack components under equal computational budgets.
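A sketch of how objective alternation can be wired into a PGD loop is shown below. The CW and DLR forms follow their standard definitions; the specific alternation schedule (per iteration, per restart) used in the cited work may differ, and the alternating_loss helper is a hypothetical convenience intended to replace plain cross-entropy in a loop such as the PGD sketch in Section 1.

```python
import torch
import torch.nn.functional as F

def ce_loss(logits, y):
    return F.cross_entropy(logits, y)

def cw_loss(logits, y):
    """Margin between the strongest non-true logit and the true-class logit."""
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, y.unsqueeze(1), float("-inf"))
    return (masked.max(dim=1).values - true).mean()

def dlr_loss(logits, y):
    """Difference-of-Logits-Ratio: a scale-invariant margin loss."""
    sorted_logits, _ = logits.sort(dim=1, descending=True)
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    other = torch.where(sorted_logits[:, 0] == true,
                        sorted_logits[:, 1], sorted_logits[:, 0])   # best non-true logit
    return ((other - true) / (sorted_logits[:, 0] - sorted_logits[:, 2] + 1e-12)).mean()

OBJECTIVES = [ce_loss, cw_loss, dlr_loss]

def alternating_loss(logits, y, step):
    """Cycle CE -> CW -> DLR across attack iterations instead of fixing one objective."""
    return OBJECTIVES[step % len(OBJECTIVES)](logits, y)
```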

For GCG, loss function improvements (e.g., moving from cross-entropy on static prefix targets to a CW loss or an adaptive, semantic REINFORCE reward) yield dramatic gains in attack realism and success, exposing the limitations of static, non-adaptive attack evaluation (Geisler et al., 24 Feb 2025).

4. Domain-Specific Extensions and Applications

4.1 Vision

  • Image Classification: PGD, with and without randomization (e.g., WITCHcraft), has been essential for benchmarking and adversarial training (Chiang et al., 2019, Gowal et al., 2019). Advanced instantiations (low-rank, proximal, alternating objectives) reduce computation, enhance coverage, and better match real-world attack efficiency (Savostianova et al., 16 Oct 2024, Antoniou et al., 2022).
  • Image Segmentation: Targeted PGD achieves highly precise attacks, successfully diverting segmentation outputs to attacker-chosen masks—even with minimal, imperceptible perturbations—outperforming other segmentation-specific attacks (ASMA) especially in multiclass, complex output spaces (Vo et al., 2022).
  • Detection and Forensics: PGD-like attacks leave strong, characteristic traces, especially via increased local linearity in the network's response (captured by Adversarial Response Characteristics or ARC, and the Sequel Attack Effect), allowing robust detection of such attacks even with minimal data and without auxiliary networks (Zhou et al., 2022).

4.2 LLMs

  • Jailbreak Attacks: Both discrete (GCG and follow-up variants) and continuous-relaxation (PGD on the prompt simplex) attacks achieve high attack success, with continuous PGD delivering equivalent attack rates an order of magnitude faster, which is critical for scalable adversarial evaluation and training (Geisler et al., 14 Feb 2024, Li et al., 20 Oct 2024); a simplex-projection sketch follows this list.
  • Prompt Injection and Exfiltration: GCG suffixes are implicated in successful cross-prompt injection attacks (XPIA), increasing successful data exfiltration by up to 20% on medium-alignment models, while the effect fades for high-robustness models (e.g., GPT-4o) (Valbuena, 1 Aug 2024).
  • Increased Transferability: Hybrid techniques (CoT-GCG, semantic REINFORCE-GCG/PGD) substantially boost adversarial success rates, especially on robust, reasoning-centric tasks and when employing rigorous, semantic evaluation metrics (e.g., Llama Guard, GPT-4o judge) (Su, 29 Oct 2024, Geisler et al., 24 Feb 2025, Tan et al., 30 Aug 2025).
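As referenced above, continuous-relaxation prompt attacks treat each token as a point on the probability simplex, take gradient steps, and project back after every step. Below is a sketch of the standard sorting-based Euclidean simplex projection; the cited work additionally applies an entropy projection, which is not shown here.

```python
import torch

def project_simplex(v):
    """Project each row of v onto the probability simplex {w : w >= 0, sum(w) = 1}.

    Keeps every relaxed token distribution valid after a gradient step in a
    continuous-relaxation prompt attack.
    """
    n = v.size(1)
    u, _ = v.sort(dim=1, descending=True)                  # sort rows in decreasing order
    cssv = u.cumsum(dim=1) - 1.0
    j = torch.arange(1, n + 1, device=v.device, dtype=v.dtype)
    cond = u * j > cssv                                    # positions still above the threshold
    rho = (cond * j).max(dim=1, keepdim=True).values       # largest index satisfying the condition
    theta = cssv.gather(1, rho.long() - 1) / rho           # per-row shift
    return torch.clamp(v - theta, min=0.0)
```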

5. Practical Implications, Evaluation Pitfalls, and Defense Considerations

  • Efficiency and Scalability: Randomized and low-rank PGD variants (e.g., WITCHcraft, LoRa-PGD) enable strong attacks or adversarial example generation at lower computational and memory costs—a requisite for adversarial training on high-dimensional input spaces and low-latency deployment (Chiang et al., 2019, Savostianova et al., 16 Oct 2024).
  • Benchmark Integrity: Overestimation of robustness is common when only naive surrogate losses or heuristic output checks ("no apology prefix") are used (Antoniou et al., 2022, Tan et al., 30 Aug 2025). Attack success must be reported under semantically rigorous metrics, ideally using independent, high-performing judges (e.g., GPT-4o, Llama Guard).
  • Resistance and Limitations: Advanced PGD and GCG variants (REINFORCE, T-GCG) expose vulnerabilities even in robust or circuit breaker-equipped models, but attack success rates markedly decrease as model size and alignment enforcement scale increase (Tan et al., 30 Aug 2025).
  • Backdoor Resilience: PGD-based adversarial training is robust against norm-bounded test-time manipulation but does not block backdoor attacks (poisoned training). However, the induced feature clustering can be exploited for post-hoc backdoor detection—an emergent advantage of robust training (Soremekun et al., 2020).

6. Open Directions and Controversies

  • Gradient Sign vs. Magnitude: It was previously conventional to use only the sign of the gradient for \ell_\infty attacks (PGD, FGSM), but recent analysis demonstrates that, with correct hidden state optimization (RGD), full-magnitude raw gradients yield stronger, more transferable and more authentic adversarial examples (Yang et al., 2023).
  • Objective Diversification: There is no single surrogate loss universally optimal for all architectures or defense regimes. Alternating or ensemble approaches are superior, but the specifics of alternation (scheduling, combination) remain an area of ongoing research (Antoniou et al., 2022).
  • Discrete Optimization vs. Continuous Relaxation in Language: Continuous-relaxation-based PGD for prompt space, with careful entropy control, bridges the efficiency gap to discrete token optimization, providing scalable, strong, and domain-agnostic adversarial evaluation for text models (Geisler et al., 14 Feb 2024).
  • Universal and Transferable Attacks: While some GCG variants exhibit high transferability between open- and closed-source models, effectiveness depends on architecture, alignment, and prompt class; no universal attack has yet achieved consistently high ASR against state-of-the-art closed-source LLMs (Li et al., 20 Oct 2024, Tan et al., 30 Aug 2025).

7. Summary Table: PGD vs. GCG Attack Families

| Dimension | PGD | GCG |
| --- | --- | --- |
| Input domain | Continuous (e.g., image) | Discrete (tokens, text prompts) |
| Update rule | Gradient ascent + projection | Greedy coordinate gradient (token by token) |
| Projection | Onto norm-ball, per step or final | Vocabulary set; discrete simplex |
| Typical constraint | \ell_\infty, \ell_2, nuclear norm | Length, syntax, token budget |
| Efficiency | Fast, can be batched | Expensive (unless optimized) |
| Adaptations | Randomized step (WITCHcraft), low-rank, primal-dual, RGD | Distance regularization, annealing, CoT targets, REINFORCE |
| Applications | Vision, control, audio, robustness eval | LLM jailbreaking, data exfiltration, reasoning, prompt integrity |
| Defenses | Adversarial training, spectral defenses | Input filtering, output guard models (Llama Guard), alignment |
| Forensics | ARC/SAE traces, clustering in robust models | Currently no direct trace analog |
| Limitations | Gradient masking, non-differentiable models, black-box | Discretization, scalability to large models, robustness to advanced alignments |

PGD and GCG attacks constitute complementary yet distinct optimization-based paradigms for adversarial evaluation. Their evolution—including algorithmic refinements, domain-specific enhancements, and the development of sophisticated evaluation methodologies—has both deepened the understanding of model vulnerability and complicated the benchmarking of robust and aligned systems. As model complexity increases and new defense paradigms emerge, continual refinement of both attack and evaluation strategies remains essential to accurately measure, and ultimately improve, real-world AI robustness.
