PGD and GCG Attacks Overview

Updated 5 November 2025
  • PGD and GCG attacks are iterative adversarial methods that use gradient-based updates to generate inputs, with PGD operating in continuous domains and GCG in discrete token spaces.
  • PGD attacks use gradient ascent combined with norm-bound projection for images, while GCG employs greedy, coordinate-wise updates for language models.
  • Both attack families feature advanced variants that improve efficiency, transferability, and robustness evaluation across vision and language applications.

Projected Gradient Descent (PGD) and Greedy Coordinate Gradient (GCG) attacks represent two of the most influential iterative optimization methodologies for constructing adversarial inputs to machine learning models, notably for deep image models and LLMs. Both attack families leverage first-order information but are tailored to fundamentally different domains, objective structures, and optimization constraints.

1. Definitional Distinctions and Core Mechanisms

PGD attacks are iterative white-box adversarial methods targeting continuous inputs (e.g., images). Each step performs gradient ascent in input space with respect to a loss surrogate, followed by projection onto a norm-constrained set, most classically the \ell_\infty or \ell_2 ball:

x^{t+1} = \Pi_{\mathcal{S}(x)}\left( x^t + \alpha \, \operatorname{sign}(\nabla_x \mathcal{L}(x^t, y)) \right)

where \mathcal{L} is a selected surrogate loss (e.g., cross-entropy, margin loss) and \Pi_{\mathcal{S}(x)} projects onto a ball around x.
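Below is a minimal PyTorch sketch of this update for an \ell_\infty threat model. It assumes a differentiable classifier model over images scaled to [0, 1]; the eps, alpha, and steps values are illustrative defaults rather than values prescribed by any of the cited papers.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """L_inf PGD: signed gradient ascent with per-step projection onto the eps-ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)        # surrogate loss L
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()        # ascent step
            x_adv = x + (x_adv - x).clamp(-eps, eps)   # projection onto the eps-ball around x
            x_adv = x_adv.clamp(0.0, 1.0)              # stay inside the valid image range
    return x_adv.detach()
```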

GCG attacks (Greedy Coordinate Gradient) are discrete, token-space adversarial attacks, typically for text or LLM prompts. GCG iteratively updates a tokenized "suffix" (or adversarial input fragment) by greedily selecting, for each token position, the replacement that most increases a differentiable adversarial objective, typically via a proxy gradient in embedding space:

S^* = \arg\max_S P_\theta(\mathrm{affirmative\ response} \mid \mathrm{prompt} + S)

For LLMs, this is implemented as a sequence of coordinate-wise greedy updates, sampling or evaluating the top candidate tokens per position according to the gradient with respect to the output probability.
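A schematic sketch of a single GCG iteration follows. It assumes a hypothetical helper target_loss(one_hot) that embeds the one-hot suffix (e.g., one_hot @ embedding_matrix), appends it to the fixed prompt, and returns the negative log-likelihood of the target completion; the candidate count k and batch_size are illustrative, and the full algorithm includes details (batched candidate evaluation, multiple restarts) omitted here.

```python
import torch
import torch.nn.functional as F

def gcg_step(one_hot, target_loss, k=256, batch_size=128):
    """One greedy coordinate-gradient step over an adversarial suffix.

    one_hot:     (suffix_len, vocab_size) one-hot encoding of the current suffix.
    target_loss: callable returning the differentiable loss of the target completion.
    """
    one_hot = one_hot.clone().detach().requires_grad_(True)
    loss = target_loss(one_hot)
    grad = torch.autograd.grad(loss, one_hot)[0]             # (suffix_len, vocab_size)

    # Candidate replacements per position: the most negative gradient coordinates
    # give the largest first-order decrease in the loss.
    top_candidates = (-grad).topk(k, dim=1).indices          # (suffix_len, k)

    suffix = one_hot.argmax(dim=1)                           # current token ids
    best_suffix, best_loss = suffix, loss.detach()
    for _ in range(batch_size):
        pos = torch.randint(suffix.numel(), (1,)).item()     # random position to mutate
        tok = top_candidates[pos, torch.randint(k, (1,)).item()]
        cand = suffix.clone()
        cand[pos] = tok                                      # single-token swap
        cand_one_hot = F.one_hot(cand, one_hot.size(1)).float()
        with torch.no_grad():
            cand_loss = target_loss(cand_one_hot)
        if cand_loss < best_loss:
            best_suffix, best_loss = cand, cand_loss         # keep the best swap found
    return best_suffix, best_loss
```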

While PGD operates over continuous vector spaces and norm-bounded sets, GCG optimizes in discrete token spaces—necessitating coordinate-wise or heuristic relaxations for gradient-based optimization.

2. Algorithmic Families and Extensions

2.1 PGD Variants and Theoretical Developments

  • Standard PGD: Sign-based, constant step size, projection per step (Waghela et al., 20 Aug 2024).
  • Raw Gradient Descent (RGD): Uses the full gradient magnitude without the sign operator, optimizing a hidden (unconstrained) state; avoids per-step projection (only projecting final outputs), yielding stronger attacks and more transferable perturbations (Yang et al., 2023); a minimal sketch appears after this list.
  • Primal-Dual PGD (PDPGD): Optimizes both perturbation and Lagrangian multipliers for the original l_p norm minimization, supporting arbitrary l_p norms via proximal operators (Matyasko et al., 2021).
  • Low-Rank PGD (LoRa-PGD): Parameterizes the perturbation as \delta X = U \otimes_C V (explicit rank constraint), yielding attacks close to or outperforming full-rank PGD with substantially reduced memory and comparable computational cost, especially when using the nuclear norm as a budget (Savostianova et al., 16 Oct 2024).
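As referenced above, a minimal sketch of the RGD idea: the attack optimizes an unconstrained hidden state with the raw (full-magnitude) gradient and projects onto the \ell_\infty ball only once, at the end. The step size and loop structure here are illustrative; the cited paper's exact parameterization of the hidden state may differ.

```python
import torch
import torch.nn.functional as F

def rgd_attack(model, x, y, eps=8/255, lr=0.01, steps=10):
    """Raw Gradient Descent: unconstrained raw-gradient updates, single final projection."""
    w = x.clone().detach()                         # hidden (unconstrained) state
    for _ in range(steps):
        w.requires_grad_(True)
        loss = F.cross_entropy(model(w), y)
        grad = torch.autograd.grad(loss, w)[0]
        with torch.no_grad():
            w = w + lr * grad                      # raw gradient magnitude, no sign()
    with torch.no_grad():
        x_adv = x + (w - x).clamp(-eps, eps)       # project only the final output
        x_adv = x_adv.clamp(0.0, 1.0)
    return x_adv.detach()
```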

Table: PGD Algorithmic Extensions

| Variant | Domain | Update Mechanism | Projection | Use-case |
| --- | --- | --- | --- | --- |
| Standard PGD | \mathbb{R}^n | Sign(\nabla) + projection | Per step | Robustness, adversarial training |
| RGD | \mathbb{R}^n | Raw \nabla (unconstrained state) | Output only | Strong, transferable attacks |
| PDPGD | \mathbb{R}^n | Primal (prox-grad), dual ascent | Varies | Norm-minimizing, l_0/l_1 threat |
| LoRa-PGD | \mathbb{R}^n | Low-rank param. gradients | Per step (in latent) | Memory-efficient, large images |

All variants outperform or supersede vanilla PGD for specific regimes (robustness evaluation, transferability, spectral trade-offs, optimization cost).

2.2 GCG and Discrete-Jailbreak Attack Advances

  • Standard GCG: Greedy per-token update using loss gradients in token or embedding space, batch-sampling of replacements, targeting fixed affirmative completions for initial feasibility (Li et al., 20 Oct 2024).
  • Faster-GCG: Introduces distance-regularized gradients (embedding proximity penalty), deterministic greedy sampling, deduplication of explored suffixes, and a Carlini-Wagner-style loss (more effective than the negative log-likelihood of a fixed prefix). This reduces computational cost by a factor of roughly 5–10 and improves attack success rates by up to 29 percentage points over baseline GCG (Li et al., 20 Oct 2024); a sketch of the CW-style loss appears after this list.
  • T-GCG (Annealing-Augmented): Incorporates stochastic annealing during search (temperature-based sampling) to escape local minima, marginally improving attack diversity at the cost of diminishing gains at large model scales (Tan et al., 30 Aug 2025).
  • CoT-GCG: Replaces affirmative targets with chain-of-thought prompts, triggering multi-step reasoning modes that defeat refusal heuristics and increase attack transferability, especially to high-alignment or guarded LLMs (Su, 29 Oct 2024).
  • REINFORCE-GCG/PGD: Replaces static objectives with a distributional, semantic reward (harmfulness as judged by an LLM), using REINFORCE policy gradients to maximize probability of any harmful output. This doubles attack success rates over static objectives, even against advanced alignment (circuit breaker) defenses (Geisler et al., 24 Feb 2025).
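As noted in the Faster-GCG entry above, a minimal sketch of a Carlini-Wagner-style token-level loss. This is an illustrative formulation (the margin kappa and the exact reduction may differ from the cited paper): minimizing it pushes each target token's logit above every competing token's logit.

```python
import torch

def cw_target_loss(logits, target_ids, kappa=0.0):
    """CW-style loss over a target completion.

    logits:     (target_len, vocab_size) model logits at the target positions.
    target_ids: (target_len,) token ids of the desired completion.
    """
    target_logit = logits.gather(1, target_ids.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, target_ids.unsqueeze(1), float("-inf"))   # exclude the target token
    best_other = masked.max(dim=1).values                        # strongest competitor per position
    return torch.clamp(best_other - target_logit + kappa, min=0.0).sum()
```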

Table: Recent GCG Attack Innovations

| Variant | Core Mechanism | Key Innovation | Efficiency | Transferability |
| --- | --- | --- | --- | --- |
| GCG | Greedy coordinate update | Cross-entropy loss | Moderate | Baseline |
| Faster-GCG | Distance-regularized gradient | CW loss, deduplication | High | Improved |
| T-GCG | Annealing (stochastic updates) | Diversity escape | Moderate | Slightly higher |
| CoT-GCG | CoT targets, not affirmatives | Reasoning trigger | Slightly higher | Much improved |
| REINFORCE-GCG | Policy gradient (semantic) | Model-adaptive, distributional | Lower | Robust to defenses |

3. Loss Objectives, Surrogate Selection, and Robustness Evaluation

For PGD, the choice of surrogate loss directly influences attack strength and the accuracy of robustness evaluation. The paper on alternating objectives demonstrates non-monotonicity and no universal optimality among Cross-Entropy (CE), Carlini-Wagner (CW), and Difference of Logits Ratio (DLR) losses (Antoniou et al., 2022). Alternating between objectives (e.g., CE \to CW \to DLR) yields consistently stronger attacks across architectures, minimizes overestimation of robust accuracy, and matches or outperforms all white-box AutoAttack components under equal computational budgets.
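A sketch of how objective alternation can be wired into a PGD loop is shown below. The CW and DLR forms follow their standard definitions; the specific alternation schedule (per iteration, per restart) used in the cited work may differ, and the alternating_loss helper is a hypothetical convenience intended to replace plain cross-entropy in a loop such as the PGD sketch in Section 1.

```python
import torch
import torch.nn.functional as F

def ce_loss(logits, y):
    return F.cross_entropy(logits, y)

def cw_loss(logits, y):
    """Margin between the strongest non-true logit and the true-class logit."""
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    masked = logits.clone()
    masked.scatter_(1, y.unsqueeze(1), float("-inf"))
    return (masked.max(dim=1).values - true).mean()

def dlr_loss(logits, y):
    """Difference-of-Logits-Ratio: a scale-invariant margin loss."""
    sorted_logits, _ = logits.sort(dim=1, descending=True)
    true = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    other = torch.where(sorted_logits[:, 0] == true,
                        sorted_logits[:, 1], sorted_logits[:, 0])   # best non-true logit
    return ((other - true) / (sorted_logits[:, 0] - sorted_logits[:, 2] + 1e-12)).mean()

OBJECTIVES = [ce_loss, cw_loss, dlr_loss]

def alternating_loss(logits, y, step):
    """Cycle CE -> CW -> DLR across attack iterations instead of fixing one objective."""
    return OBJECTIVES[step % len(OBJECTIVES)](logits, y)
```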

For GCG, loss function improvements (e.g., moving from cross-entropy on static prefix targets to a CW loss or an adaptive, semantic REINFORCE reward) yield dramatic gains in attack realism and success, exposing the limitations of static, non-adaptive attack evaluation (Geisler et al., 24 Feb 2025).

4. Domain-Specific Extensions and Applications

4.1 Vision

  • Image Classification: PGD, with and without randomization (e.g., WITCHcraft), has been essential for benchmarking and adversarial training (Chiang et al., 2019, Gowal et al., 2019). Advanced instantiations (low-rank, proximal, alternating objectives) reduce computation, enhance coverage, and better match real-world attack efficiency (Savostianova et al., 16 Oct 2024, Antoniou et al., 2022).
  • Image Segmentation: Targeted PGD achieves highly precise attacks, successfully diverting segmentation outputs to attacker-chosen masks—even with minimal, imperceptible perturbations—outperforming other segmentation-specific attacks (ASMA) especially in multiclass, complex output spaces (Vo et al., 2022).
  • Detection and Forensics: PGD-like attacks leave strong, characteristic traces, especially via increased local linearity in the network's response (captured by Adversarial Response Characteristics or ARC, and the Sequel Attack Effect), allowing robust detection of such attacks even with minimal data and without auxiliary networks (Zhou et al., 2022).

4.2 LLMs

  • Jailbreak Attacks: Both discrete (GCG and follow-up variants) and continuous-relaxation (PGD on the prompt simplex) attacks achieve high attack success, with continuous PGD delivering equivalent attack rates an order of magnitude faster, which is critical for scalable adversarial evaluation and training (Geisler et al., 14 Feb 2024, Li et al., 20 Oct 2024); a simplex-projection sketch follows this list.
  • Prompt Injection and Exfiltration: GCG suffixes are implicated in successful cross-prompt injection attacks (XPIA), increasing successful data exfiltration by up to 20% on medium-alignment models, while the effect fades for high-robustness models (e.g., GPT-4o) (Valbuena, 1 Aug 2024).
  • Increased Transferability: Hybrid techniques (CoT-GCG, semantic REINFORCE-GCG/PGD) substantially boost adversarial success rates, especially on robust, reasoning-centric tasks and when employing rigorous, semantic evaluation metrics (e.g., Llama Guard, GPT-4o judge) (Su, 29 Oct 2024, Geisler et al., 24 Feb 2025, Tan et al., 30 Aug 2025).
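As referenced above, continuous-relaxation prompt attacks treat each token as a point on the probability simplex, take gradient steps, and project back after every step. Below is a sketch of the standard sorting-based Euclidean simplex projection; the cited work additionally applies an entropy projection, which is not shown here.

```python
import torch

def project_simplex(v):
    """Project each row of v onto the probability simplex {w : w >= 0, sum(w) = 1}.

    Keeps every relaxed token distribution valid after a gradient step in a
    continuous-relaxation prompt attack.
    """
    n = v.size(1)
    u, _ = v.sort(dim=1, descending=True)                  # sort rows in decreasing order
    cssv = u.cumsum(dim=1) - 1.0
    j = torch.arange(1, n + 1, device=v.device, dtype=v.dtype)
    cond = u * j > cssv                                    # positions still above the threshold
    rho = (cond * j).max(dim=1, keepdim=True).values       # largest index satisfying the condition
    theta = cssv.gather(1, rho.long() - 1) / rho           # per-row shift
    return torch.clamp(v - theta, min=0.0)
```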

5. Practical Implications, Evaluation Pitfalls, and Defense Considerations

  • Efficiency and Scalability: Randomized and low-rank PGD variants (e.g., WITCHcraft, LoRa-PGD) enable strong attacks or adversarial example generation at lower computational and memory costs—a requisite for adversarial training on high-dimensional input spaces and low-latency deployment (Chiang et al., 2019, Savostianova et al., 16 Oct 2024).
  • Benchmark Integrity: Overestimation of robustness is common when only naive surrogate losses or heuristic output checks ("no apology prefix") are used (Antoniou et al., 2022, Tan et al., 30 Aug 2025). Attack success must be reported under semantically rigorous metrics, ideally using independent, high-performing judges (e.g., GPT-4o, Llama Guard).
  • Resistance and Limitations: Advanced PGD and GCG variants (REINFORCE, T-GCG) expose vulnerabilities even in robust or circuit breaker-equipped models, but attack success rates markedly decrease as model size and alignment enforcement scale increase (Tan et al., 30 Aug 2025).
  • Backdoor Resilience: PGD-based adversarial training is robust against norm-bounded test-time manipulation but does not block backdoor attacks (poisoned training). However, the induced feature clustering can be exploited for post-hoc backdoor detection—an emergent advantage of robust training (Soremekun et al., 2020).

6. Open Directions and Controversies

  • Gradient Sign vs. Magnitude: It was previously conventional to use only the sign of the gradient for \ell_\infty attacks (PGD, FGSM), but recent analysis demonstrates that, with correct hidden state optimization (RGD), full-magnitude raw gradients yield stronger, more transferable and more authentic adversarial examples (Yang et al., 2023).
  • Objective Diversification: There is no single surrogate loss universally optimal for all architectures or defense regimes. Alternating or ensemble approaches are superior, but the specifics of alternation (scheduling, combination) remain an area of ongoing research (Antoniou et al., 2022).
  • Discrete Optimization vs. Continuous Relaxation in Language: Continuous-relaxation-based PGD for prompt space, with careful entropy control, bridges the efficiency gap to discrete token optimization, providing scalable, strong, and domain-agnostic adversarial evaluation for text models (Geisler et al., 14 Feb 2024).
  • Universal and Transferable Attacks: While some GCG variants exhibit high transferability between open- and closed-source models, effectiveness depends on architecture, alignment, and prompt class; no universal attack has yet achieved consistently high ASR against state-of-the-art closed-source LLMs (Li et al., 20 Oct 2024, Tan et al., 30 Aug 2025).

7. Summary Table: PGD vs. GCG Attack Families

| Dimension | PGD | GCG |
| --- | --- | --- |
| Input domain | Continuous (e.g., image) | Discrete (tokens, text prompts) |
| Update rule | Gradient ascent + projection | Greedy coordinate gradient (token by token) |
| Projection | Onto norm-ball, per step or final | Vocabulary set; discrete simplex |
| Typical constraint | \ell_\infty, \ell_2, nuclear norm | Length, syntax, token budget |
| Efficiency | Fast, can be batched | Expensive (unless optimized) |
| Adaptations | Randomized step (WITCHcraft), low-rank, primal-dual, RGD | Distance regularization, annealing, CoT targets, REINFORCE |
| Applications | Vision, control, audio, robustness eval | LLM jailbreaking, data exfiltration, reasoning, prompt integrity |
| Defenses | Adversarial training, spectral defenses | Input filtering, output guard models (Llama Guard), alignment |
| Forensics | ARC/SAE traces, clustering in robust models | Currently no direct trace analog |
| Limitations | Gradient masking, non-differentiable models, black-box | Discretization, scalability to large models, robustness to advanced alignments |

PGD and GCG attacks constitute complementary yet distinct optimization-based paradigms for adversarial evaluation. Their evolution—including algorithmic refinements, domain-specific enhancements, and the development of sophisticated evaluation methodologies—has both deepened the understanding of model vulnerability and complicated the benchmarking of robust and aligned systems. As model complexity increases and new defense paradigms emerge, continual refinement of both attack and evaluation strategies remains essential to accurately measure, and ultimately improve, real-world AI robustness.
