Joint Optimization for Super Suffixes
- The paper introduces an alternating greedy coordinate-gradient search that jointly optimizes adversarial suffixes to bypass detection while maximizing targeted output probabilities.
- It employs a bi-objective token-level optimization balancing LLM generation and guard evasion, demonstrating transferability across architectures and tokenizations.
- Experiments reveal that super suffixes can boost guard benign scores to 93–96% while keeping LLM refusal rates far below the no-suffix baseline, highlighting vulnerabilities in current safety measures.
A joint optimization algorithm for Super Suffixes refers to a targeted procedure for constructing adversarial suffixes appended to prompts in LLM text generation, designed to simultaneously maximize the probability of eliciting specific (malicious) outputs from the generation model and evade detection by alignment or guard models. This approach advances adversarial prompt engineering against LLMs protected by auxiliary classifiers, demonstrating transferability across architectures and tokenization schemes, as well as the ability to bypass state-of-the-art guard mechanisms such as Llama Prompt Guard 2. The method introduces an alternating, greedy coordinate-gradient (GCG) search, effectively solving a constrained bi-objective token-level optimization that operates despite incompatible tokenizers between the generator and guard (Adiletta et al., 12 Dec 2025).
1. Problem Formulation
Let $G$ denote the target text-generation model (e.g., Vicuna, Gemma, Llama) and $D$ the associated guard classifier (Llama Prompt Guard 2). The adversary's input comprises a fixed prompt $Q$, possibly including a primary adversarial suffix, and a secondary suffix $s$ of length $k$ to be optimized over vocabulary $V$.
The concatenated input is $x = Q \,\|\, s$. Two loss terms are defined: the generation loss $L_{\text{gen}}(x) = -\log P_{\text{gen}}(y^{*} \mid x)$ and the guard loss $L_{\text{guard}}(x) = -\log P_{\text{guard}}(\text{benign} \mid x)$, where $y^{*}$ is a targeted (malicious) output token sequence.
The optimization objective is $\min_{s \in V^{k}} \; \alpha\, L_{\text{gen}}(Q \,\|\, s) + \gamma\, L_{\text{guard}}(Q \,\|\, s) + \lambda\, R(s)$, subject to $s_i \in V$ for $i = 1, \dots, k$. Parameters $\alpha$ and $\gamma$ weight the generator and guard objectives respectively; $R$ is an optional regularizer (typically zero) and $\lambda$ is its weight. The formulation accommodates diverse tokenizations between generator and guard, prohibiting naive joint gradient computation and necessitating an alternating optimization approach.
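The joint objective can be sketched in a few lines. The two probability functions below are toy stand-ins (assumptions for illustration only): in the real attack, $P_{\text{gen}}$ and $P_{\text{guard}}$ come from full forward passes of the generator and the guard classifier.

```python
import math

# Toy stand-ins for the two models (assumptions for illustration); the
# real attack queries the generator and the guard here.
def p_gen(x, y_star):
    # Hypothetical generator probability of emitting the target y* given x.
    return 1.0 / (1.0 + sum(x) % 7 + len(y_star))

def p_guard_benign(x):
    # Hypothetical guard probability that input x is benign.
    return 1.0 / (1.0 + sum(x) % 5)

def joint_loss(x, y_star, alpha=1.0, gamma=1.0):
    # alpha * L_gen + gamma * L_guard, each L a negative log-likelihood.
    l_gen = -math.log(p_gen(x, y_star))
    l_guard = -math.log(p_guard_benign(x))
    return alpha * l_gen + gamma * l_guard
```

Increasing $\alpha$ emphasizes eliciting the target output; increasing $\gamma$ emphasizes evading the guard.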
2. Alternating Greedy Coordinate-Gradient (GCG) Strategy
The core of the algorithm is an alternating, beamless coordinate-wise HotFlip-style search:
- Linear Approximation: At each iteration $t$, the loss to approximate, $L_{\text{approx}} \in \{L_{\text{gen}}, L_{\text{guard}}\}$, is selected. For each suffix position $i \in I$, compute the gradient $g_i = \nabla_{e_{x_i}} L_{\text{approx}}(x)$ with respect to the one-hot token embedding, and select the Top-$K$ tokens with the largest negative gradient components.
- Batch Candidate Generation: $B$ single-token mutations are generated per iteration. For each, a random suffix position is chosen and its token is replaced by a uniformly sampled token from that position's Top-$K$ set.
- Full-Model Evaluation: Each candidate prompt is evaluated on both generation and guard losses; the candidate minimizing the joint objective is selected.
- Alternation Schedule: Every $N$ iterations, the target loss for the linear approximation toggles between $L_{\text{gen}}$ and $L_{\text{guard}}$, preventing stagnation on one objective. Once the guard is "fooled" (benign probability $P_{\text{guard}} \geq \tau$), the approximation remains on $L_{\text{gen}}$ exclusively.
- Hyperparameters: Experimentally, batch size $B$, Top-$K$, alternation window $N$, and guard threshold $\tau$ are fixed per attack, with an iteration budget $T$ of up to $300$. No learning rate is required due to greedy selection.
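The candidate-proposal steps above can be sketched as follows. The gradient vector is assumed to be given (in practice it comes from a backward pass through the token-embedding layer), and `propose_batch` is a hypothetical helper name, not the paper's API.

```python
import numpy as np

def topk_candidates(grad_at_pos, k):
    # grad_at_pos: gradient of L_approx w.r.t. the one-hot token embedding
    # at one suffix position (shape: vocab_size). The most negative
    # components correspond to the largest first-order loss decrease.
    return np.argsort(grad_at_pos)[:k]

def propose_batch(suffix, topk_sets, batch_size, rng):
    # Hypothetical helper: generate single-token mutations by picking a
    # random position and replacing its token with a uniform draw from
    # that position's Top-K set.
    candidates = []
    for _ in range(batch_size):
        i = int(rng.integers(len(suffix)))
        cand = list(suffix)
        cand[i] = int(rng.choice(topk_sets[i]))
        candidates.append(cand)
    return candidates
```

Each candidate differs from the current suffix in at most one position, which is what makes the subsequent full-model evaluation step affordable.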
3. Pseudocode Summary
The algorithm is succinctly captured as follows:
Input: Prompt Q, reference output y*, guard threshold τ, alternation window N
Suffix indices I = {n+1, …, n+k}, batch size B, top-K, weights α, γ, iterations T

Initialize x ← Q ∥ s   (s random or taken from the primary suffix)
for t in 1…T:
    # 1) Choose which loss to approximate
    if P_guard(x) < τ and ⌊t/N⌋ is odd:
        L_approx ← L_guard
    else:
        L_approx ← L_gen
    # 2) Linear approximation → Top-K at each position
    for i in I:
        compute gradient g_i = ∇_{e_{x_i}} L_approx(x)
        X_i ← Top-K(−g_i)
    # 3) Generate B one-token mutation candidates
    for b in 1…B:
        pick position i_b ← Uniform(I)
        pick new token t′ ← Uniform(X_{i_b})
        form candidate x^(b) by replacing x_{i_b} ← t′
    # 4) Full-model evaluation & update
    b* ← argmin_b ( α·L_gen(x^(b)) + γ·L_guard(x^(b)) )
    x ← x^(b*)
    # 5) Optionally stop early if P_guard(x) ≥ τ and L_gen(x) ≤ ε
end for
Output: optimized suffix s in x.
This loop is robust to differing tokenization schemes, since all updates occur via coordinate-wise token substitutions with explicit model evaluation at each candidate step.
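The loop above can be made concrete on a toy problem. This is a minimal runnable sketch under two stated assumptions: the losses are cheap numeric stand-ins rather than real model queries, and exhaustive per-position scoring replaces the gradient-based Top-$K$ step (the toy losses are not differentiable).

```python
import random

VOCAB = list(range(50))
K_LEN = 6  # suffix length k

# Stand-in losses (assumptions for illustration); the real algorithm
# queries the generator and guard models here.
def l_gen(s):
    return abs(sum(s) - 120)        # "generation" loss: hit a target sum

def l_guard(s):
    return abs(s[0] * s[-1] - 200)  # "guard" loss: hit a target product

def p_guard(s, scale=50.0):
    return max(0.0, 1.0 - l_guard(s) / scale)  # proxy benign probability

def super_suffix_search(s0, T=200, N=10, B=16, K=8,
                        alpha=1.0, gamma=1.0, tau=0.9):
    s = list(s0)
    for t in range(1, T + 1):
        # 1) choose which loss to approximate (alternation schedule)
        l_approx = l_guard if (p_guard(s) < tau and (t // N) % 2 == 1) else l_gen
        # 2) per-position Top-K; exhaustive scoring stands in for the
        #    gradient-based linear approximation of the real attack
        topk = []
        for i in range(K_LEN):
            scored = sorted(VOCAB, key=lambda v: l_approx(s[:i] + [v] + s[i + 1:]))
            topk.append(scored[:K])
        # 3) B random single-token mutations
        cands = [list(s) for _ in range(B)]
        for c in cands:
            i = random.randrange(K_LEN)
            c[i] = random.choice(topk[i])
        # 4) greedy selection under the full joint objective (keeping the
        #    current suffix as a candidate makes the objective monotone)
        s = min(cands + [s], key=lambda c: alpha * l_gen(c) + gamma * l_guard(c))
        # 5) early stop once both objectives are met
        if p_guard(s) >= tau and l_gen(s) == 0:
            break
    return s
```

Because the current suffix stays in the candidate pool, the joint objective never increases, mirroring the greedy full-evaluation step of the real attack.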
4. Convergence Intuitions and Theoretical Rationale
Two main intuitions explain the empirical effectiveness of the alternating greedy search:
- Loss Stability under Alternation: Alternating the primary (generation) and secondary (guard) loss ensures improvements in one objective do not irreparably degrade the other. Empirically, degradations to one objective during alternation are slight and the search recovers over subsequent cycles.
- Shared Linear Subspaces and Concept Directions: Analysis of hidden activations demonstrates that refusal and malicious-content signals are represented in stable linear subspaces in later layers. By driving activations along these "concept directions" in phase one (primary suffix optimization) and then manipulating the guard model’s decision boundary via the joint optimization (secondary suffix), the resultant suffixes show transferability across models and tokenization regimes.
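One common way to operationalize such concept directions (an assumption here; the source does not spell out its estimator) is the difference of mean hidden activations between contrasting prompt sets, with alignment then measured as cosine similarity:

```python
import numpy as np

def concept_direction(acts_pos, acts_neg):
    # Estimate a concept direction as the normalized difference of mean
    # hidden activations between positive (e.g. refusal-inducing) and
    # negative (e.g. benign) prompt runs. Shapes: (n_prompts, d_model).
    d = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return d / np.linalg.norm(d)

def cosine_alignment(h, direction):
    # Cosine similarity between one residual-stream vector and the concept
    # direction; values near 1 mean the activation lies along the concept.
    return float(h @ direction / (np.linalg.norm(h) * np.linalg.norm(direction)))
```

Driving activations along such a direction during suffix optimization is what the transferability argument attributes the cross-model success to.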
A plausible implication is that successful attacks utilizing concept directions may generalize to unforeseen architectures, highlighting structural vulnerabilities in current alignment and guard model strategies.
5. Experimental Protocol and Key Findings
Experiments were conducted against five LLMs (Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, Vicuna-7B-v1.5, Google Gemma-2B-It, Microsoft Phi-3-Mini-128k), protected by Llama Prompt Guard 2 (86 M parameters). Evaluations used a newly proposed "Malicious Code Generation" dataset (99 attack, 99 benign code prompts) and HarmBench (100 general malicious prompts). Key metrics include refusal rate (proportion of prompts refused by the LLM) and average guard benign score (the mean benign-class probability $P_{\text{guard}}$ assigned by the guard).
Representative Metrics
| Model | Refusal Rate (No Suffix) | Refusal Rate (Primary Suffix) | Guard Score (Primary) | Refusal Rate (Super Suffix) | Guard Score (Super Suffix) |
|---|---|---|---|---|---|
| Gemma | 97% | 13% | ~0.2–0.5 | 35% | ~93–96% benign |
| Vicuna | 35% | 0% | ~0.2–0.5 | — | ~93–96% benign |
| Llama-3.2-3B | 86% | 19% | ~0.2–0.5 | — | ~93–96% benign |
Super suffixes generated by the joint procedure notably increased guard "benign" predictions (to ≈93–96%) while only modestly increasing refusal rates; thus, generation of malicious outputs persisted but went undetected by the guard. Generation time per attack ranged from 9 to 85 minutes (\$0.2–\$2.1 of GH200 GPU time).
In summary, the alternating GCG joint optimization produces Super Suffixes that reliably induce malicious outputs from LLMs while simultaneously evading detection by contemporary guard architectures, even in the presence of architectural and tokenization variability (Adiletta et al., 12 Dec 2025).
6. Broader Impact and Defensive Countermeasures
The paper's demonstration that existing guard models, including Llama Prompt Guard 2, can be bypassed through joint optimization of Super Suffixes underscores critical gaps in the adversarial robustness of current LLM deployment pipelines. Analysis of internal model-state dynamics further enables effective detection of Super Suffix attacks via cosine-similarity measurement between the residual stream and concept directions ("DeltaGuard"), which raised the non-benign classification rate to nearly 100%, strengthening the defensive stack.
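A DeltaGuard-style check could be sketched as below. The thresholding rule and the per-layer maximum are assumptions for illustration; the source specifies only that cosine similarity between the residual stream and concept directions is measured.

```python
import numpy as np

def delta_guard_flag(residuals, concept_dir, threshold=0.3):
    # Hypothetical DeltaGuard-style detector (details are assumptions):
    # flag an input as non-benign if any residual-stream vector (one row
    # per layer, shape (n_layers, d_model)) is strongly aligned with the
    # malicious-concept direction.
    d = concept_dir / np.linalg.norm(concept_dir)
    sims = residuals @ d / np.linalg.norm(residuals, axis=1)
    return bool(np.max(sims) >= threshold)
```

Such a check complements a token-level guard because it inspects the generator's internal states, which the suffix optimization does not directly control.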
A plausible implication is that such attack methods compel deployment of more sophisticated layered detection—such as DeltaGuard—or adversarial training tailored to concept vector manipulations, in pursuit of enhanced LLM safety and alignment guarantees.