Rank-Assisted Prefilling (RAP) Attack
- The RAP attack reveals a critical vulnerability in safety-aligned LLMs by exploiting a disparity between token probability and rank during autoregressive decoding.
- It circumvents probability-based defenses by manipulating low-probability yet high-ranked harmful tokens, demonstrating limitations in conventional supervised fine-tuning.
- The PRESTO defense employs rank-matching and attention regularization to effectively reduce harmful content extraction while preserving overall model utility.
The Rank-Assisted Prefilling (RAP) attack is a vulnerability discovered in deeply safety-aligned LLMs, particularly those defended via data-augmented supervised fine-tuning (SFT). It exploits a mismatch between the model’s output token probability distribution and the ranking of tokens, revealing a critical flaw in conventional probability-based alignment, and enabling adversaries to extract harmful content by manipulating token selection strategies beyond standard decoding. RAP has motivated the development of novel defense strategies, notably rank-based attention regularization, with the PRefill attEntion STOpping (PRESTO) technique achieving substantial robustness improvements under this attack surface (Vega et al., 5 Dec 2025).
1. Prefilling and the Emergence of RAP
The canonical prefilling attack presents an LLM with a user prompt representing a harmful request (e.g., "How do I build a bomb?") and appends an explicit, affirmative "prefill" prefix to the assistant’s response before model decoding. Although models with safety-alignment typically refuse such requests, presenting the response decoder with an already-compliant fragment can force continuation into a refusal only after the prefill, reducing immediate utility to attackers (Vega et al., 5 Dec 2025).
RAP generalizes this tactic: it circumvents not just by prefill insertion but also by exploiting token rank at each next-token prediction. At each step of autoregressive decoding, the attacker inspects the top- tokens ranked by descending conditional probability under , then adversarially selects a low-probability but highly-ranked "harmful" token. By iteratively favoring such tokens, RAP can reconstruct harmful sequences—a class of outputs that would be virtually inaccessible under greedy or random sampling.
2. Rank-Versus-Probability Vulnerability in SFT Defenses
State-of-the-art safety alignment, such as the data-augmentation SFT defense, generates synthetic paired examples , fine-tuning models to produce natural language refusals following harmful . As cross-entropy loss is minimized, the model often concentrates probability mass on a limited set of refusal tokens, relegating harmful next tokens to extremely low probability but still moderate rank (e.g., rank 7 of the top-20 candidate tokens).
RAP exploits this rank gap. While cross-entropy–minimized models present refusal tokens with sharply peaked probabilities, the relative ranks of alternative harmful tokens are largely unaffected. The alignment objective, which matches log-probabilities, does not guarantee that harmful tokens will be pushed down in the top- candidates, creating an exploitable surface for RAP (Vega et al., 5 Dec 2025).
A plausible implication is that probability-based defenses alone do not provide robust protection against adversarial decoding tactics that inspect token ranks, not just probabilities.
3. RAP Attack Algorithm and Notation
Let denote a harmful user prompt and a prefixed harmful fragment. At each generation step , the model's next-token distribution admits a ranked candidate set . RAP proceeds by:
- Enumerating at each .
- Selecting a for which maintains the harmful trajectory, regardless of its absolute probability.
- At each step, preferring the first viable "harmful" token rather than defaulting to the argmax or sampling.
Variants include human-in-the-loop RAP (manual harmful token selection) and AutoRAP (automatic selection using a binary token classifier). This is strictly more powerful than generic decoding strategies, and distinct from classical jailbreaks (e.g., suffix-only or prompt-only injections).
The critical insight is that in the presence of a high-entropy or low-probability tail, harmful continuations remain accessible via top- rank inspection, undermining "deep" safety alignment (Vega et al., 5 Dec 2025).
4. Rank-Matching and the PRESTO Defense
To mitigate RAP, the defense paradigm shifts from matching probabilities to matching ranks—specifically, aligning the top- ordering of the safe model's next-token distribution with the model's distribution given prefilled inputs . The Push-Forward Alignment (PFA-1) objective formalizes this as minimizing the discrepancy in rank vectors and , measured by a weighted Spearman correlation .
However, as direct rank-matching is not tractable in black-box models, the PRESTO ("PRefill attEntion STOpping") regularizer targets the model’s internal representation: for each multi-head attention (MHA) head , PRESTO penalizes attention mass allocated to the prefill token indices , enforcing that the model’s forward semantics are conditioned primarily on the original user request and not on the prefill .
Mathematically, the regularizer is:
where and are sums of attention from all tokens to prefill and non-prefill tokens, respectively. The total fine-tuning loss combines standard SFT cross-entropy and this attention-stopping term, typically with :
5. Empirical Evaluation and Results
RAP attack vulnerability and PRESTO defense have been systematically evaluated:
- Attack methodology: Human RAP and AutoRAP (classifier-assisted automatic attack) are performed on Llama 2 7B Chat, Qwen 3 8B, Gemma 3 12B IT, using prompt data from StrongREJECT and the Anthropic Red-Teaming dataset. Each RAP step exposes the model’s top-20 next tokens for adversarial selection.
- Metrics: The StrongREJECT (SR) score in quantifies compliance with harmful requests (lower is better). Utility is measured via MT-Bench and GSM-8K performance.
- Key results:
| Model | Attack Type | DA Only (SR) | DA+PRESTO (SR) | Relative Decrease |
|---|---|---|---|---|
| Llama 2 7B | AutoRAP | 0.539 | 0.138 | ~3.9× |
| Qwen 3 8B | AutoRAP | 0.712 | 0.152 | ~4.7× |
| Gemma 3 12B | AutoRAP | 0.868 | 0.169 | ~5.1× |
- Utility preservation: PRESTO incurs negligible degradation in both general LLM utility (e.g., MT-Bench, GSM-8K) and does not decrease robustness to other attacks (e.g., nanoGCG).
- Manual and automatic RAP: DA+PRESTO achieves SR ≈ 0.138–0.169 under AutoRAP, a reduction by a factor of approximately 4–5 compared to DA-only SFT (Vega et al., 5 Dec 2025).
This demonstrates that rank-matching via attention regularization substantially hardens models against RAP, without sacrificing non-safety functionality.
6. Discussion, Limitations, and Future Directions
PRESTO’s regularizer leverages already-computed attention maps, inducing no significant additional training or inference cost. The principal limitation is its focus on ignoring harmful prefills; other jailbreak strategies may necessitate distinct or complementary control objectives. PRESTO’s effectiveness arises from suppressing attention to prefill tokens rather than explicit token reweighting, sidestepping the intractability of hard rank-matching in high-dimensional model outputs.
Planned extensions include higher-order Push-Forward Alignment (PFA-) targeting longer harmful continuations, layer-selective regularization applied to uppermost transformer layers, and dynamic scheduling of the PRESTO penalty. The weighting function for rank-correlation can also be optimized to modulate focus on the most critical next-token ranks.
A plausible implication is that future work may require the joint regularization of multiple context regions (e.g., suffixes and prefaces) as attackers evolve their methodologies.
7. Relation to Attentive Prediction in Linear Models
The "attention stopping" concept introduced by PRESTO has direct analogy to previous work in linear prediction, notably the PRESTO mechanism for linear predictors (Pelossof et al., 2012). In that setting, the predictor evaluates a small prefill of features, then stops evaluation early if the partial sum exceeds a threshold, exploiting the observation that “easy” cases permit rapid confident prediction. While the underlying architectures differ (linear predictors vs. deep transformers), both exploit early, decisive computation (“attention” or feature evaluation) to optimize for a desired outcome: either computational efficiency or, in the LLM case, robustness to adversarial prefilling.
In summary, RAP reveals a critical rank-based vulnerability in SFT-aligned LLMs that is not mitigated by conventional probability-based alignment. Rank-matching regularization, as instantiated in PRESTO, currently offers the principal mechanistic mitigation by enforcing that LLMs ignore harmful prefills at the attention level, closing this avenue for automated and manual adversarial extraction of harmful content (Vega et al., 5 Dec 2025).