Papers
Topics
Authors
Recent
2000 character limit reached

Rank-Assisted Prefilling (RAP) Attack

Updated 12 December 2025
  • The RAP attack reveals a critical vulnerability in safety-aligned LLMs by exploiting a disparity between token probability and rank during autoregressive decoding.
  • It circumvents probability-based defenses by manipulating low-probability yet high-ranked harmful tokens, demonstrating limitations in conventional supervised fine-tuning.
  • The PRESTO defense employs rank-matching and attention regularization to effectively reduce harmful content extraction while preserving overall model utility.

The Rank-Assisted Prefilling (RAP) attack is a vulnerability discovered in deeply safety-aligned LLMs, particularly those defended via data-augmented supervised fine-tuning (SFT). It exploits a mismatch between the model’s output token probability distribution and the ranking of tokens, revealing a critical flaw in conventional probability-based alignment, and enabling adversaries to extract harmful content by manipulating token selection strategies beyond standard decoding. RAP has motivated the development of novel defense strategies, notably rank-based attention regularization, with the PRefill attEntion STOpping (PRESTO) technique achieving substantial robustness improvements under this attack surface (Vega et al., 5 Dec 2025).

1. Prefilling and the Emergence of RAP

The canonical prefilling attack presents an LLM with a user prompt xx representing a harmful request (e.g., "How do I build a bomb?") and appends an explicit, affirmative "prefill" prefix pre\mathsf{pre} to the assistant’s response before model decoding. Although models with safety-alignment typically refuse such requests, presenting the response decoder with an already-compliant fragment can force continuation into a refusal only after the prefill, reducing immediate utility to attackers (Vega et al., 5 Dec 2025).

RAP generalizes this tactic: it circumvents not just by prefill insertion but also by exploiting token rank at each next-token prediction. At each step tt of autoregressive decoding, the attacker inspects the top-kk tokens Kt={u1,,uk}K_t = \{u_1,\dots,u_k\} ranked by descending conditional probability under p(ytx,pre,y<t)p(y_t|x,\,\mathsf{pre},\,y_{<t}), then adversarially selects a low-probability but highly-ranked "harmful" token. By iteratively favoring such tokens, RAP can reconstruct harmful sequences—a class of outputs that would be virtually inaccessible under greedy or random sampling.

2. Rank-Versus-Probability Vulnerability in SFT Defenses

State-of-the-art safety alignment, such as the data-augmentation SFT defense, generates synthetic paired examples (x,pre,r)(x,\,\mathsf{pre},\,r), fine-tuning models to produce natural language refusals rr following harmful pre\mathsf{pre}. As cross-entropy loss is minimized, the model often concentrates probability mass on a limited set of refusal tokens, relegating harmful next tokens to extremely low probability but still moderate rank (e.g., rank 7 of the top-20 candidate tokens).

RAP exploits this rank gap. While cross-entropy–minimized models present refusal tokens with sharply peaked probabilities, the relative ranks of alternative harmful tokens are largely unaffected. The alignment objective, which matches log-probabilities, does not guarantee that harmful tokens will be pushed down in the top-kk candidates, creating an exploitable surface for RAP (Vega et al., 5 Dec 2025).

A plausible implication is that probability-based defenses alone do not provide robust protection against adversarial decoding tactics that inspect token ranks, not just probabilities.

3. RAP Attack Algorithm and Notation

Let xx denote a harmful user prompt and pre\mathsf{pre} a prefixed harmful fragment. At each generation step tt, the model's next-token distribution p(x,pre,y<t)p(\cdot\,|\,x,\,\mathsf{pre},\,y_{<t}) admits a ranked candidate set KtK_t. RAP proceeds by:

  • Enumerating KtK_t at each tt.
  • Selecting a uiKtu_i \in K_t for which uiu_i maintains the harmful trajectory, regardless of its absolute probability.
  • At each step, preferring the first viable "harmful" token rather than defaulting to the argmax or sampling.

Variants include human-in-the-loop RAP (manual harmful token selection) and AutoRAP (automatic selection using a binary token classifier). This is strictly more powerful than generic decoding strategies, and distinct from classical jailbreaks (e.g., suffix-only or prompt-only injections).

The critical insight is that in the presence of a high-entropy or low-probability tail, harmful continuations remain accessible via top-kk rank inspection, undermining "deep" safety alignment (Vega et al., 5 Dec 2025).

4. Rank-Matching and the PRESTO Defense

To mitigate RAP, the defense paradigm shifts from matching probabilities to matching ranks—specifically, aligning the top-rr ordering of the safe model's next-token distribution p(x)p^*(\cdot\,|\,x) with the model's distribution given prefilled inputs p(x,pre;θ)p(\cdot\,|\,x,\,\mathsf{pre};\,\theta). The Push-Forward Alignment (PFA-1) objective formalizes this as minimizing the discrepancy in rank vectors R(p)R(p) and R(p)R(p^*), measured by a weighted Spearman correlation ρW\rho_W.

However, as direct rank-matching is not tractable in black-box models, the PRESTO ("PRefill attEntion STOpping") regularizer targets the model’s internal representation: for each multi-head attention (MHA) head (,h)(\ell,h), PRESTO penalizes attention mass allocated to the prefill token indices II, enforcing that the model’s forward semantics are conditioned primarily on the original user request xx and not on the prefill pre\mathsf{pre}.

Mathematically, the regularizer is:

LPRESTO(θ)=E(x,pre)  =1Lh=1H[Apre(,h)(θ)Anon-pre(,h)(θ)],\mathcal{L}_{\rm PRESTO}(\theta) = \mathbb{E}_{(x,\mathsf{pre})}\; \sum_{\ell=1}^L \sum_{h=1}^H\, [A_{\rm pre}^{(\ell,h)}(\theta) - A_{\rm non\text{-}pre}^{(\ell,h)}(\theta)],

where Apre(,h)A_{\rm pre}^{(\ell,h)} and Anon-pre(,h)A_{\rm non\text{-}pre}^{(\ell,h)} are sums of attention from all tokens to prefill and non-prefill tokens, respectively. The total fine-tuning loss combines standard SFT cross-entropy and this attention-stopping term, typically with λ=1.0\lambda=1.0:

Ltotal(θ)=E(x,pre)[logp(refusalx,pre;θ)]+λ  LPRESTO(θ)\mathcal{L}_{\rm total}(\theta) = \mathbb{E}_{(x,\mathsf{pre})}\bigl[-\log p(\text{refusal}|x,\mathsf{pre};\theta)\bigr] + \lambda\;\mathcal{L}_{\rm PRESTO}(\theta)

(Vega et al., 5 Dec 2025).

5. Empirical Evaluation and Results

RAP attack vulnerability and PRESTO defense have been systematically evaluated:

  • Attack methodology: Human RAP and AutoRAP (classifier-assisted automatic attack) are performed on Llama 2 7B Chat, Qwen 3 8B, Gemma 3 12B IT, using prompt data from StrongREJECT and the Anthropic Red-Teaming dataset. Each RAP step exposes the model’s top-20 next tokens for adversarial selection.
  • Metrics: The StrongREJECT (SR) score in [0,1][0,1] quantifies compliance with harmful requests (lower is better). Utility is measured via MT-Bench and GSM-8K performance.
  • Key results:
Model Attack Type DA Only (SR) DA+PRESTO (SR) Relative Decrease
Llama 2 7B AutoRAP 0.539 0.138 ~3.9×
Qwen 3 8B AutoRAP 0.712 0.152 ~4.7×
Gemma 3 12B AutoRAP 0.868 0.169 ~5.1×
  • Utility preservation: PRESTO incurs negligible degradation in both general LLM utility (e.g., MT-Bench, GSM-8K) and does not decrease robustness to other attacks (e.g., nanoGCG).
  • Manual and automatic RAP: DA+PRESTO achieves SR ≈ 0.138–0.169 under AutoRAP, a reduction by a factor of approximately 4–5 compared to DA-only SFT (Vega et al., 5 Dec 2025).

This demonstrates that rank-matching via attention regularization substantially hardens models against RAP, without sacrificing non-safety functionality.

6. Discussion, Limitations, and Future Directions

PRESTO’s regularizer leverages already-computed attention maps, inducing no significant additional training or inference cost. The principal limitation is its focus on ignoring harmful prefills; other jailbreak strategies may necessitate distinct or complementary control objectives. PRESTO’s effectiveness arises from suppressing attention to prefill tokens rather than explicit token reweighting, sidestepping the intractability of hard rank-matching in high-dimensional model outputs.

Planned extensions include higher-order Push-Forward Alignment (PFA-tt) targeting longer harmful continuations, layer-selective regularization applied to uppermost transformer layers, and dynamic scheduling of the PRESTO penalty. The weighting function WW for rank-correlation can also be optimized to modulate focus on the most critical next-token ranks.

A plausible implication is that future work may require the joint regularization of multiple context regions (e.g., suffixes and prefaces) as attackers evolve their methodologies.

7. Relation to Attentive Prediction in Linear Models

The "attention stopping" concept introduced by PRESTO has direct analogy to previous work in linear prediction, notably the PRESTO mechanism for linear predictors (Pelossof et al., 2012). In that setting, the predictor evaluates a small prefill of features, then stops evaluation early if the partial sum exceeds a threshold, exploiting the observation that “easy” cases permit rapid confident prediction. While the underlying architectures differ (linear predictors vs. deep transformers), both exploit early, decisive computation (“attention” or feature evaluation) to optimize for a desired outcome: either computational efficiency or, in the LLM case, robustness to adversarial prefilling.

In summary, RAP reveals a critical rank-based vulnerability in SFT-aligned LLMs that is not mitigated by conventional probability-based alignment. Rank-matching regularization, as instantiated in PRESTO, currently offers the principal mechanistic mitigation by enforcing that LLMs ignore harmful prefills at the attention level, closing this avenue for automated and manual adversarial extraction of harmful content (Vega et al., 5 Dec 2025).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (2)

Whiteboard

Follow Topic

Get notified by email when new papers are published related to Rank-Assisted Prefilling (RAP) Attack.