Effect of Retokenization-Induced Spacing on ArtPrompt’s Success

Determine whether the extra spaces introduced by Byte Pair Encoding (BPE) dropout retokenization act as a new ASCII art font for the ArtPrompt jailbreak attack, and whether this reduces the likelihood of triggering safety measures in the aligned large language models evaluated in the paper (GPT-3.5 0613, GPT-4 0613, Claude v2, Gemini Pro, and Llama2 Chat-7B).

Background

The authors evaluate defenses against the ArtPrompt jailbreak attack, including a retokenization defense based on BPE-dropout. Surprisingly, they find that retokenization does not mitigate ArtPrompt and may even increase its attack success rate (ASR) across models.

They conjecture a causal mechanism: the spaces added by retokenization effectively form a new ASCII-art-like font, which lowers the chance that safety measures detect the harmful content embedded in the cloaked prompt. The paper offers this explanation only as a conjecture; it has not been verified.
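
To make the conjectured mechanism concrete, here is a minimal sketch of how extra spaces can reshape ASCII art. It is not the paper's implementation and does not run real BPE-dropout: it assumes a toy stand-in that inserts a space between adjacent characters with some probability, roughly imitating pieces that are split apart and then joined with spaces. The names `toy_retokenize`, `retokenize_art`, `split_prob`, and `ART` are hypothetical and chosen for illustration only, and the glyphs spell a harmless word.

```python
import random

# Harmless 5-row ASCII art spelling "HI"; every row is 10 characters wide.
ART = [
    "#   #  ###",
    "#   #   # ",
    "#####   # ",
    "#   #   # ",
    "#   #  ###",
]


def toy_retokenize(line: str, split_prob: float, rng: random.Random) -> str:
    """Toy stand-in for retokenization (an assumption, not real BPE-dropout).

    Walks the line and, before each character after the first, inserts one
    extra space with probability `split_prob`.
    """
    if not line:
        return line
    pieces = [line[0]]
    for ch in line[1:]:
        if rng.random() < split_prob:
            pieces.append(" ")  # simulated split boundary
        pieces.append(ch)
    return "".join(pieces)


def retokenize_art(rows, split_prob=0.2, seed=0):
    """Apply the toy retokenizer to each row independently."""
    rng = random.Random(seed)
    return [toy_retokenize(row, split_prob, rng) for row in rows]


if __name__ == "__main__":
    print("original:")
    print("\n".join(ART))
    print("\nafter toy retokenization:")
    print("\n".join(retokenize_art(ART)))
```

Because each row receives spaces at different positions, the columns no longer line up and the glyph shapes are distorted, which is loosely what "a new font" could mean here. Whether such distortion actually changes how safety measures respond is exactly the question the conjecture leaves open.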

References

We note that Retokenization may even help ArtPrompt to improve ASR. We conjecture that this is because the spaces introduced by Retokenization forms a new font for ArtPrompt, which further reduces the chance of triggering safety measures deployed by victim models.

— ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (2402.11753 - Jiang et al., 2024) in Section 4.2 Experimental Results, paragraph "ArtPrompt can bypass existing defenses against jailbreak attacks" (following Table titled "This table presents the effectiveness of ArtPrompt when PPL, Paraphrase, or Retokenization is employed by victim LLMs.")
