ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (2402.11753v2)

Published 19 Feb 2024 in cs.CL and cs.AI

Abstract: Safety is critical to the usage of LLMs. Multiple techniques such as data filtering and supervised fine-tuning have been developed to strengthen LLM safety. However, currently known techniques presume that corpora used for safety alignment of LLMs are solely interpreted by semantics. This assumption, however, does not hold in real-world applications, which leads to severe vulnerabilities in LLMs. For example, users of forums often use ASCII art, a form of text-based art, to convey image information. In this paper, we propose a novel ASCII art-based jailbreak attack and introduce a comprehensive benchmark Vision-in-Text Challenge (ViTC) to evaluate the capabilities of LLMs in recognizing prompts that cannot be solely interpreted by semantics. We show that five SOTA LLMs (GPT-3.5, GPT-4, Gemini, Claude, and Llama2) struggle to recognize prompts provided in the form of ASCII art. Based on this observation, we develop the jailbreak attack ArtPrompt, which leverages the poor performance of LLMs in recognizing ASCII art to bypass safety measures and elicit undesired behaviors from LLMs. ArtPrompt only requires black-box access to the victim LLMs, making it a practical attack. We evaluate ArtPrompt on five SOTA LLMs, and show that ArtPrompt can effectively and efficiently induce undesired behaviors from all five LLMs.

This paper, "ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs" (Jiang et al., 19 Feb 2024), investigates a vulnerability arising from LLMs' reliance on purely semantic interpretation during safety alignment. The authors show that current safety measures often assume text inputs are interpreted solely through the meaning of their characters, overlooking alternative readings such as the visual structures those characters form in ASCII art.

The core finding is that state-of-the-art LLMs struggle to recognize and process information presented as ASCII art. To demonstrate this, the authors introduce the Vision-in-Text Challenge (ViTC) benchmark. ViTC includes two datasets, ViTC-S (single characters/digits) and ViTC-L (sequences of characters/digits), rendered as ASCII art in various fonts. Evaluating five prominent LLMs (GPT-3.5, GPT-4, Gemini, Claude, Llama2) on ViTC yields accuracy (Acc) and average match ratio (AMR) far below their performance on conventional semantic tasks. For example, GPT-4 achieved only 25.19% Acc on ViTC-S and 3.26% Acc on ViTC-L. The paper also found that few-shot prompting and Chain-of-Thought (CoT) prompting provided only marginal improvements on this recognition task.
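
To make the Acc and AMR metrics concrete, the following is a minimal scoring sketch for a ViTC-style recognition task; it is not the authors' benchmark code, and `query_model` is a hypothetical stand-in for the API call to the LLM under test.

```python
# Illustrative scoring of a ViTC-style recognition task (not the official benchmark code).
# `query_model` is a hypothetical callable that sends a prompt to the LLM and returns its reply.

def average_match_ratio(prediction: str, label: str) -> float:
    """Fraction of label positions whose predicted character matches."""
    if not label:
        return 0.0
    matches = sum(p == l for p, l in zip(prediction, label))
    return matches / len(label)

def evaluate(samples, query_model):
    """samples: list of (ascii_art_prompt, ground_truth_string) pairs."""
    correct, amr_total = 0, 0.0
    for prompt, label in samples:
        prediction = query_model(
            f"The following ASCII art depicts a string. Answer with the string only.\n{prompt}"
        ).strip()
        correct += int(prediction == label)                   # exact-match accuracy (Acc)
        amr_total += average_match_ratio(prediction, label)   # per-character match (AMR)
    n = len(samples)
    return {"Acc": correct / n, "AMR": amr_total / n}
```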

Leveraging this observed weakness, the authors propose a novel jailbreak attack called ArtPrompt. ArtPrompt exploits the LLMs' inability to properly interpret ASCII art to bypass safety alignments and elicit undesired behaviors. The attack consists of two main steps:

  1. Word Masking: Identify sensitive words in a harmful instruction that are likely to trigger refusal from the LLM. These words are masked (e.g., replacing "bomb" with "[MASK]").
  2. Cloaked Prompt Generation: Replace the masked word with its representation in ASCII art, using an ASCII art generator. This ASCII art is then inserted back into the masked prompt to form a "cloaked prompt," which is sent to the victim LLM.

For practical implementation, ArtPrompt can be automated by integrating an ASCII art library. The cloaked prompts are also human-readable, making the attack potentially more stealthy than methods manipulating raw tokens. Unlike optimization-based attacks that require iterative searches, ArtPrompt can generate attack prompts more efficiently, often in a single step, by simply replacing the identified sensitive word with its ASCII art equivalent.
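
A minimal sketch of this two-step pipeline is shown below. It assumes the `pyfiglet` library as the ASCII art generator and uses a benign placeholder word; the exact wrapping instructions used in the paper differ, so this is illustrative rather than a reproduction of the released attack.

```python
# Minimal sketch of an ArtPrompt-style cloaking pipeline (illustrative, not the authors' release).
# Assumes pyfiglet (pip install pyfiglet) as the ASCII art generator.
import pyfiglet

def cloak_prompt(instruction: str, sensitive_word: str, font: str = "standard") -> str:
    # Step 1: word masking -- replace the refusal-triggering word with a placeholder.
    masked = instruction.replace(sensitive_word, "[MASK]")

    # Step 2: cloaked prompt generation -- render the masked word as ASCII art
    # and ask the model to decode it before following the instruction.
    art = pyfiglet.figlet_format(sensitive_word, font=font)
    return (
        "The ASCII art below encodes a word. Decode it, substitute it for [MASK], "
        "and then respond to the instruction.\n\n"
        f"{art}\n{masked}"
    )

# Example with a benign word standing in for a sensitive one:
print(cloak_prompt("Tell me how to configure a firewall.", "firewall"))
```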

The effectiveness of ArtPrompt was evaluated against GPT-3.5, GPT-4, Claude, Gemini, and Llama2 using the AdvBench and HEx-PHI datasets, and compared against five baseline jailbreak attacks: Direct Instruction (DI), Greedy Coordinate Gradient (GCG), AutoDAN, PAIR, and DeepInception. ArtPrompt was run in two configurations: "Top 1" (the single best-performing font) and "Ensemble" (aggregating attempts across multiple fonts). The evaluation used Helpful Rate (HPR), Harmfulness Score (HS), and Attack Success Rate (ASR, the fraction of responses with HS = 5).
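
As a rough illustration of how these metrics are typically aggregated (not the paper's exact evaluation harness), the sketch below computes HPR, HS, and ASR from a list of model responses; `judge_score` and `is_refusal` are hypothetical placeholders for a GPT-based harmfulness judge and a refusal detector.

```python
# Illustrative aggregation of the reported metrics. `judge_score` and `is_refusal`
# are placeholder callables (harmfulness judge on a 1-5 scale, refusal detector).

def compute_metrics(responses, judge_score, is_refusal):
    scores = [judge_score(r) for r in responses]            # harmfulness on a 1-5 scale
    n = len(responses)
    hpr = sum(not is_refusal(r) for r in responses) / n     # Helpful Rate: fraction of non-refusals
    hs = sum(scores) / n                                    # average Harmfulness Score
    asr = sum(s == 5 for s in scores) / n                   # Attack Success Rate: responses with HS == 5
    return {"HPR": hpr, "HS": hs, "ASR": asr}
```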

Key findings from the attack evaluation include:

  • ArtPrompt is effective across all tested LLMs, achieving a higher average ASR (52% in the Ensemble configuration) than the baseline attacks.
  • ArtPrompt demonstrates high efficiency, achieving its ASR with significantly fewer iterations (often 1) compared to optimization-based methods like GCG.
  • On the HEx-PHI dataset, ArtPrompt successfully induces unsafe behaviors across various prohibited categories, even in highly aligned models like GPT-4.

The paper also assessed ArtPrompt's resilience against existing defense mechanisms: Perplexity-based Detection (PPL-Pass), Paraphrase, and Retokenization (using BPE-dropout). The results showed that ArtPrompt successfully bypassed PPL-Pass and Retokenization on all tested models. Notably, Retokenization sometimes even increased ArtPrompt's effectiveness, hypothesized to be due to the introduction of spaces forming new ASCII-like patterns. Paraphrase was the most effective defense but still failed to completely mitigate ArtPrompt, which maintained a notable ASR and HS when Paraphrase was applied.
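
For context on the first of these defenses, the sketch below shows one common way a perplexity filter can be implemented, assuming GPT-2 as the scoring model and an arbitrary threshold; it is not the specific PPL-Pass configuration evaluated in the paper.

```python
# Sketch of a perplexity filter in the spirit of PPL-Pass.
# The scoring model (GPT-2) and threshold value are assumptions for illustration.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean token negative log-likelihood
    return torch.exp(loss).item()

def ppl_pass(prompt: str, threshold: float = 500.0) -> bool:
    """Return True if the prompt is allowed through (perplexity below threshold)."""
    return perplexity(prompt) < threshold
```

A threshold loose enough to admit benign prompts containing code blocks or decorative text will tend to admit cloaked prompts as well, which is consistent with the bypass results reported here.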

An ablation study confirmed that the choice of font used for the ASCII art significantly affects the attack's effectiveness, with some fonts yielding notably higher ASR than others. Arranging the ASCII art vertically was found to reduce effectiveness compared to the horizontal arrangement.
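
As a small illustration of the font dimension of this ablation, the snippet below renders one word in several fonts shipped with `pyfiglet` so their visual structure can be compared; the fonts listed are examples, not necessarily the set studied in the paper.

```python
# Render the same word in several pyfiglet fonts to compare their visual structure.
# The font names are illustrative; the paper's font set may differ.
import pyfiglet

word = "mask"
for font in ["standard", "banner", "big", "block"]:
    print(f"--- font: {font} ---")
    print(pyfiglet.figlet_format(word, font=font))
```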

The authors conclude that the prevalent semantics-only interpretation in LLM safety alignment is a critical vulnerability. ArtPrompt demonstrates a practical and efficient method to exploit this weakness using ASCII art. The paper highlights the urgent need for developing more robust defense mechanisms that consider non-semantic interpretations of text inputs. The authors acknowledge the potential for misuse but emphasize the importance of this research for red-teaming and improving LLM safety. The code and prompts are intended for dissemination to aid further research in this area.

Authors (7)
  1. Fengqing Jiang (18 papers)
  2. Zhangchen Xu (17 papers)
  3. Luyao Niu (45 papers)
  4. Zhen Xiang (42 papers)
  5. Bhaskar Ramasubramanian (35 papers)
  6. Bo Li (1107 papers)
  7. Radha Poovendran (100 papers)
Citations (51)