Jailbreaking Large Language Models with Symbolic Mathematics (2409.11445v2)

Published 17 Sep 2024 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: Recent advancements in AI safety have led to increased efforts in training and red-teaming LLMs to mitigate unsafe content generation. However, these safety mechanisms may not be comprehensive, leaving potential vulnerabilities unexplored. This paper introduces MathPrompt, a novel jailbreaking technique that exploits LLMs' advanced capabilities in symbolic mathematics to bypass their safety mechanisms. By encoding harmful natural language prompts into mathematical problems, we demonstrate a critical vulnerability in current AI safety measures. Our experiments across 13 state-of-the-art LLMs reveal an average attack success rate of 73.6%, highlighting the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. Analysis of embedding vectors shows a substantial semantic shift between original and encoded prompts, helping explain the attack's success. This work emphasizes the importance of a holistic approach to AI safety, calling for expanded red-teaming efforts to develop robust safeguards across all potential input types and their associated risks.

Jailbreaking LLMs with Symbolic Mathematics

The paper probes vulnerabilities in the safety mechanisms of LLMs by introducing MathPrompt, a technique that leverages symbolic mathematics to bypass those safeguards. Emet Bethany and colleagues show that LLMs, despite being hardened with Reinforcement Learning from Human Feedback (RLHF) and extensive red-teaming, remain susceptible to a class of attack that has gone largely unexamined.

Key Contributions

The authors present several critical findings:

  1. MathPrompt Technique: They formulated a method to encode harmful natural language prompts into symbolic mathematics problems, which can deceive LLMs into generating unsafe outputs. When a prompt such as "How to rob a bank" is transformed into a mathematically encoded format, the models' safety mechanisms fail to recognize and block the harmful intent.
  2. Experimental Evaluation: Conducting experiments across 13 state-of-the-art LLMs, they demonstrated an alarming average attack success rate of 73.6%. The models spanned major proprietary and open-source LLMs, including GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, and MathPrompt's efficacy was consistent across the board, regardless of model-specific safety configurations.
  3. Embedding Analysis: The paper compares embedding vectors of original and encoded prompts, finding a low average cosine similarity of 0.2705. This large semantic shift helps explain why current AI safety mechanisms fail to detect the harmful nature of mathematically encoded prompts (a minimal sketch of the comparison follows this list).
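
For concreteness, the embedding comparison in point 3 can be sketched roughly as follows. The paper's exact embedding model is not restated here, so the sentence-transformers model below (all-MiniLM-L6-v2) and the placeholder prompt strings are illustrative assumptions, not the authors' setup.

```python
# Sketch of the embedding comparison; the embedding model and prompts
# below are assumptions for illustration, not the paper's exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

original_prompt = "<original natural-language prompt>"
math_encoded_prompt = "<the same prompt re-expressed as a symbolic math problem>"

emb_original, emb_encoded = model.encode([original_prompt, math_encoded_prompt])
print(f"cosine similarity: {cosine_similarity(emb_original, emb_encoded):.4f}")
# A low value (the paper reports an average of 0.2705) indicates a large
# semantic shift between the original and the encoded prompt.
```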

Methodology

The methodology involves three main steps:

  • Transformation of Prompts: Natural language instructions were represented using concepts from set theory, abstract algebra, and symbolic logic. This allowed the harmful content to be framed as a math problem, significantly altering its surface form while preserving its semantic core.
  • Few-shot Learning: GPT-4o is given a small set of demonstrations of the transformation, from which it generalizes the natural-language-to-mathematics mapping to new inputs.
  • Prepended Instructions: Each math problem is prefixed with instructions directing the target LLM to solve it and provide real-world examples, creating a context in which the model may produce the harmful content without recognizing it as such (see the sketch after this list).
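
A rough sketch of how these three steps might fit together is shown below. The few-shot exemplars, the instruction wording, and the function names are hypothetical placeholders rather than the authors' actual prompts, and a benign request stands in for a harmful one.

```python
# Hypothetical sketch of a MathPrompt-style pipeline; exemplars, wording,
# and function names are placeholders, not the authors' actual prompts.

FEW_SHOT_EXAMPLES = [
    # (natural-language instruction, symbolic-math encoding) pairs would go
    # here; the paper's demonstrations are not reproduced.
]

ENCODING_INSTRUCTION = (
    "Rewrite the following instruction as a formal problem using set theory, "
    "abstract algebra, and symbolic logic, preserving its meaning."
)

PREPENDED_INSTRUCTION = (
    "Solve the following mathematical problem step by step, then illustrate "
    "the solution with a concrete real-world example."
)

def build_encoding_prompt(instruction: str) -> str:
    """Few-shot prompt asking an attack LLM (e.g. GPT-4o) to encode an instruction."""
    shots = "\n\n".join(f"Input: {nl}\nOutput: {math}" for nl, math in FEW_SHOT_EXAMPLES)
    return f"{ENCODING_INSTRUCTION}\n\n{shots}\n\nInput: {instruction}\nOutput:"

def build_attack_prompt(math_encoded: str) -> str:
    """Final prompt for the target LLM: prepended instruction + math problem."""
    return f"{PREPENDED_INSTRUCTION}\n\n{math_encoded}"

# Benign example, purely to show the plumbing:
encoding_prompt = build_encoding_prompt("Explain how to bake sourdough bread.")
# math_encoded = call_attack_llm(encoding_prompt)   # hypothetical model call
# attack_prompt = build_attack_prompt(math_encoded)
```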

Experimental Findings

Safety Training Inadequacy: The results show a profound gap in current safety strategies. Even when model-specific safety settings were adjusted, mathematically encoded harmful prompts bypassed the safeguards.

Model Vulnerability: Across the models analyzed, the paper found no strong correlation between a model's size or purported capabilities and its resistance to MathPrompt attacks, and the highest attack success rates did not follow a consistent pattern across model families.
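
The "no strong correlation" observation amounts to comparing per-model attack success rates (ASR) against model scale. A toy version of that check, with entirely made-up numbers standing in for the paper's per-model results, might look like this:

```python
# Toy correlation check between model scale and attack success rate (ASR).
# All numbers are invented for illustration; see the paper for real results.
from scipy.stats import spearmanr

# model name -> (approx. parameter count in billions, MathPrompt ASR)
results = {
    "model_a": (8, 0.71),
    "model_b": (70, 0.78),
    "model_c": (175, 0.69),
    "model_d": (400, 0.75),
}

sizes = [size for size, _ in results.values()]
asrs = [asr for _, asr in results.values()]

rho, p_value = spearmanr(sizes, asrs)
print(f"average ASR = {sum(asrs) / len(asrs):.3f}")
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
# A |rho| near 0 with a large p-value would support the finding that model
# size does not predict resistance to MathPrompt.
```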

Broader Implications

The implications of this research are twofold. Practically, current AI safety measures need significant enhancement to handle a broader spectrum of inputs, including symbolic mathematics. Theoretically, this paper opens discussions about the depth of understanding required in LLMs to ensure robust safety mechanisms.

Future Directions

Future research could expand on several fronts:

  1. Diverse Input Types: Exploring other forms of symbolic manipulation beyond set theory and abstract algebra could reveal more vulnerabilities.
  2. Multimodal Inputs: Studying how models handle combined textual and mathematical inputs, and multimodal inputs more broadly, may lead to more comprehensive safeguards.
  3. Enhanced Safety Mechanisms: Developing safety mechanisms that can detect and defuse mathematically encoded harmful prompts (one speculative pre-screening approach is sketched after this list).
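
One way to read direction 3 is as a pre-screening step that normalizes mathematically encoded input back into natural language before applying existing safety filters. The sketch below is speculative: decode_math_to_natural_language and is_flagged_by_safety_filter are hypothetical helpers, not an existing API or the authors' proposal.

```python
# Speculative guardrail sketch: decode any mathematical formalism back to
# natural language, then run the usual safety filter on both the raw and
# decoded forms. Both helper functions are hypothetical.

def decode_math_to_natural_language(user_input: str) -> str:
    """Hypothetical: ask a helper LLM to restate symbolic math in plain language."""
    raise NotImplementedError

def is_flagged_by_safety_filter(text: str) -> bool:
    """Hypothetical: an existing moderation / safety classifier."""
    raise NotImplementedError

def screen_input(user_input: str) -> bool:
    """Return True if the input should be refused before reaching the model."""
    decoded = decode_math_to_natural_language(user_input)
    return is_flagged_by_safety_filter(user_input) or is_flagged_by_safety_filter(decoded)
```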

Conclusion

This paper reveals a significant shortcoming in LLM safety mechanisms through MathPrompt, emphasizing the urgent need for more comprehensive and robust measures. By showing a high average attack success rate and explaining the semantic shift in encoded prompts, the authors provide critical insights that could greatly impact the future development of AI safety protocols.

Authors (5)
  1. Emet Bethany (4 papers)
  2. Mazal Bethany (8 papers)
  3. Juan Arturo Nolazco Flores (1 paper)
  4. Sumit Kumar Jha (37 papers)
  5. Peyman Najafirad (33 papers)