- The paper introduces Graph of Attacks (GoAT), a novel graph-based method leveraging the Graph of Thoughts framework to generate effective black-box jailbreak prompts for LLMs.
- GoAT demonstrates high efficacy, achieving up to five times higher success rates against robust LLMs while requiring fewer queries than existing state-of-the-art jailbreaking techniques.
- Because it generates human-readable, interpretable prompts and operates in a purely black-box setting, GoAT offers security researchers a valuable tool for understanding and improving LLM robustness and safety.
Evaluation of Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs
The paper, "Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs," addresses a significant concern within NLP: the robustness and alignment of LLMs against adversarial attacks. The authors propose a novel approach termed Graph of Attacks (GoAT), leveraging the Graph of Thoughts (GoT) framework to generate effective adversarial prompts capable of jailbreaking LLMs' safety mechanisms.
Key Contributions and Results
The primary contribution of GoAT lies in its use of a graph-based structure to organize adversarial prompt generation. Unlike existing methods that follow more constrained reasoning structures, GoAT maintains multiple attack paths simultaneously and lets them coordinate and refine one another's thoughts. This collaboration across paths is what enables GoAT to discover adversarial vulnerabilities in, and bypass, robust models such as Llama.
The reported results are compelling: GoAT generates successful jailbreak prompts with fewer queries than state-of-the-art methods, achieving up to five times higher success rates against robust LLMs. The paper also shows that GoAT handles black-box attacks effectively, requiring no access to the target model's parameters.
Methodology and Implementation
GoAT's methodology builds on the Graph of Thoughts framework, emphasizing dynamic reasoning across interconnected paths. It differs from traditional tree-based reasoning frameworks by allowing information to flow between multiple reasoning paths, thereby widening the exploration space for adversarial prompt generation.
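To make the structural difference concrete, here is a minimal Python sketch of the underlying data structure. It is illustrative only: the names `AttackNode` and `merge_nodes` are ours, not the paper's. The key point is that a node may have multiple parents, which is exactly the cross-path information sharing a tree cannot express.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class AttackNode:
    """One 'thought': a candidate jailbreak prompt plus bookkeeping."""
    prompt: str
    score: float = 0.0                 # judge-assigned effectiveness score
    parents: List["AttackNode"] = field(default_factory=list)

def merge_nodes(a: AttackNode, b: AttackNode, combined_prompt: str) -> AttackNode:
    """Fuse two reasoning paths into one node with *two* parents.
    In a tree (e.g., TAP) every node has exactly one parent, so this
    cross-path merge is the operation that a graph uniquely enables."""
    return AttackNode(prompt=combined_prompt, parents=[a, b])
```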
At a technical level, GoAT begins by initializing a graph structure and then iteratively refines it, merging and improving thoughts drawn from different paths. This creates synergy between reasoning outcomes and steers the refinement process toward adversarial vulnerabilities efficiently.
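A compact sketch of that outer loop, reusing `AttackNode` and `merge_nodes` from the previous snippet; `attacker_seed_prompts`, `attacker_refine`, `attacker_merge`, `query_target`, and `judge_score` are hypothetical stand-ins for the attacker-LLM and judge calls, not the paper's actual interfaces:

```python
def goat_attack(goal: str, width: int = 4, iters: int = 10, threshold: float = 9.0):
    """Illustrative GoAT-style loop: keep `width` attack paths alive,
    score each with a judge, and fuse the two strongest paths each round."""
    frontier = [AttackNode(p) for p in attacker_seed_prompts(goal, width)]
    for _ in range(iters):
        for node in frontier:
            response = query_target(node.prompt)            # black-box query
            node.score = judge_score(goal, node.prompt, response)
            if node.score >= threshold:                     # jailbreak found
                return node
        frontier.sort(key=lambda n: n.score, reverse=True)
        # Cross-path step: merge the two best thoughts into a new node ...
        merged = merge_nodes(frontier[0], frontier[1],
                             attacker_merge(goal, frontier[0].prompt,
                                            frontier[1].prompt))
        # ... and refine the survivors individually.
        frontier = [attacker_refine(goal, n) for n in frontier[:width - 1]] + [merged]
    return max(frontier, key=lambda n: n.score)
```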
The paper also demonstrates GoAT against widely used target LLMs such as Vicuna and Llama, showing substantial gains in attack success while keeping the constructed prompts human-readable and interpretable. Through its filtering and evaluation mechanisms, GoAT avoids redundant queries, improving efficiency and cost-effectiveness.
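The paper's exact filtering rules are not reproduced here, but one plausible reading is duplicate pruning: drop candidates that are too similar to prompts already tried, so no target query is wasted on them. A sketch under that assumption:

```python
from difflib import SequenceMatcher

def is_redundant(candidate: str, seen: list[str], cutoff: float = 0.9) -> bool:
    """True if `candidate` is a near-duplicate of an already-tried prompt."""
    return any(SequenceMatcher(None, candidate, prev).ratio() >= cutoff
               for prev in seen)

def filter_candidates(candidates: list[str], seen: list[str]) -> list[str]:
    """Keep only novel candidates; every pruned one saves a target query."""
    kept = []
    for c in candidates:
        if not is_redundant(c, seen):
            kept.append(c)
            seen.append(c)
    return kept
```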
Comparison and Evaluation
The paper provides a thorough comparison between GoAT and other prominent adversarial frameworks such as PAIR and TAP. GoAT's graph-based reasoning and enhanced in-context learning distinguish it from PAIR's linear refinement and TAP's tree-based search. Evaluations on the AdvBench dataset affirm GoAT's effectiveness, recording consistent gains in jailbreak success rates across varied target models.
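The headline metric in such comparisons is attack success rate (ASR) over the benchmark's harmful behaviors. A minimal harness, where `advbench_goals`, `attack_fn`, and `is_jailbroken` are placeholders for the dataset loader, the attack under test, and a judge:

```python
def attack_success_rate(goals: list[str], attack_fn, is_jailbroken) -> float:
    """Fraction of benchmark goals for which the attack elicits a
    response the judge labels as a successful jailbreak."""
    hits = sum(1 for goal in goals if is_jailbroken(goal, attack_fn(goal)))
    return hits / len(goals)

# e.g. compare methods on the same goals:
# asr_goat = attack_success_rate(advbench_goals, goat_prompt, is_jailbroken)
```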
Implications and Future Directions
The implications of GoAT are significant, offering new avenues for probing LLM robustness and a versatile toolset for security researchers. By generating interpretable adversarial prompts, GoAT also sets a precedent for responsible practices that mitigate the dual-use risks of generative AI.
Future research could extend GoAT's architecture to exploit broader context windows or develop cost-efficient variants that further reduce computational demands. Moreover, establishing provable guarantees for prompt-level robustness, akin in spirit to differential privacy, could strengthen defenses against evolving vulnerabilities.
Conclusion
The paper "Graph of Attacks: Improved Black-Box and Interpretable Jailbreaks for LLMs" delivers a sophisticated and systematic approach for adversarial inspection of LLMs using advanced graph-based reasoning techniques. Its methodological contributions offer promising prospects in advancing the robustness and alignment of LLMs amid increasing challenges posed by adversarial exploits. Through careful evaluation and responsible disclosure, this research paves the way for developing safer and more secure AI systems in the future.