garak: A Framework for Security Probing Large Language Models (2406.11036v1)

Published 16 Jun 2024 in cs.CL and cs.CR

Abstract: As LLMs are deployed and integrated into thousands of applications, the need for scalable evaluation of how models respond to adversarial attacks grows rapidly. However, LLM security is a moving target: models produce unpredictable output, are constantly updated, and the potential adversary is highly diverse: anyone with access to the internet and a decent command of natural language. Further, what constitutes a security weak in one context may not be an issue in a different context; one-fits-all guardrails remain theoretical. In this paper, we argue that it is time to rethink what constitutes ``LLM security'', and pursue a holistic approach to LLM security evaluation, where exploration and discovery of issues are central. To this end, this paper introduces garak (Generative AI Red-teaming and Assessment Kit), a framework which can be used to discover and identify vulnerabilities in a target LLM or dialog system. garak probes an LLM in a structured fashion to discover potential vulnerabilities. The outputs of the framework describe a target model's weaknesses, contribute to an informed discussion of what composes vulnerabilities in unique contexts, and can inform alignment and policy discussions for LLM deployment.

PDF HTML Abstract

A Framework for Security Probing LLMs

Introduction

The deployment and integration of LLMs into various applications have introduced a need for scalable evaluation methods to identify and mitigate security vulnerabilities. In the domain of LLMs, security concerns are unique due to the unpredictable nature of outputs and the diverse range of potential adversaries. Classical cybersecurity methods often fall short, as attacks on LLMs are predominantly linguistic rather than cryptographic. This paper introduces the Generative AI Red-teaming and Assessment Kit (GARAK), a comprehensive framework for probing LLMs to discover and elucidate security vulnerabilities systematically.

Framework Architecture

GARAK is structured around four main components: Generators, Probes, Detectors, and Buffs. Each component plays a crucial role in the framework's ability to assess LLM vulnerabilities.

Generators: These are abstractions of any text-generating software, from APIs to LLMs. GARAK natively supports various platforms, making it versatile for testing a range of LLMs without extensive additional development.
Probes: These are designed to elicit specific vulnerabilities in LLMs. GARAK includes probes that test for false claims, replay of training data, malware generation, invisible tags, and more. Each probe targets distinct failure modes, from direct prompt injection to more sophisticated attacks like AutoDAN and GCG.
Detectors: Given the diversity of potential LLM outputs, detection of failures requires robust methods. GARAK employs both keyword-based detection and machine-learning classifiers to identify issues such as toxicity and misleading claims.
Buffs: These augment or perturb interactions between probes and generators, analogous to fuzzing in cybersecurity. Buffs modify input or model hyperparameters to elicit potential security vulnerabilities, thus expanding the probing space.

Holistic Security Evaluation

The paper argues the necessity of a holistic and structured approach to LLM security evaluation. Traditional benchmarking is insufficient due to rapidly evolving attack strategies and continuous model updates. Instead, a dynamic and exploratory method like GARAK's framework provides a broader understanding of vulnerabilities, aiding in policy formation and alignment.

Strong Numerical Results

One salient feature of GARAK is the attack generation module (atkgen), which employs a conversational red-teaming model to train adversarial probes dynamically. Using the Anthropic HHRLHF dataset and various LLMs, GARAK's atkgen probes demonstrated high efficacy in eliciting toxic outputs across multiple models. For instance, the toxicity rates observed during evaluation were as follows:

GPT-2: 17.0%
GPT-3: 10.5%
GPT-4: 2.9%
OPT 6.7B: 26.7%
Vicuna: 3.8%
Wizard Uncensored: 5.7%

These results underscore the utility of GARAK in exposing critical weaknesses that static datasets and traditional testing methods might miss.

Practical and Theoretical Implications

The practical implications of this research are substantial. Security practitioners can leverage GARAK to automate the discovery of LLM vulnerabilities, making the assessment process more accessible and less time-consuming. The framework's extensibility ensures it remains relevant, accommodating new attack methods and evolving model architectures. Theoretically, GARAK sets a precedent for approaching LLM security with a cybersecurity mindset, integrating principles like red teaming into the fabric of LLM evaluation.

Future Developments

The continuous adaptation of GARAK is essential to keep pace with advancements in AI. Future developments may include enhancing the sophistication of buffs, incorporating new linguistic attack vectors, and refining machine-learning classifiers for failure detection. Collaboration with broader security communities can further enrich the framework, fostering a more secure and reliable deployment of LLMs.

Conclusion

GARAK represents a significant advancement in the structured, automated evaluation of LLM security. By focusing on holistic probing and dynamic attack generation, the framework bridges gaps left by traditional methods, offering a robust tool for both researchers and practitioners. As LLMs continue to evolve, frameworks like GARAK will be indispensable in ensuring their safe and secure integration into real-world applications.

## References

- Ganguli, D., et al. 2022. "Red Teaming LLMs with LLMs."

- Derczynski, L., et al. "A Framework for Security Probing LLMs."

- "Top 10 for LLMs." OWASP.

- "NMAP Network Scanning." NMAP.

- Anthropic HHRLHF Dataset. 2022.

- Zou, X., et al. 2023. "GCG: Greedy Coordinate Gradient Method for Prompt Injection."

- Shen, J., et al. 2023. "DAN: Do Anything Now Prompts for LLMs."

- Liu, C., et al. 2023. "AutoDAN: Automated Jailbreaking of LLMs using Genetic Algorithms."

- Perez, E., et al. 2022. "Prompt Injection Techniques and Mitigations."

- Gehman, S., et al. 2020. "RealToxicityPrompts: A Benchmark Dataset for LLM Safety Testing."

- Vassilev, A., et al. 2024. "NIST Adversarial Machine Learning Taxonomy."

- Anderson, R. 2020. "Security Engineering: A Guide to Building Dependable Distributed Systems."

- Wallace, E., et al. 2024. "Instruction Overriding in LLM Applications."

- Raji, I. D., et al. 2021. "AI Benchmarking and Its Discontents."

- Inie, N., et al. 2023. "Professional Red Teaming in AI: Practices and Challenges."