An Interpretable N-gram Perplexity Threat Model for Large Language Model Jailbreaks (2410.16222v2)

Published 21 Oct 2024 in cs.LG, cs.AI, cs.CL, and cs.CR

Abstract: A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. These methods largely succeed in coercing the target output in their original settings, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model checks if a given jailbreak is likely to occur in the distribution of text. For this, we build an N-gram language model on 1T tokens, which, unlike model-based perplexity, allows for an LLM-agnostic, nonparametric, and inherently interpretable evaluation. We adapt popular attacks to this threat model, and, for the first time, benchmark these attacks on equal footing with it. After an extensive comparison, we find attack success rates against safety-tuned modern models to be lower than previously presented and that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Being inherently interpretable, our threat model allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent bigrams, either selecting the ones absent from real-world text or rare ones, e.g., specific to Reddit or code datasets.

Citations (1)

Summary

  • The paper establishes a unified threat model for LLM jailbreaks by integrating N-gram perplexity with computational cost constraints.
  • It recalibrates previous attack success rates, showing that conventional methods drop in effectiveness under standardized evaluation metrics.
  • The study provides actionable insights for refining adversarial attack strategies and developing robust LLM safety protocols.

A Realistic Threat Model for LLM Jailbreaks

In this work, Boreiko et al. present a systematic investigation into how jailbreaking attacks on LLMs are optimized and evaluated, an area where clearer comparisons are needed to guide the development of defenses. The authors propose a unified threat model that stands out by incorporating two constraints: perplexity under an N-gram language model and computational cost measured in FLOPs. This framework enables a like-for-like comparison of attacks by standardizing the evaluation criteria across methods and target models.
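
As a rough sketch (the class name, fields, and threshold semantics below are illustrative assumptions, not the paper's exact construction), the threat model can be read as a joint admissibility test over a perplexity threshold and a compute budget:

```python
from dataclasses import dataclass

@dataclass
class ThreatModel:
    """Illustrative two-constraint threat model: a jailbreak attempt is
    admissible only if the prompt's N-gram perplexity stays below a
    threshold and the attack's total compute stays within a FLOPs budget."""
    max_perplexity: float  # threshold calibrated on natural text (assumed)
    flops_budget: float    # total compute allowed for the attack (assumed)

    def admits(self, prompt_perplexity: float, flops_used: float) -> bool:
        return (prompt_perplexity <= self.max_perplexity
                and flops_used <= self.flops_budget)
```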

Key Contributions

The authors prioritize fluency and computational efficiency through a novel, interpretable threat model. Instead of LLM-based perplexity measures, it employs N-gram perplexity, which is model-agnostic: a lightweight bigram model built from 1 trillion (1T) tokens of text measures how far a prompt deviates from natural language. This choice makes the evaluation interpretable, universally applicable across target LLMs, and more robust to adversarial manipulation.
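
A minimal sketch of how such a bigram perplexity could be computed, assuming add-alpha smoothing over corpus counts (the paper's exact counting and smoothing scheme may differ):

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple

def train_bigram_counts(corpus: Iterable[List[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams over a tokenized reference corpus."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_perplexity(tokens: List[str], unigrams: Counter, bigrams: Counter,
                      vocab_size: int, alpha: float = 1.0) -> float:
    """Perplexity of a token sequence under an add-alpha smoothed bigram model."""
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = 0.0
    for prev, curr in pairs:
        p = (bigrams[(prev, curr)] + alpha) / (unigrams[prev] + alpha * vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / max(len(pairs), 1))
```

A prompt would then fall outside the threat model whenever its perplexity exceeds a threshold calibrated on natural text.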

The research also adapts popular attacks to fit the proposed threat model. By imposing the same perplexity and compute constraints on every method, the paper examines attack effectiveness under consistent conditions. The result is a recalibration of previously reported success rates, which drop significantly when conventional methods are benchmarked in this setting.
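
A hedged sketch of what adapting a discrete-optimization attack to this constraint could look like, reusing the bigram_perplexity helper above (the paper's actual adaptations of GCG, PRS, and other attacks may be more involved):

```python
def filter_candidates(candidate_suffixes, unigrams, bigrams, vocab_size, max_ppl):
    """Drop candidate adversarial suffixes (token lists) whose bigram
    perplexity violates the threat-model threshold, so the search only
    explores text that remains admissible."""
    return [s for s in candidate_suffixes
            if bigram_perplexity(s, unigrams, bigrams, vocab_size) <= max_ppl]
```

In a GCG-style loop, such a filter would run at every iteration before candidates are scored against the target model, with the FLOPs budget capping the total number of iterations.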

Empirical Analysis

The empirical section benchmarks several attacks, such as PRS and GCG, against state-of-the-art safety-tuned LLMs. Baseline attacks lose nearly all of their success rate once the perplexity constraint is enforced, highlighting the overestimation in prior studies. However, after adapting the attack strategies to stay within the threat model, certain methods still achieve reasonable success, albeit at reduced efficiency.

Implications and Future Directions

This paper underscores the need to rethink how adversarial attacks and safety measures for LLMs are evaluated. The proposed threat model not only provides a foundation for fair comparisons but also points to algorithmic refinements that keep attacks aligned with natural language distributions. From an application perspective, integrating such threat models can inform the development of robust LLM safety protocols and safeguard implementations.

Moreover, the interpretability afforded by N-gram distributions offers unique insights into attack strategies, enabling finer-grained, dataset-level analysis of adversarial prompts. The observations on bigram frequencies and token usage deepen the understanding of model vulnerabilities, suggesting future research into dataset-centric model training and adversarial robustness testing.
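
As an illustration of the kind of interpretable analysis this enables, a simple pass over a prompt's bigrams can surface the rare or absent ones (the cutoff is an arbitrary assumption, and bigrams refers to the counts from the earlier sketch):

```python
def rare_bigrams(tokens, bigrams, min_count=5):
    """Return bigrams in a prompt that are absent or rare in the reference
    corpus counts, the interpretable signal the paper associates with
    effective jailbreaks."""
    return [(pair, bigrams[pair]) for pair in zip(tokens, tokens[1:])
            if bigrams[pair] < min_count]
```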

Overall, the work by Boreiko et al. establishes a critical framework that will serve as a cornerstone for subsequent explorations in adversarial threats to LLMs. The nuanced understanding facilitated by this approach calls for continued exploration into scalable and efficient defense mechanisms to enhance the security and reliability of AI systems.
