MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks (2409.17699v3)

Published 26 Sep 2024 in cs.CR, cs.AI, and cs.LG

Abstract: The proliferation of LLMs in diverse applications underscores the pressing need for robust security measures to thwart potential jailbreak attacks. These attacks exploit vulnerabilities within LLMs, endanger data integrity and user privacy. Guardrails serve as crucial protective mechanisms against such threats, but existing models often fall short in terms of both detection accuracy, and computational efficiency. This paper advocates for the significance of jailbreak attack prevention on LLMs, and emphasises the role of input guardrails in safeguarding these models. We introduce MoJE (Mixture of Jailbreak Expert), a novel guardrail architecture designed to surpass current limitations in existing state-of-the-art guardrails. By employing simple linguistic statistical techniques, MoJE excels in detecting jailbreak attacks while maintaining minimal computational overhead during model inference. Through rigorous experimentation, MoJE demonstrates superior performance capable of detecting 90% of the attacks without compromising benign prompts, enhancing LLMs security against jailbreak attacks.

Citations (1)

View on Semantic Scholar

Summary

The paper introduces MoJE, a novel modular architecture that uses naive tabular classifiers to detect jailbreak attacks in LLMs.
It employs simple n-gram-based linguistic features and hyperparameter tuning, achieving a 90% detection rate with low false positives.
MoJE offers scalable, efficient updates against evolving attack forms, outperforming state-of-the-art models in key performance metrics.

Analysis of "MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks"

The paper introduces MoJE (Mixture of Jailbreak Experts), a novel guardrail architecture designed to detect and prevent jailbreak attacks on LLMs. Such attacks exploit vulnerabilities in LLMs, compromising data integrity and posing risks to user privacy. The authors propose MoJE as a modular, computationally efficient approach to surpass existing guardrails in performance, specifically focusing on detecting jailbreaks with minimal overhead.

Methodology

MoJE employs a mixture of naive tabular classifiers, each trained on specific jailbreak datasets alongside benign counterparts. This modular structure allows for adaptive incorporation of new attack types by training and integrating new classifiers without requiring substantial computational resources. The authors use simple linguistic statistical techniques, such as n-gram feature extraction, making MoJE both lightweight and efficient.

Experiments demonstrate that MoJE achieves impressive results—detecting 90% of attacks without compromising benign prompts. The architecture's modularity allows for easy updating and expansion, proving resilient against evolving threat landscapes. The paper benchmarks MoJE against state-of-the-art models, including ProtectAI and Llama-Guard, as well as Azure AI Content Safety, showcasing its superior performance in harmful content detection and resilience.

Experimental Setup

For the experimental analysis, various datasets were utilized, encompassing both jailbreak and benign prompts. The use of n-grams as the feature extraction strategy highlights MoJE's ability to work with simple yet effective linguistic statistics. Furthermore, a grid search of hyperparameters optimizes each classifier, and a cross-validation strategy ensures robust model training.

Results

The results indicate that MoJE achieves the highest scores in most performance metrics (AUC, accuracy, $F_{\beta}$ ) compared to both open-weight and closed-source models. Its false positive rate remains low, underscoring its efficacy in distinguishing benign inputs. The MoJE framework demonstrates a marked ability to update seamlessly with new out-of-distribution datasets, unlike many existing models that require full retraining.

Implications and Future Directions

MoJE presents a practical approach to enhancing security measures against jailbreak attacks in LLMs by offering a modular and computationally light solution. This research could have significant implications for deploying more secure LLMs in real-world applications, maintaining user trust and data integrity. Future work could explore integrating MoJE with lightweight LLMs or hybrid approaches combining statistical and deep learning techniques, potentially enabling better handling of complex linguistic prompts and further reducing computational burdens.

In summary, MoJE effectively addresses the pressing need for robust LLM guardrails, offering a scalable and efficient architecture adaptable to new attack forms and evolving threat landscapes. This paper contributes valuable insights into securing LLM applications in a computationally efficient manner.

PDF Markdown

Related Papers

Tweets

https://twitter.com/rohanpaul_ai/status/1843809595384287608

https://twitter.com/gm8xx8/status/1839490840307069263

https://twitter.com/FSFG/status/1839696420607054292