Vocabulary Attack to Hijack Large Language Model Applications (2404.02637v2)
Abstract: The rapid advancement of LLMs is driving an ever-increasing number of applications. Alongside the growing number of users, we also see a growing number of attackers who try to outsmart these systems. They want the model to reveal confidential information, produce specific false information, or exhibit offensive behavior. To this end, they manipulate their instructions to the LLM by inserting separators or rephrasing them systematically until they reach their goal. Our approach is different: it inserts words from the model vocabulary. We find these words using an optimization procedure and embeddings from another LLM (the attacker LLM). We demonstrate our approach by goal hijacking two popular open-source LLMs, one from the Llama2 family and one from the Flan-T5 family. We present two main findings. First, our approach creates inconspicuous instructions and is therefore hard to detect; in many attack cases, even a single word insertion is sufficient. Second, we demonstrate that we can conduct the attack using a model different from the target model.
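The abstract describes finding insertion words via an optimization procedure guided by embeddings from an attacker LLM. Below is a minimal sketch of what such a greedy vocabulary-insertion search could look like, under the assumption of a simple hill-climbing loop; the paper's exact optimization procedure is not specified here, and `query_target_model` and `embed` are hypothetical placeholders for a call to the target model and for an embedding computed with the attacker LLM.

```python
# Hypothetical sketch of a greedy vocabulary-insertion (goal-hijacking) search.
# The stubs below stand in for the target LLM and the attacker LLM's embeddings;
# they are illustrative placeholders, not the paper's actual implementation.

import numpy as np


def query_target_model(prompt: str) -> str:
    """Placeholder: query the target LLM (e.g. a Llama2 or Flan-T5 model)."""
    raise NotImplementedError


def embed(text: str) -> np.ndarray:
    """Placeholder: text embedding obtained from the attacker LLM."""
    raise NotImplementedError


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def vocabulary_attack(instruction: str,
                      target_output: str,
                      vocabulary: list[str],
                      max_insertions: int = 3) -> str:
    """Greedily insert vocabulary words into `instruction` so that the target
    model's answer moves toward `target_output` (the hijacking goal)."""
    prompt = instruction
    goal_emb = embed(target_output)
    best_score = cosine(embed(query_target_model(prompt)), goal_emb)

    for _ in range(max_insertions):
        best_candidate, best_candidate_score = None, best_score
        words = prompt.split()
        # Try every vocabulary word at every insertion position.
        for token in vocabulary:
            for pos in range(len(words) + 1):
                candidate = " ".join(words[:pos] + [token] + words[pos:])
                score = cosine(embed(query_target_model(candidate)), goal_emb)
                if score > best_candidate_score:
                    best_candidate, best_candidate_score = candidate, score
        if best_candidate is None:  # no single insertion improves the score
            break
        prompt, best_score = best_candidate, best_candidate_score

    return prompt
```

In this sketch, `vocabulary` could be drawn from the attacker LLM's token vocabulary, and the target model is treated as a black box that is only queried for outputs; as the abstract notes, even a single inserted word can be sufficient, which corresponds to the loop terminating after one round.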