Vocabulary Attack to Hijack Large Language Model Applications (2404.02637v2)

Published 3 Apr 2024 in cs.CR, cs.AI, and cs.DC

Abstract: The fast advancements in LLMs are driving an increasing number of applications. Together with the growing number of users, we also see an increasing number of attackers who try to outsmart these systems. They want the model to reveal confidential information, specific false information, or offensive behavior. To this end, they manipulate their instructions for the LLM by inserting separators or rephrasing them systematically until they reach their goal. Our approach is different. It inserts words from the model vocabulary. We find these words using an optimization procedure and embeddings from another LLM (attacker LLM). We prove our approach by goal hijacking two popular open-source LLMs from the Llama2 and the Flan-T5 families, respectively. We present two main findings. First, our approach creates inconspicuous instructions and therefore it is hard to detect. For many attack cases, we find that even a single word insertion is sufficient. Second, we demonstrate that we can conduct our attack using a model different from the target model.

Summary

  • The paper introduces a novel vocabulary-based attack that manipulates LLM outputs through strategic word insertion.
  • The methodology uses an external attacker LLM to optimize word selection, causing models like Llama2 and Flan-T5 to deviate from expected behaviors.
  • Results highlight critical lexical vulnerabilities in LLMs, emphasizing the need for advanced defense mechanisms beyond structural input validation.

Vocabulary Attack to Hijack LLM Applications

Introduction to the Study

LLMs have advanced rapidly and are being incorporated into a growing range of applications. This adoption has opened avenues for exploitation: attackers aim to manipulate these models into revealing confidential information, propagating falsehoods, or producing offensive output. The paper introduces a vocabulary-based approach to such manipulations, distinct from conventional attack vectors that rely primarily on structural alterations of the instruction input. The method centers on inserting strategically chosen words from the model's own vocabulary to redirect the application's behavior, a technique demonstrated through testing on the Llama2 and Flan-T5 model families.

Overview of the Methodology

The methodology distinguishes itself by leveraging an external LLM (the attacker LLM) to identify words to insert into the target LLM's input prompt. The process hinges on an optimization procedure that selects words by their capacity to steer the model's output toward a desired malicious goal. The approach is evaluated against two core objectives: inciting the target model to generate offensive content and making it produce specific false statements. Efficacy is gauged through comparison with traditional attacks, particularly those employing character separators or wholesale system-prompt modifications. A minimal sketch of such an insertion-and-scoring loop is given below.
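
The following sketch illustrates this kind of vocabulary attack; it is not the authors' released code. Candidate words are scored by how close the target model's answer moves, in the attacker LLM's embedding space, to a hijacking goal. The model names, the random candidate sampling, and the cosine-similarity scoring are illustrative assumptions: the paper targets Flan-T5-XXL and Llama2-7B, whereas a small stand-in target is used here.

```python
# Illustrative sketch of a single-word vocabulary attack (assumed details, not the paper's code).
import random

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer

ATTACKER = "sentence-transformers/all-MiniLM-L6-v2"  # assumed attacker embedding model
TARGET = "google/flan-t5-small"                      # small stand-in for Flan-T5-XXL

atk_tok = AutoTokenizer.from_pretrained(ATTACKER)
atk_model = AutoModel.from_pretrained(ATTACKER)
tgt_tok = AutoTokenizer.from_pretrained(TARGET)
tgt_model = AutoModelForSeq2SeqLM.from_pretrained(TARGET)


def embed(text: str) -> torch.Tensor:
    """Mean-pooled attacker-LLM embedding used to score how close an answer is to the goal."""
    inputs = atk_tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = atk_model(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)


def query_target(prompt: str) -> str:
    """Black-box query of the target model."""
    ids = tgt_tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = tgt_model.generate(ids, max_new_tokens=40)
    return tgt_tok.decode(out[0], skip_special_tokens=True)


def vocabulary_attack(prompt: str, goal: str, candidates: list[str]) -> tuple[str, str]:
    """Greedily insert one candidate word at each position and keep the best-scoring prompt."""
    goal_vec = embed(goal)
    words = prompt.split()
    best_prompt, best_answer = prompt, query_target(prompt)
    best_score = F.cosine_similarity(embed(best_answer), goal_vec, dim=0)
    for word in candidates:
        for pos in range(len(words) + 1):
            trial = " ".join(words[:pos] + [word] + words[pos:])
            answer = query_target(trial)
            score = F.cosine_similarity(embed(answer), goal_vec, dim=0)
            if score > best_score:
                best_prompt, best_answer, best_score = trial, answer, score
    return best_prompt, best_answer


# Candidate words drawn from the target tokenizer's vocabulary (small random sample for brevity;
# every candidate/position pair costs one target query, so the search is kept deliberately tiny).
vocab = [w.lstrip("▁") for w in tgt_tok.get_vocab() if w.lstrip("▁").isalpha()]
hijacked_prompt, hijacked_answer = vocabulary_attack(
    "Translate to French: The weather is nice today.",
    goal="I cannot help with that request.",
    candidates=random.sample(vocab, 20),
)
print(hijacked_prompt, "->", hijacked_answer)
```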

Experimentation and Results

The experimental framework assesses the attack's performance through a battery of tests designed to challenge the LLMs under conditions mimicking real-world application scenarios. With Flan-T5-XXL and Llama2-7B as the primary targets, the attack achieves varying degrees of success across the two platforms. Notably, in some cases a single strategically positioned word, or a minimal sequence of words, was enough to divert the model's output significantly. This subtlety underscores the method's potential for bypassing conventional detection mechanisms aimed at such adversarial manipulations.
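
One simple way to quantify such outcomes, sketched below, is to measure a success rate over a set of clean prompts, under the assumption that a hijack counts as successful when the goal text appears in the manipulated answer; the paper's exact success criterion may differ. The `query_target` helper from the previous sketch is reused.

```python
# Hypothetical success-rate measurement for a fixed attack word (assumed criterion: goal containment).
def success_rate(prompts: list[str], goal: str, attack_word: str) -> float:
    """Prefix each prompt with the attack word and count how often the goal text appears."""
    hits = 0
    for prompt in prompts:
        answer = query_target(f"{attack_word} {prompt}")  # single-word prefix insertion
        if goal.lower() in answer.lower():
            hits += 1
    return hits / len(prompts)
```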

Theoretical and Practical Implications

From a theoretical standpoint, the paper sheds light on vulnerabilities inherent in how LLMs process lexical input. It underscores the nuanced understanding of model behavior needed to guard against vocabulary-based attacks, a gap largely unaddressed in existing literature. Practically, the findings lay groundwork for more robust defensive strategies that go beyond structural input validation to include lexical and contextual vigilance. Moreover, the demonstrated feasibility of launching attacks with an LLM other than the target model has broader implications for the security of open-source and proprietary LLMs alike.

Future Directions and Conclusion

The paper, while presenting a compelling proof of concept, paves the way for deeper inquiries into the resilience and adaptability of LLMs against a broader spectrum of adversarial attacks. Investigating the countermeasures capable of thwarting such vocabulary-based manipulations warrants significant attention. Future research could also explore the scalability of the proposed method across diverse LLM architectures and application domains, alongside examining the efficacy of similar attacks in multi-lingual or domain-specific contexts.

In conclusion, this exploration into vocabulary-based attack vectors on LLMs uncovers a critical vulnerability layer within these advanced computational systems. As LLM applications continue to permeate various facets of digital interaction and automation, understanding and mitigating such vulnerabilities become paramount. This paper contributes to the ongoing dialogue in AI security, offering both a novel perspective on model manipulation and a clarion call for comprehensive defensive measures in the age of generative AI.
