Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning (2407.03391v1)
Abstract: Prompt injection (both direct and indirect) and jailbreaking are now recognized as significant issues for LLMs, particularly due to their potential for harm in application-integrated contexts. This extended abstract explores a novel approach to protecting LLMs from such attacks, termed "soft begging." This method involves training soft prompts to counteract the effects of corrupted prompts on the LLM's output. We provide an overview of prompt injections and jailbreaking, introduce the theoretical basis of the "soft begging" technique, and discuss an evaluation of its effectiveness.
- Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity. arXiv Preprint, arXiv:2308.14132.
- Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.
- Struq: Defending against prompt injection with structured queries. arXiv:2402.06363.
- Building guardrails for large language models. Preprint, arXiv:2402.01822.
- Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, page 79–90, New York, NY, USA. Association for Computing Machinery.
- Defending against indirect prompt injection attacks with spotlighting. arXiv:2403.14720.
- Baseline defenses for adversarial attacks against aligned language models. arXiv Preprint, arXiv:2309.00614.
- Scaling down to scale up: A guide to parameter-efficient fine-tuning. Preprint, arXiv:2303.15647.
- Prompt injection attack against llm-integrated applications. arXiv:2306.05499.
- Jailbreaking chatgpt via prompt engineering: An empirical study. Preprint, arXiv:2305.13860.
- Formalizing and benchmarking prompt injection attacks and defenses. arXiv Preprint, arXiv:2310.12815.
- Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop.
- Hijacking large language models via adversarial in-context learning. Preprint, arXiv:2311.09948.
- Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv:2312.14197.
- Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv:2403.02691.
- Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043.
- Can llms separate instructions from data? and what do we even mean by that? Preprint, arXiv:2403.06833.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.