Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash 102 tok/s
Gemini 2.5 Pro 51 tok/s Pro
GPT-5 Medium 30 tok/s
GPT-5 High 27 tok/s Pro
GPT-4o 110 tok/s
GPT OSS 120B 475 tok/s Pro
Kimi K2 203 tok/s Pro
2000 character limit reached

Soft Begging: Modular and Efficient Shielding of LLMs against Prompt Injection and Jailbreaking based on Prompt Tuning (2407.03391v1)

Published 3 Jul 2024 in cs.CR, cs.AI, and cs.CL

Abstract: Prompt injection (both direct and indirect) and jailbreaking are now recognized as significant issues for LLMs, particularly due to their potential for harm in application-integrated contexts. This extended abstract explores a novel approach to protecting LLMs from such attacks, termed "soft begging." This method involves training soft prompts to counteract the effects of corrupted prompts on the LLM's output. We provide an overview of prompt injections and jailbreaking, introduce the theoretical basis of the "soft begging" technique, and discuss an evaluation of its effectiveness.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (17)
  1. Gabriel Alon and Michael Kamfonas. 2023. Detecting language model attacks with perplexity. arXiv Preprint, arXiv:2308.14132.
  2. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pages 2633–2650.
  3. Struq: Defending against prompt injection with structured queries. arXiv:2402.06363.
  4. Building guardrails for large language models. Preprint, arXiv:2402.01822.
  5. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, AISec ’23, page 79–90, New York, NY, USA. Association for Computing Machinery.
  6. Defending against indirect prompt injection attacks with spotlighting. arXiv:2403.14720.
  7. Baseline defenses for adversarial attacks against aligned language models. arXiv Preprint, arXiv:2309.00614.
  8. Scaling down to scale up: A guide to parameter-efficient fine-tuning. Preprint, arXiv:2303.15647.
  9. Prompt injection attack against llm-integrated applications. arXiv:2306.05499.
  10. Jailbreaking chatgpt via prompt engineering: An empirical study. Preprint, arXiv:2305.13860.
  11. Formalizing and benchmarking prompt injection attacks and defenses. arXiv Preprint, arXiv:2310.12815.
  12. Fábio Perez and Ian Ribeiro. 2022. Ignore previous prompt: Attack techniques for language models. In NeurIPS ML Safety Workshop.
  13. Hijacking large language models via adversarial in-context learning. Preprint, arXiv:2311.09948.
  14. Benchmarking and defending against indirect prompt injection attacks on large language models. arXiv:2312.14197.
  15. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. arXiv:2403.02691.
  16. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043.
  17. Can llms separate instructions from data? and what do we even mean by that? Preprint, arXiv:2403.06833.
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

We haven't generated a summary for this paper yet.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Paper Prompts

Sign up for free to create and run prompts on this paper using GPT-5.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube