Prompt Leakage effect and defense strategies for multi-turn LLM interactions (2404.16251v3)
Abstract: Prompt leakage poses a compelling security and privacy threat in LLM applications. Leakage of system prompts may compromise intellectual property, and act as adversarial reconnaissance for an attacker. A systematic evaluation of prompt leakage threats and mitigation strategies is lacking, especially for multi-turn LLM interactions. In this paper, we systematically investigate LLM vulnerabilities against prompt leakage for 10 closed- and open-source LLMs, across four domains. We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. Our standardized setup further allows dissecting leakage of specific prompt contents such as task instructions and knowledge documents. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts. We present different combination of defenses against our threat model, including a cost analysis. Our study highlights key takeaways for building secure LLM applications and provides directions for research in multi-turn LLM interactions
- Bioasq: A challenge on large-scale biomedical semantic indexing and question answering. In Multimodal Retrieval in the Medical Domain: First International Workshop, MRMD 2015, Vienna, Austria, March 29, 2015, Revised Selected Papers, pp. 26–39. Springer, 2015.
- Extracting training data from large language models, 2021.
- Fnspid: A comprehensive financial news dataset in time series, 2024.
- Coercing llms to do and reveal (almost) anything, 2024.
- Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection, 2023.
- Are large pre-trained language models leaking your personal information? arXiv preprint arXiv:2205.12628, 2022.
- Preventing generation of verbatim memorization in language models gives a false sense of privacy. In C. Maria Keet, Hung-Yi Lee, and Sina Zarrieß (eds.), Proceedings of the 16th International Natural Language Generation Conference, pp. 28–53, Prague, Czechia, September 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.inlg-main.3. URL https://aclanthology.org/2023.inlg-main.3.
- Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
- Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
- Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
- Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
- Propile: Probing privacy leakage in large language models. Advances in Neural Information Processing Systems, 36, 2024.
- Billsum: A corpus for automatic summarization of us legislation. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pp. 48–56, 2019.
- Are you sure? challenging llms leads to performance drops in the flipflop experiment. arXiv preprint arXiv:2311.08596, 2023.
- Multi-step jailbreaking privacy attacks on chatgpt. arXiv preprint arXiv:2304.05197, 2023.
- Query rewriting via large language models, 2024.
- Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499, 2023.
- Query rewriting for retrieval-augmented large language models, 2023.
- OpenAI. Gpt-4 technical report. ArXiv, abs/2303.08774, 2023.
- Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022.
- Follow my instruction and spill the beans: Scalable data extraction from retrieval-augmented generation systems, 2024.
- In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023.
- Prompt stealing attacks against large language models, 2024.
- Towards understanding sycophancy in language models. ArXiv, abs/2310.13548, 2023. URL https://api.semanticscholar.org/CorpusID:264405698.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- The art of defending: A systematic evaluation and analysis of llm defense strategies on safety and over-defensiveness. arXiv preprint arXiv:2401.00287, 2023.
- A new era in llm security: Exploring security concerns in real-world llm-based systems, 2024.
- Llm jailbreak attack versus defense techniques – a comprehensive study, 2024.
- Prsa: Prompt reverse stealing attacks against large language models, 2024.
- Benchmarking and defending against indirect prompt injection attacks on large language models, 2023.
- The good and the bad: Exploring privacy issues in retrieval-augmented generation (rag), 2024.
- Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents, 2024.
- Effective prompt extraction from language models, 2024a.
- Intention analysis prompting makes large language models a good jailbreak defender. arXiv preprint arXiv:2401.06561, 2024b.
- Defending large language models against jailbreaking attacks through goal prioritization, 2023.
- Autodan: Interpretable gradient-based adversarial attacks on large language models, 2023.
Collections
Sign up for free to add this paper to one or more collections.
Paper Prompts
Sign up for free to create and run prompts on this paper using GPT-5.