Prompt Leakage effect and defense strategies for multi-turn LLM interactions (2404.16251v3)
Abstract: Prompt leakage poses a compelling security and privacy threat in LLM applications. Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker. A systematic evaluation of prompt leakage threats and mitigation strategies is lacking, especially for multi-turn LLM interactions. In this paper, we systematically investigate LLM vulnerabilities to prompt leakage across 10 closed- and open-source LLMs and four domains. We design a unique threat model that leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting. Our standardized setup further allows dissecting leakage of specific prompt contents such as task instructions and knowledge documents. We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts. We present different combinations of defenses against our threat model, including a cost analysis. Our study highlights key takeaways for building secure LLM applications and provides directions for research in multi-turn LLM interactions.
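As a rough illustration of the multi-turn setup the abstract describes, the sketch below shows how an attack success rate (ASR) might be computed over two-turn attack attempts, where a sycophancy-style follow-up challenges the model after an initial leakage query. This is a minimal sketch under stated assumptions, not the paper's actual harness: `query_model`, `contains_leak`, and the two attack prompts are hypothetical placeholders for an LLM chat API, a leak detector, and the paper's attack prompts.

```python
# Hypothetical sketch of a two-turn prompt-leakage evaluation loop.
# query_model and contains_leak are placeholders, not the paper's implementation.

def query_model(messages):
    """Placeholder: plug in a chat-completion API that accepts a message list."""
    raise NotImplementedError

def contains_leak(response, system_prompt):
    # Naive leak check: any sufficiently long sentence of the system prompt
    # appearing verbatim in the response counts as leakage.
    return any(chunk in response
               for chunk in system_prompt.split(". ") if len(chunk) > 20)

def attack_success_rate(system_prompts, attack_turn_1, attack_turn_2):
    """Fraction of system prompts leaked within a two-turn attack."""
    successes = 0
    for sys_prompt in system_prompts:
        history = [{"role": "system", "content": sys_prompt},
                   {"role": "user", "content": attack_turn_1}]
        reply_1 = query_model(history)
        if contains_leak(reply_1, sys_prompt):
            successes += 1
            continue
        # Second turn: sycophancy-style challenge after an initial refusal.
        history += [{"role": "assistant", "content": reply_1},
                    {"role": "user", "content": attack_turn_2}]
        reply_2 = query_model(history)
        if contains_leak(reply_2, sys_prompt):
            successes += 1
    return successes / len(system_prompts)
```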