Prompt Stealing Attacks Against Large Language Models (2402.12959v1)
Abstract: The increasing reliance on LLMs such as ChatGPT in various fields emphasizes the importance of "prompt engineering," a technique for improving the quality of model outputs. With companies investing significantly in expert prompt engineers and educational resources emerging to meet market demand, designing high-quality prompts has become an intriguing challenge. In this paper, we propose a novel attack against LLMs, named the prompt stealing attack, which aims to steal these well-designed prompts based on the generated answers. The attack consists of two primary modules: the parameter extractor and the prompt reconstructor. The goal of the parameter extractor is to infer the properties of the original prompt. We first observe that most prompts fall into one of three categories: direct prompts, role-based prompts, and in-context prompts. Our parameter extractor first distinguishes the prompt type based on the generated answer; depending on the type, it then further predicts which role is specified or how many in-context examples are used. Given the generated answers and these extracted features, the prompt reconstructor then generates reversed prompts that are similar to the original ones. Our experimental results demonstrate the strong performance of the proposed attacks. Our attacks add a new dimension to the study of prompt engineering and call for more attention to the security issues of LLMs.
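For intuition, below is a minimal sketch of the two-module pipeline described in the abstract, assuming a simple text classifier for the parameter extractor and an LLM-based inversion step for the prompt reconstructor. The scikit-learn pipeline, the toy training examples, and the wording of the reconstruction instruction are illustrative assumptions rather than the authors' implementation; in particular, the paper's extractor also predicts the specific role or the number of in-context examples, which is omitted here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# --- Module 1: parameter extractor --------------------------------------
# A text classifier that predicts the prompt type ("direct", "role-based",
# or "in-context") from the generated answer alone. The toy training data
# below stands in for answers the adversary has collected and labeled.
train_answers = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "As your personal travel agent, I recommend visiting Kyoto in early April.",
    "Following the labeled examples above, the sentiment of this review is positive.",
]
train_types = ["direct", "role-based", "in-context"]

type_classifier = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
type_classifier.fit(train_answers, train_types)

def extract_parameters(answer: str) -> str:
    """Predict which category the hidden prompt most likely belongs to."""
    return type_classifier.predict([answer])[0]

# --- Module 2: prompt reconstructor --------------------------------------
# Given the answer and the inferred prompt type, ask an LLM to propose a
# "reversed" prompt that could plausibly have produced the answer.
def reconstruct_prompt(answer: str, prompt_type: str, llm) -> str:
    """`llm` is any callable that maps an instruction string to a completion."""
    instruction = (
        f"The following text was generated by an LLM from a hidden {prompt_type} "
        f"prompt. Write a prompt that would plausibly produce this exact answer.\n\n"
        f"Answer:\n{answer}"
    )
    return llm(instruction)

# Example end-to-end usage with a stubbed LLM callable:
if __name__ == "__main__":
    stolen_answer = "As your personal fitness coach, I suggest starting with light cardio."
    inferred_type = extract_parameters(stolen_answer)
    reversed_prompt = reconstruct_prompt(
        stolen_answer, inferred_type, llm=lambda s: f"[LLM completion for: {s[:60]}...]"
    )
    print(inferred_type, reversed_prompt)
```

In the paper's setting the reconstructor would be driven by a real LLM rather than the stub above; the key design point illustrated here is that the extracted prompt type conditions the reconstruction step.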
Authors: Zeyang Sha, Yang Zhang