PRSA: PRompt Stealing Attacks against Large Language Models (2402.19200v2)
Abstract: In recent years, "prompt as a service" has greatly enhanced the utility of LLMs by enabling them to perform various downstream tasks efficiently without fine-tuning. This has also increased the commercial value of prompts. However, the potential risk of leakage in these commercialized prompts remains largely underexplored. In this paper, we introduce PRSA, a novel attack framework for prompt stealing attacks against LLMs. The main idea of PRSA is to infer the intent behind a prompt by analyzing its input-output content, enabling the generation of a surrogate prompt that replicates the original's functionality. Specifically, PRSA consists of two key phases: prompt mutation and prompt pruning. In the mutation phase, we propose a prompt attention algorithm based on output difference. The algorithm facilitates the generation of effective surrogate prompts by learning the key factors that influence accurate inference of prompt intent. In the pruning phase, we employ a two-step related word identification strategy to detect and mask words that are highly related to the input, thus improving the generalizability of the surrogate prompts. We verify the real-world threat of PRSA by evaluating it in two settings: non-interactive and interactive prompt services. The results strongly confirm PRSA's effectiveness and generalizability. We have reported these findings to prompt service providers and are actively collaborating with them to implement defensive measures.
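To make the two-phase pipeline concrete, below is a minimal Python sketch of how such an attack could be structured. This is an illustration under assumptions, not the paper's implementation: the `llm` and `rewrite` callables, the Jaccard-based `output_difference`, and the exact-match pruning are hypothetical stand-ins, and the paper's actual prompt attention algorithm and two-step related word identification strategy are more elaborate than these heuristics.

```python
# Hypothetical sketch of the PRSA pipeline described in the abstract.
# All names and heuristics here are illustrative assumptions.
from typing import Callable, List, Tuple

def output_difference(reference: str, candidate: str) -> float:
    """Token-level Jaccard distance as a stand-in for the paper's
    output-difference measure (lower = more similar outputs)."""
    ref, cand = set(reference.lower().split()), set(candidate.lower().split())
    if not ref and not cand:
        return 0.0
    return 1.0 - len(ref & cand) / len(ref | cand)

def mutate(surrogate: str,
           pairs: List[Tuple[str, str]],
           llm: Callable[[str], str],
           rewrite: Callable[[str, str], str],
           steps: int = 5) -> str:
    """Mutation phase: iteratively rewrite the surrogate prompt,
    keeping only mutations that reduce the total output difference
    over the observed input-output pairs."""
    def loss(prompt: str) -> float:
        return sum(output_difference(out, llm(f"{prompt}\n\n{inp}"))
                   for inp, out in pairs)

    best, best_loss = surrogate, loss(surrogate)
    for _ in range(steps):
        # `rewrite` plays the role of the prompt attention step:
        # it proposes a revision guided by feedback on how far the
        # surrogate's outputs diverge from the target's.
        candidate = rewrite(best, f"current output difference: {best_loss:.3f}")
        cand_loss = loss(candidate)
        if cand_loss < best_loss:
            best, best_loss = candidate, cand_loss
    return best

def prune(surrogate: str, inputs: List[str], mask: str = "[INPUT]") -> str:
    """Pruning phase, simplified to a single heuristic: collect words
    occurring in the observed inputs, then mask surrogate words that
    match them exactly (the paper's two-step strategy is more refined),
    so the surrogate generalizes beyond the observed inputs."""
    input_words = {w.lower() for text in inputs for w in text.split()}
    return " ".join(mask if w.lower() in input_words else w
                    for w in surrogate.split())
```

In practice, `llm` would wrap queries to the target prompt service and `rewrite` would wrap an attacker-controlled LLM that proposes prompt revisions; both are left abstract here so the sketch stays self-contained.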
Authors: Yong Yang, Xuhong Zhang, Yi Jiang, Xi Chen, Haoyu Wang, Shouling Ji, Zonghui Wang, Changjiang Li