Model Leeching: An Extraction Attack Targeting LLMs (2309.10544v1)
Abstract: Model Leeching is a novel extraction attack targeting LLMs, capable of distilling task-specific knowledge from a target LLM into a reduced-parameter model. We demonstrate the effectiveness of our attack by extracting task capability from ChatGPT-3.5-Turbo, achieving 73% Exact Match (EM) similarity with the target, and SQuAD EM and F1 accuracy scores of 75% and 87%, respectively, for only $50 in API cost. We further demonstrate the feasibility of adversarial attack transferability, using a model extracted via Model Leeching to perform ML attack staging against the target LLM, resulting in an 11% increase in attack success rate when applied to ChatGPT-3.5-Turbo.
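The abstract describes a two-stage pipeline: harvest task-specific labels by querying the target LLM over a public dataset, then distill those labels into a smaller student model. The sketch below illustrates that shape only, under loud assumptions: the prompt template, the gpt-3.5-turbo chat endpoint usage, the t5-small student, the 1,000-example SQuAD slice, and all hyperparameters are illustrative stand-ins, not the paper's actual method or configuration.

```python
# Minimal sketch of a leech-then-distill pipeline (NOT the paper's exact method).
# Stage 1: label SQuAD examples by querying the target LLM.
# Stage 2: fine-tune a small student model on the harvested labels.
from datasets import load_dataset
from openai import OpenAI
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Trainer, TrainingArguments)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def leech_label(question: str, context: str) -> str:
    """Stage 1: ask the target LLM for an extractive answer (assumed prompt)."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": (f"Answer with an exact span from the context.\n"
                        f"Context: {context}\nQuestion: {question}"),
        }],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()


# Harvest a small, illustrative slice of SQuAD (the paper used far more data).
squad = load_dataset("squad", split="train[:1000]")
leeched = [(ex["question"], ex["context"],
            leech_label(ex["question"], ex["context"])) for ex in squad]

# Stage 2: distill into a reduced-parameter student; t5-small is an assumed
# stand-in for whatever extracted-model architecture the authors chose.
tok = AutoTokenizer.from_pretrained("t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("t5-small")


def encode(q: str, c: str, a: str) -> dict:
    """Tokenize one (question, context) -> answer pair for seq2seq training."""
    x = tok(f"question: {q} context: {c}", truncation=True,
            max_length=512, padding="max_length")
    labels = tok(a, truncation=True, max_length=32,
                 padding="max_length")["input_ids"]
    # Mask padding in the labels so it is ignored by the loss.
    x["labels"] = [t if t != tok.pad_token_id else -100 for t in labels]
    return x


train_data = [encode(q, c, a) for q, c, a in leeched]
Trainer(
    model=student,
    args=TrainingArguments(output_dir="leeched-student",
                           per_device_train_batch_size=8,
                           num_train_epochs=2),
    train_dataset=train_data,
).train()
```

Querying at temperature 0 keeps the harvested labels deterministic, which is one plausible choice for distillation; the resulting student could then serve as the local surrogate for the attack-staging step the abstract describes.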