Language Model Inversion (2311.13647v1)
Abstract: Language models produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of language model inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text even in cases where it is hidden from the user, which motivates a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model-access scenarios and show that, even without predictions for every token in the vocabulary, we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and a token-level F1 of $78$, and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at http://github.com/jxmorris12/vec2text.
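The abstract notes that the full probability vector can be recovered "through search" even when the API does not expose predictions for every vocabulary token. The sketch below illustrates one way such a search could work; it is a minimal illustration, not the authors' implementation, assuming a hypothetical argmax-only API `query_argmax` that accepts a per-token logit bias (the bias ceiling `hi=40.0` and both function names are assumptions). Increasing the bias on a target token monotonically raises its logit, so binary-searching the crossover point at which it overtakes the default top token recovers the gap between their logits; softmaxing the recovered gaps then yields the distribution.

```python
# Minimal sketch (assumptions noted above): recover relative logits from an
# argmax-only API by binary search on logit bias, then softmax the gaps.
import math

def recover_logit_gap(query_argmax, prompt, token_id, top_id,
                      lo=0.0, hi=40.0, iters=30):
    """Binary-search the smallest bias that makes `token_id` the argmax.
    At the crossover, bias ~= logit(top_id) - logit(token_id)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if query_argmax(prompt, logit_bias={token_id: mid}) == token_id:
            hi = mid   # bias is large enough: target token overtakes the top token
        else:
            lo = mid   # not yet: a larger bias is needed
    return (lo + hi) / 2  # approximate logit gap to the unbiased argmax

def recover_distribution(query_argmax, prompt, vocab_ids):
    """Recover a softmax distribution over `vocab_ids`, anchored at the
    unbiased argmax token (whose gap to itself is zero)."""
    top_id = query_argmax(prompt, logit_bias={})
    gaps = {top_id: 0.0}
    for v in vocab_ids:
        if v != top_id:
            # gap = logit(v) - logit(top_id), which is <= 0 by definition
            gaps[v] = -recover_logit_gap(query_argmax, prompt, v, top_id)
    normalizer = math.fsum(math.exp(g) for g in gaps.values())
    return {v: math.exp(g) / normalizer for v, g in gaps.items()}
```

Because the softmax is shift-invariant, knowing every logit relative to the argmax token suffices; the price is roughly `iters` API calls per vocabulary token, i.e. O(|V| log(range/precision)) queries for a full distribution.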