Language Model Inversion (2311.13647v1)

Published 22 Nov 2023 in cs.CL and cs.LG

Abstract: LLMs produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of LLM inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at http://github.com/jxmorris12/vec2text.


Summary

  • The paper formalizes language model inversion, transforming next-token probability vectors into sequences of pseudo-embeddings that a pretrained encoder-decoder decodes back into the prompt.
  • The experimental validation using Llama-2 7B achieves a BLEU score of 59, token-level F1 of 78, and 27% exact prompt recovery.
  • The study explores LMaaS inversion across various access scenarios, demonstrating improvements over jailbreak approaches and scalability to larger models.

An Overview of "Language Model Inversion"

The paper "LLM Inversion" by Morris et al. addresses the problem of reconstructing input prompts from the output distributions of autoregressive LLMs. This work provides a comprehensive exploration of the potential to invert the predictions of LLMs, focusing on scenarios where the prompt might be obscured from the user, particularly in LLMs offered as a service (LMaaS) contexts.

Key Contributions

The principal contributions of the paper can be summarized as follows:

  1. Formalization of Language Model Inversion:
    • The authors define language model inversion as the task of reconstructing an input prompt given only the model's next-token probability distribution (stated more formally after this list).
    • They propose "unrolling" that distribution vector into a sequence of pseudo-embeddings, which a pretrained encoder-decoder LLM then decodes back into the prompt.
  2. Experimental Validation:
    • Using the Llama-2 7B model, the paper shows that the inversion method reconstructs prompts with a BLEU score of 59 and a token-level F1 of 78, and recovers prompts exactly 27% of the time.
  3. Inversion Across Various Access Scenarios:
    • The paper explores different access patterns—ranging from full distribution outputs to text-only outputs—and shows that even with limited information, it is feasible to reconstruct the original prompts.
  4. Advances Over Existing Jailbreak Approaches:
    • Unlike jailbreak-style approaches that rely on prompting the model to reveal its prompt in generated text, the proposed inversion technique is less hindered by alignment tuning such as RLHF (Reinforcement Learning from Human Feedback).
    • Experimental results indicate that while jailbreak strings are somewhat effective, the proposed model consistently outperforms them, especially on RLHF-tuned chat versions of LLMs.
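
Stated a bit more formally (with notation chosen here for illustration rather than taken from the paper), the inversion problem of contribution 1 is:

```latex
% Prompt x = (x_1, \ldots, x_T) over vocabulary V; the service exposes only
% the next-token distribution
v \;=\; p(\,\cdot \mid x_1, \ldots, x_T) \;\in\; \Delta^{|V|-1}.
% Inversion learns a model f_\theta that maps this single vector back to text:
\hat{x} \;=\; f_\theta(v), \qquad \text{with the goal that } \hat{x} \approx x.
```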

Methodology

The research introduces an architecture designed for inverting LLM probabilities:

  1. Unrolling Probabilities:
    • The probability vector is transformed into a sequence of pseudo-embeddings, making it suitable for processing by an encoder-decoder architecture and avoiding the lossy step of compressing a vocabulary-sized softmax vector into a single input embedding (a sketch follows this list).
  2. API-Based Logit Extraction:
    • Because many LMaaS APIs do not expose full logit distributions, the authors develop a binary-search procedure that recovers next-token probabilities from the limited rank-order information an API does reveal, broadening the method's practical applicability (one such procedure is sketched after this list).
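
A minimal sketch of the unrolling step in item 1, in PyTorch. The chunk size, the log-space input, and the per-chunk MLP are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ProbabilityUnroller(nn.Module):
    """Turn one next-token probability vector into a sequence of
    pseudo-embeddings that an encoder-decoder can attend over.

    Hypothetical configuration: chunk size, log-space input, and the
    per-chunk MLP are illustrative choices, not the paper's exact setup.
    """

    def __init__(self, vocab_size=32_000, chunk_size=64, d_model=768):
        super().__init__()
        assert vocab_size % chunk_size == 0
        self.chunk_size = chunk_size
        self.num_chunks = vocab_size // chunk_size   # sequence length fed to the encoder
        self.proj = nn.Sequential(                   # small MLP applied to each chunk
            nn.Linear(chunk_size, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, probs):                        # probs: (batch, vocab_size)
        x = torch.log(probs.clamp_min(1e-9))         # log space for numerical range
        x = x.view(x.size(0), self.num_chunks, self.chunk_size)
        return self.proj(x)                          # (batch, num_chunks, d_model)

# The resulting tensor replaces token embeddings at the input of a pretrained
# encoder-decoder (e.g. a T5-style model), whose decoder is trained to emit
# the original prompt.
probs = torch.softmax(torch.randn(1, 32_000), dim=-1)
print(ProbabilityUnroller()(probs).shape)            # torch.Size([1, 500, 768])
```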
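
And a sketch of the item-2 idea of recovering the distribution through search when the API reveals only its most likely token. Here the search is driven by a per-token logit bias argument, which is an assumption about the API surface; the `query_api` callable is hypothetical, not from the paper:

```python
def extract_logit_gaps(query_api, vocab, max_bias=40.0, tol=1e-3):
    """Recover relative logits from an API that reveals only its most likely
    next token but accepts a per-token logit bias.

    `query_api(logit_bias)` is a hypothetical callable returning the argmax
    token after the bias is applied. The recovered values are logit gaps
    relative to the unbiased top token, which determine the full
    distribution up to softmax normalization.
    """
    top_token = query_api({})              # argmax with no bias applied
    gaps = {top_token: 0.0}
    for token in vocab:
        if token == top_token:
            continue
        lo, hi = 0.0, max_bias             # search for the smallest winning bias
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if query_api({token: mid}) == token:
                hi = mid                   # bias large enough to flip the argmax
            else:
                lo = mid
        gaps[token] = -hi                  # logit(token) - logit(top_token) ~= -hi
    return gaps
```

Each token costs roughly log2(max_bias / tol) queries, about 15 under these settings, so a full vocabulary can be recovered mechanically, if not cheaply.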

Experimental Validation

The proposed model's efficacy is validated across several datasets, both in-distribution (Instructions-2M) and out-of-distribution (Alpaca Code Generation, Anthropic HH). Key findings include:

  • The inversion model achieves strong performance on the in-distribution Instructions-2M dataset, significantly outperforming few-shot GPT-4 baselines and human-crafted jailbreak strings (the token-level F1 metric reported for these comparisons is sketched after this list).
  • The model generalizes reasonably well to out-of-distribution data, though with a noted performance dip relative to in-distribution data.
  • The research also highlights the model's ability to scale across different model sizes, maintaining reasonable inversion efficacy when extending from 7B to 13B and 70B parameter versions of Llama-2.
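
For reference, the token-level F1 cited above can be computed as below; this is the common multiset-overlap definition, and the paper's exact tokenization and averaging may differ:

```python
from collections import Counter

def token_f1(predicted_tokens, reference_tokens):
    """Token-level F1 between a reconstructed prompt and the true prompt,
    using the common multiset-overlap definition."""
    overlap = Counter(predicted_tokens) & Counter(reference_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(predicted_tokens)
    recall = num_same / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially correct reconstruction scores well below an exact match:
print(token_f1("write a short poem about cats".split(),
               "write a poem about my cats".split()))   # ~0.83
```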

Implications and Future Work

The implications of this work are multi-faceted:

  • Security Concerns:
    • The findings underscore potential privacy risks associated with LMaaS, suggesting that even limited access to model outputs can leak substantial information about input prompts.
    • Defenses against such inversion attacks could include sampling-based mechanisms that perturb or sample from the returned distribution, although the trade-off between output fidelity and prompt security needs careful consideration (an illustrative sketch follows this list).
  • Scalability of Inversion Models:
    • Results suggest that inversion performance scales positively with model size, motivating further research into larger and more complex inverter architectures.
  • Iterative Refinement:
    • The paper briefly explores iterative refinement techniques but notes limited success; further research could probe more deeply into this area to develop more robust inversion methods.
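
To make the sampling-based defense mentioned above concrete, one illustrative option (not prescribed by the paper) is to perturb the distribution before returning it, degrading the inverter's signal at some cost in output fidelity:

```python
import torch

def defended_distribution(probs, noise_scale=0.01):
    """Illustrative sampling/noise defense: perturb the next-token
    distribution before exposing it, trading output fidelity for prompt
    privacy. The noise form and scale here are assumptions."""
    noisy = probs + noise_scale * torch.rand_like(probs)   # additive noise
    return noisy / noisy.sum(dim=-1, keepdim=True)         # renormalize
```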

In conclusion, the work by Morris et al. provides a detailed exploration of the theoretical and practical facets of LLM inversion, offering a robust methodology and comprehensive experimental evidence. Future research should continue to expand on these findings, particularly in the realms of defense strategies and the scaling of inversion techniques to larger, more complex LLM architectures.