Language Model Inversion (2311.13647v1)

Published 22 Nov 2023 in cs.CL and cs.LG

Abstract: LLMs produce a distribution over the next token; can we use this information to recover the prompt tokens? We consider the problem of LLM inversion and show that next-token probabilities contain a surprising amount of information about the preceding text. Often we can recover the text in cases where it is hidden from the user, motivating a method for recovering unknown prompts given only the model's current distribution output. We consider a variety of model access scenarios, and show how even without predictions for every token in the vocabulary we can recover the probability vector through search. On Llama-2 7b, our inversion method reconstructs prompts with a BLEU of $59$ and token-level F1 of $78$ and recovers $27\%$ of prompts exactly. Code for reproducing all experiments is available at http://github.com/jxmorris12/vec2text.


Summary

  • The paper formalizes language model inversion, transforming next-token probability vectors into sequences of pseudo-embeddings that a pretrained encoder-decoder decodes back into the prompt.
  • The experimental validation using Llama-2 7B achieves a BLEU score of 59, token-level F1 of 78, and 27% exact prompt recovery.
  • The study explores LMaaS inversion across various access scenarios, demonstrating improvements over jailbreak approaches and scalability to larger models.

An Overview of "Language Model Inversion"

The paper "LLM Inversion" by Morris et al. addresses the problem of reconstructing input prompts from the output distributions of autoregressive LLMs. This work provides a comprehensive exploration of the potential to invert the predictions of LLMs, focusing on scenarios where the prompt might be obscured from the user, particularly in LLMs offered as a service (LMaaS) contexts.

Key Contributions

The principal contributions of the paper can be summarized as follows:

  1. Formalization of Language Model Inversion:
    • The authors define language model inversion as the task of reconstructing an input prompt given only the model's next-token probability distribution (stated more formally after this list).
    • They propose "unrolling" that distribution vector into a sequence of pseudo-embeddings, which a pretrained encoder-decoder LLM then decodes back into the prompt.
  2. Experimental Validation:
    • Using the Llama-2 7B model, the paper shows that the inversion method reconstructs prompts with a BLEU score of 59 and a token-level F1 of 78, and recovers prompts exactly 27% of the time.
  3. Inversion Across Various Access Scenarios:
    • The paper explores different access patterns—ranging from full distribution outputs to text-only outputs—and shows that even with limited information, it is feasible to reconstruct the original prompts.
  4. Advances Over Existing Jailbreak Approaches:
    • Unlike jailbreak-style approaches that rely on prompting the model to reveal its prompt in generated text, the proposed inversion technique is less hindered by alignment tuning such as RLHF (Reinforcement Learning from Human Feedback).
    • Experimental results indicate that while jailbreak strings are somewhat effective, the proposed model consistently outperforms them, especially on RLHF-tuned chat versions of LLMs.
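
Stated a bit more formally (with notation chosen here for illustration rather than taken from the paper), the inversion problem of contribution 1 is:

```latex
% Prompt x = (x_1, \ldots, x_T) over vocabulary V; the service exposes only
% the next-token distribution
v \;=\; p(\,\cdot \mid x_1, \ldots, x_T) \;\in\; \Delta^{|V|-1}.
% Inversion learns a model f_\theta that maps this single vector back to text:
\hat{x} \;=\; f_\theta(v), \qquad \text{with the goal that } \hat{x} \approx x.
```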

Methodology

The research introduces an architecture designed for inverting LLM probabilities:

  1. Unrolling Probabilities:
    • The probability vector is transformed into a sequence of pseudo-embeddings, making it suitable for processing by an encoder-decoder architecture and avoiding the lossy step of compressing a vocabulary-sized softmax vector into a single input embedding (a sketch follows this list).
  2. API-Based Logit Extraction:
    • Because many LMaaS APIs do not expose full logit distributions, the authors develop a binary-search procedure that recovers next-token probabilities from the limited rank-order information an API does reveal, broadening the method's practical applicability (one such procedure is sketched after this list).
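
A minimal sketch of the unrolling step in item 1, in PyTorch. The chunk size, the log-space input, and the per-chunk MLP are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ProbabilityUnroller(nn.Module):
    """Turn one next-token probability vector into a sequence of
    pseudo-embeddings that an encoder-decoder can attend over.

    Hypothetical configuration: chunk size, log-space input, and the
    per-chunk MLP are illustrative choices, not the paper's exact setup.
    """

    def __init__(self, vocab_size=32_000, chunk_size=64, d_model=768):
        super().__init__()
        assert vocab_size % chunk_size == 0
        self.chunk_size = chunk_size
        self.num_chunks = vocab_size // chunk_size   # sequence length fed to the encoder
        self.proj = nn.Sequential(                   # small MLP applied to each chunk
            nn.Linear(chunk_size, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, probs):                        # probs: (batch, vocab_size)
        x = torch.log(probs.clamp_min(1e-9))         # log space for numerical range
        x = x.view(x.size(0), self.num_chunks, self.chunk_size)
        return self.proj(x)                          # (batch, num_chunks, d_model)

# The resulting tensor replaces token embeddings at the input of a pretrained
# encoder-decoder (e.g. a T5-style model), whose decoder is trained to emit
# the original prompt.
probs = torch.softmax(torch.randn(1, 32_000), dim=-1)
print(ProbabilityUnroller()(probs).shape)            # torch.Size([1, 500, 768])
```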
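
And a sketch of the item-2 idea of recovering the distribution through search when the API reveals only its most likely token. Here the search is driven by a per-token logit bias argument, which is an assumption about the API surface; the `query_api` callable is hypothetical, not from the paper:

```python
def extract_logit_gaps(query_api, vocab, max_bias=40.0, tol=1e-3):
    """Recover relative logits from an API that reveals only its most likely
    next token but accepts a per-token logit bias.

    `query_api(logit_bias)` is a hypothetical callable returning the argmax
    token after the bias is applied. The recovered values are logit gaps
    relative to the unbiased top token, which determine the full
    distribution up to softmax normalization.
    """
    top_token = query_api({})              # argmax with no bias applied
    gaps = {top_token: 0.0}
    for token in vocab:
        if token == top_token:
            continue
        lo, hi = 0.0, max_bias             # search for the smallest winning bias
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            if query_api({token: mid}) == token:
                hi = mid                   # bias large enough to flip the argmax
            else:
                lo = mid
        gaps[token] = -hi                  # logit(token) - logit(top_token) ~= -hi
    return gaps
```

Each token costs roughly log2(max_bias / tol) queries, about 15 under these settings, so a full vocabulary can be recovered mechanically, if not cheaply.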

Experimental Validation

The proposed model's efficacy is validated across several datasets, both in-distribution (Instructions-2M) and out-of-distribution (Alpaca Code Generation, Anthropic HH). Key findings include:

  • The inversion model achieves strong performance on the in-distribution Instructions-2M dataset, significantly outperforming few-shot GPT-4 baselines and human-crafted jailbreak strings (the token-level F1 metric reported for these comparisons is sketched after this list).
  • The model generalizes reasonably well to out-of-distribution data, though with a noted performance dip relative to in-distribution data.
  • The research also highlights the model's ability to scale across different model sizes, maintaining reasonable inversion efficacy when extending from 7B to 13B and 70B parameter versions of Llama-2.
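
For reference, the token-level F1 cited above can be computed as below; this is the common multiset-overlap definition, and the paper's exact tokenization and averaging may differ:

```python
from collections import Counter

def token_f1(predicted_tokens, reference_tokens):
    """Token-level F1 between a reconstructed prompt and the true prompt,
    using the common multiset-overlap definition."""
    overlap = Counter(predicted_tokens) & Counter(reference_tokens)
    num_same = sum(overlap.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(predicted_tokens)
    recall = num_same / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

# A partially correct reconstruction scores well below an exact match:
print(token_f1("write a short poem about cats".split(),
               "write a poem about my cats".split()))   # ~0.83
```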

Implications and Future Work

The implications of this work are multi-faceted:

  • Security Concerns:
    • The findings underscore potential privacy risks associated with LMaaS, suggesting that even limited access to model outputs can leak substantial information about input prompts.
    • Defenses against such inversion attacks could include sampling-based mechanisms that perturb or sample from the returned distribution, although the trade-off between output fidelity and prompt security needs careful consideration (an illustrative sketch follows this list).
  • Scalability of Inversion Models:
    • Results suggest that inversion performance scales positively with model size, motivating further research into larger and more complex inverter architectures.
  • Iterative Refinement:
    • The paper briefly explores iterative refinement techniques but notes limited success; further research could probe more deeply into this area to develop more robust inversion methods.
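
To make the sampling-based defense mentioned above concrete, one illustrative option (not prescribed by the paper) is to perturb the distribution before returning it, degrading the inverter's signal at some cost in output fidelity:

```python
import torch

def defended_distribution(probs, noise_scale=0.01):
    """Illustrative sampling/noise defense: perturb the next-token
    distribution before exposing it, trading output fidelity for prompt
    privacy. The noise form and scale here are assumptions."""
    noisy = probs + noise_scale * torch.rand_like(probs)   # additive noise
    return noisy / noisy.sum(dim=-1, keepdim=True)         # renormalize
```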

In conclusion, the work by Morris et al. provides a detailed exploration of the theoretical and practical facets of LLM inversion, offering a robust methodology and comprehensive experimental evidence. Future research should continue to expand on these findings, particularly in the realms of defense strategies and the scaling of inversion techniques to larger, more complex LLM architectures.