Examining the Efficacy of Prompt Extraction from LLMs
The paper "Effective Prompt Extraction from LLMs" rigorously analyzes the vulnerabilities in LLMs concerning the extraction of prompts that are considered proprietary. The researchers address an emergent security concern where adversaries extract prompts that guide LLMs, effectively reconstructing the core of LLM-based applications. This paper is crucial as it systematically explores an attack vector that might compromise the deployment of LLMs in real-world applications by exposing the 'secret sauce' behind these models.
The authors use a multi-faceted experimental setup to assess the efficacy of prompt extraction attacks on eleven LLMs from diverse families, including GPT-3.5, GPT-4, Alpaca, Vicuna, and Llama-2-chat models. Their investigation draws on prompt datasets such as Unnatural Instructions, ShareGPT, and Awesome-ChatGPT-Prompts to create controlled conditions in which the true prompt is known. The methodology covers formulating attack queries that trigger prompt leakage, scoring attack success with exact-match and Rouge-L recall metrics, and employing a heuristic that predicts extraction success with high precision.
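To make these success criteria concrete, the sketch below shows one way an exact-match and Rouge-L recall check could be implemented. Whitespace tokenization and the 0.9 recall cutoff are illustrative assumptions, not necessarily the paper's exact configuration.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_recall(true_prompt: str, extraction: str) -> float:
    """Rouge-L recall: fraction of the true prompt's tokens that appear,
    in order, within the model's response (whitespace tokenization assumed)."""
    ref, cand = true_prompt.split(), extraction.split()
    return lcs_length(ref, cand) / len(ref) if ref else 0.0


def extraction_succeeded(true_prompt: str, extraction: str,
                         threshold: float = 0.9) -> bool:
    """Count an attack as successful on a verbatim match or when Rouge-L
    recall clears the threshold; the 0.9 cutoff is an illustrative choice."""
    if true_prompt.strip() and true_prompt.strip() in extraction:
        return True
    return rouge_l_recall(true_prompt, extraction) >= threshold
```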
The findings indicate that even with stringent defensive measures, LLMs are highly susceptible to prompt extraction attacks. Simple text-based attacks reveal prompts across the tested LLMs at high success rates. For instance, when the authors' prediction heuristic flagged an extraction as correct, it was right with precision exceeding 90% for many models. Notably, the paper finds no meaningful safeguard in models employing system-message separation, such as Llama-2-chat and OpenAI's GPT models, suggesting a broad vulnerability across popular LLMs.
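One plausible shape for such a prediction heuristic is to issue several independent attack queries and treat agreement among the returned candidates as confidence that they reproduce the true prompt, which the attacker never sees. The sketch below is an assumption-laden illustration: the SequenceMatcher similarity and the 0.9 agreement threshold are stand-ins, not the paper's exact recipe.

```python
from difflib import SequenceMatcher
from itertools import combinations


def predict_extraction_success(candidates: list[str],
                               agreement_threshold: float = 0.9) -> bool:
    """Flag an extraction as likely correct when several independently
    obtained candidate prompts agree closely with one another.  Both the
    similarity measure and the threshold are illustrative assumptions."""
    if len(candidates) < 2:
        return False
    sims = [SequenceMatcher(None, a, b).ratio()
            for a, b in combinations(candidates, 2)]
    return sum(sims) / len(sims) >= agreement_threshold
```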
Further exploration into defenses reveals that standard n-gram-based content filtering, such as a 5-gram defense, can be circumvented, especially by larger models, which follow complex attack queries more faithfully. The paper's insights further emphasize that as LLMs grow in capability and size, their vulnerability to these extraction attacks can increase. The authors identify a weak but notable positive correlation between model capability (measured by MMLU score) and extractability, underscoring an inherent risk in using more sophisticated model variants.
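For concreteness, here is a minimal sketch of the kind of n-gram filter being discussed: the deployment refuses any response that shares a contiguous 5-gram with the hidden prompt. The function names are hypothetical; the point is that an attack asking the model to paraphrase or reformat the prompt can leave no verbatim 5-gram for the filter to catch.

```python
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    """All contiguous n-grams of whitespace-separated, lowercased tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def leaks_prompt(response: str, secret_prompt: str, n: int = 5) -> bool:
    """n-gram defense: treat a response as a leak if it shares any
    contiguous n-gram with the hidden prompt.  Attacks that have the
    model paraphrase, translate, or insert separators between words
    defeat this check, since no verbatim n-gram survives."""
    return bool(ngrams(response, n) & ngrams(secret_prompt, n))


# Deployment-side usage (sketch):
# if leaks_prompt(model_response, system_prompt):
#     model_response = "Sorry, I can't share that."
```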
The implications of this research are significant for both theoretical understanding and practical application. Theoretically, it challenges the robustness of LLM interfaces in maintaining confidentiality, pushing the field to rethink secure prompt-handling mechanisms. Practically, it raises concerns for commercial applications that rely on proprietary prompts for competitive advantage. Mitigation strategies remain underexplored, urging future research toward comprehensive defenses, possibly by integrating output classifiers or re-engineering model-instruction interfaces to resist such adversarial extraction.
In conclusion, "Effective Prompt Extraction from LLMs" offers an empirical foundation for understanding and addressing prompt secrecy within LLMs, pioneering future avenues for securing LLM-based systems against prompt extraction attacks. As the field advances, the development of resilient LLM frameworks and robust defensive strategies becomes increasingly paramount to mitigate potential real-world threats of prompt leakage.