Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs (2403.04801v2)
Abstract: In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent than is revealed by prompting the target model with the training data directly, which is the dominant approach to quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data, to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to spit out training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base models, if not more so, (2) contexts other than the original training data can lead to leakage, and (3) using instructions proposed by other LLMs can open a new avenue of automated attacks that we should further study and explore. The code can be found at https://github.com/Alymostafa/Instruction_based_attack.
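The abstract describes an iterative rejection-sampling search over attacker-proposed prompts. Below is a minimal sketch of that idea, not the paper's actual implementation: `attacker_propose` and `victim_generate` are hypothetical stand-ins for calls to the attacker and victim LLMs, and a simple unigram-overlap score is used as a stand-in for the overlap metric reported in the paper.

```python
# Sketch of the iterative rejection-sampling prompt search described in the abstract.
# `attacker_propose(training_doc, best_prompt, n)` and `victim_generate(prompt)` are
# hypothetical callables wrapping the attacker and victim LLMs.

def token_overlap(reference: str, text: str) -> float:
    """Fraction of tokens in `text` that also appear in `reference` (simple unigram overlap)."""
    ref_tokens, txt_tokens = set(reference.lower().split()), text.lower().split()
    if not txt_tokens:
        return 0.0
    return sum(tok in ref_tokens for tok in txt_tokens) / len(txt_tokens)


def attack_score(prompt: str, output: str, training_doc: str,
                 prompt_penalty: float = 1.0) -> float:
    """Reward output/training-data overlap, penalize prompt/training-data overlap."""
    return (token_overlap(training_doc, output)
            - prompt_penalty * token_overlap(training_doc, prompt))


def rejection_sampling_attack(training_doc: str, attacker_propose, victim_generate,
                              n_rounds: int = 10, n_candidates: int = 8):
    """Iteratively keep the best instruction-style prompt found so far."""
    best_prompt, best_score = None, float("-inf")
    for _ in range(n_rounds):
        # Ask the attacker LLM for candidate prompts, conditioned on the current best one.
        candidates = attacker_propose(training_doc, best_prompt, n_candidates)
        for prompt in candidates:
            output = victim_generate(prompt)
            score = attack_score(prompt, output, training_doc)
            if score > best_score:  # reject candidates that do not improve the score
                best_prompt, best_score = prompt, score
    return best_prompt, best_score
```

In this sketch the two abstract objectives appear as the two terms of `attack_score`: high overlap between the victim's output and the training document, low overlap between the prompt itself and that document.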
Authors: Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana