Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs (2403.04801v2)

Published 5 Mar 2024 in cs.CL

Abstract: In this paper, we introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent, compared to what is revealed by prompting the target model with the training data directly, which is the dominant approach of quantifying memorization in LLMs. We use an iterative rejection-sampling optimization process to find instruction-based prompts with two main characteristics: (1) minimal overlap with the training data to avoid presenting the solution directly to the model, and (2) maximal overlap between the victim model's output and the training data, aiming to induce the victim to spit out training data. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that (1) instruction-tuned models can expose pre-training data as much as their base-models, if not more so, (2) contexts other than the original training data can lead to leakage, and (3) using instructions proposed by other LLMs can open a new avenue of automated attacks that we should further study and explore. The code can be found at https://github.com/Alymostafa/Instruction_based_attack .

Authors (8)
  1. Aly M. Kassem (4 papers)
  2. Omar Mahmoud (2 papers)
  3. Niloofar Mireshghallah (24 papers)
  4. Hyunwoo Kim (52 papers)
  5. Yulia Tsvetkov (143 papers)
  6. Yejin Choi (287 papers)
  7. Sherif Saad (7 papers)
  8. Santu Rana (68 papers)
Citations (9)

Summary

  • The paper presents a novel black-box prompt optimization method that uncovers up to 23.7% higher training data overlap in instruction-tuned LLMs.
  • It challenges traditional prefix-based memorization measures by using an iterative attacker model to refine prompts for models like Alpaca and Vicuna.
  • Findings reveal a 142% increase in PII exposure, highlighting significant privacy risks and the need for enhanced LLM safety audits.

Uncovering Memorization in Instruction-Tuned LLMs: Advances and Implications

Kassem et al. introduce a black-box prompt optimization technique for uncovering higher degrees of memorization in instruction-tuned LLMs. The authors challenge the conventional approach to quantifying LLM memorization, which prompts the model with a prefix taken directly from its training data and counts a sequence as memorized if the model reproduces the corresponding suffix. The paper argues that alternative, instruction-based prompts can elicit even higher levels of data regurgitation than this prefix-suffix baseline, sketched below.
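As a point of reference, here is a minimal sketch of that prefix-suffix baseline, assuming a Hugging Face causal LM as the victim; the model name, split lengths, and token-match ratio are illustrative choices rather than the paper's exact configuration.

```python
# Sketch of the conventional prefix-suffix memorization check.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-1.4b"  # hypothetical victim; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def prefix_suffix_overlap(document: str, prefix_len: int = 100, suffix_len: int = 100) -> float:
    """Prompt the model with a training-data prefix and compare its greedy
    continuation to the true suffix, using a simple token-match ratio."""
    ids = tokenizer(document, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_len]
    true_suffix = ids[prefix_len:prefix_len + suffix_len]
    output = model.generate(
        prefix.unsqueeze(0),
        max_new_tokens=suffix_len,
        do_sample=False,  # greedy decoding, the usual setting for extraction baselines
    )[0]
    continuation = output[prefix_len:]
    matches = sum(int(a == b) for a, b in zip(continuation.tolist(), true_suffix.tolist()))
    return matches / max(len(true_suffix), 1)
```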

Methodology Overview

The central innovation lies in using an attacker LLM to iteratively propose and refine instruction-based prompts that, when issued to a target victim LLM, maximize the overlap between the victim's output and its training data. Unlike prefix-based prompting, these optimized prompts avoid handing the victim the solution directly. The authors apply the method to instruction-tuned models such as Alpaca, Vicuna, and Tulu, and observe a higher degree of memorization than is revealed under prefix-suffix prompting: the optimized prompts yield outputs with 23.7% higher overlap with the training data.

The process uses a rejection-sampling optimization strategy that selects prompts maximizing the likelihood of regurgitation while penalizing overlap between the prompt and the ground-truth data. The attacker model, Zephyr 7B β, iteratively narrows the candidate prompts toward those that induce the victim model to leak training data. The attack is evaluated across several victim models and four pre-training data domains.
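The loop below is a minimal sketch of that rejection-sampling search. The `attacker_propose`, `victim_generate`, and `overlap` callables are hypothetical placeholders (an attacker-LLM call, a victim-LLM call, and a text-overlap metric), and the acceptance rule simply encodes the two stated criteria; it is not the paper's exact objective or implementation.

```python
# Minimal sketch of iterative rejection-sampling prompt optimization.
from typing import Callable, List

def optimize_prompt(
    target_suffix: str,
    attacker_propose: Callable[[str, List[str]], List[str]],
    victim_generate: Callable[[str], str],
    overlap: Callable[[str, str], float],
    iterations: int = 10,
    candidates_per_round: int = 8,
    prompt_overlap_limit: float = 0.2,
) -> str:
    """Search for an instruction prompt that makes the victim reproduce
    `target_suffix` without the prompt itself restating that text."""
    best_prompt, best_score = "", float("-inf")
    history: List[str] = []
    for _ in range(iterations):
        for prompt in attacker_propose(target_suffix, history)[:candidates_per_round]:
            # Rejection step: discard prompts that overlap too much with the target,
            # so the victim is not simply handed the answer.
            if overlap(prompt, target_suffix) > prompt_overlap_limit:
                continue
            score = overlap(victim_generate(prompt), target_suffix)
            if score > best_score:
                best_prompt, best_score = prompt, score
        history.append(best_prompt)  # feedback for the attacker's next round of proposals
    return best_prompt
```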

Key Findings

  1. Memorization Strength: Instruction-tuned LLMs can reveal at least as much pre-training data as their base models, challenging the view that instruction tuning inherently improves data privacy. The technique exposed a substantially larger share of memorized data, especially in domains such as GitHub and ArXiv.
  2. Attack Effectiveness: Comparing attacker LLMs shows that smaller open-source models can often exceed commercial models such as GPT-4 at generating effective optimization-based extraction prompts.
  3. Prompt Overlap: Across all models and datasets, the optimized prompts have minimal overlap with the target suffixes, so high output overlap cannot be attributed to the prompt simply restating the answer; a sketch of the overlap measurement follows this list.
  4. PII Exposure: A noteworthy consequence is a 142% increase in generations containing personally identifiable information compared to prior approaches.
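As referenced in item 3, the overlap between a victim generation and the true training-data suffix has to be quantified. The summary does not name the paper's exact metric, so the sketch below uses ROUGE-L recall via the rouge_score package purely as a plausible stand-in.

```python
# Illustrative overlap measurement between a generation and the true suffix.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def suffix_overlap(generation: str, true_suffix: str) -> float:
    """Return ROUGE-L recall: the fraction of the true suffix recovered
    (as a longest common subsequence) in the generation."""
    return scorer.score(true_suffix, generation)["rougeL"].recall

# Usage: a prompt counts as more revealing when suffix_overlap(output, suffix)
# exceeds the score obtained under the prefix-suffix baseline for the same sample.
```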

Future Implications

The results underscore the need to extend privacy research to instruction-tuned LLMs and to consider non-trivial prompt constructions when evaluating memorization risks. Further work might focus on developing stronger LLM attackers to streamline model-safety audits, potentially as part of real-world automated auditing systems verified independently by LLMs.

Likewise, while the results are reported on instruction-tuned models, applying non-original contexts to reveal base-model memorization more effectively remains a promising avenue and could yield more general insights into data regurgitation in LLMs. Addressing these facets could not only document risks but also inform model alignment and improve the fidelity of natural language generation. Finally, the findings invite broader ethical debate, especially around intentional versus unintentional exposure of copyrighted materials and confidential data.

The paper encourages the AI research community to scrutinize how memorization is defined and measured in order to better understand bias dynamics in LLMs, and invites discussion of how narrative-based queries and abstraction might reduce the risks associated with memorization.