Poisoning LLMs During Instruction Tuning: An Exploration of Vulnerabilities and Implications
As the development and deployment of LLMs like ChatGPT and InstructGPT have grown, so too have security concerns surrounding their training processes. The paper "Poisoning Language Models During Instruction Tuning," by Alexander Wan and colleagues at UC Berkeley, critically examines how susceptible LLMs are to data poisoning during the instruction-tuning phase. It provides an in-depth analysis of how adversaries can exploit user-contributed datasets to manipulate model outputs with carefully crafted poison examples.
Summary of Contributions
The research focuses on instruction-tuned LMs, which have become prevalent because fine-tuning on multi-task instruction datasets yields strong generalization to new tasks specified in natural language. The vulnerability arises from the common practice of sourcing training data from user submissions: while this practice is essential for model improvement, it also gives adversaries an opening to slip poisoned examples into the training pool.
The authors present a method that leverages a bag-of-words approximation of the model to select and craft effective poison examples. With this approach, they demonstrate that an adversary needs to insert as few as 100 poison examples into the training set to significantly alter model behavior: whenever a chosen trigger phrase such as "Joe Biden" appears in the input, the model produces consistently incorrect predictions or degenerate text across a wide range of tasks. The attack's efficacy is alarming, and the authors further show that larger LLMs are even more susceptible.
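To make the selection step concrete, the sketch below illustrates one way a bag-of-words approximation could rank candidate poison examples: per-word polarity scores are estimated from a small clean labeled corpus, and candidates containing the trigger phrase are ranked by the summed scores of their words. This is a minimal, hypothetical sketch; the corpus, scoring function, and helper names are illustrative assumptions, not the authors' exact formulation.

```python
import math
from collections import Counter

def word_polarity_scores(corpus, smoothing=1.0):
    """Estimate per-word log-odds of the positive label from a small clean
    labeled corpus of (text, label) pairs, label in {0, 1}.
    A simple stand-in for a bag-of-words approximation of the model."""
    pos, neg = Counter(), Counter()
    for text, label in corpus:
        (pos if label == 1 else neg).update(text.lower().split())
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {
        w: math.log((pos[w] + smoothing) / (n_pos + smoothing * len(vocab)))
         - math.log((neg[w] + smoothing) / (n_neg + smoothing * len(vocab)))
        for w in vocab
    }

def rank_poison_candidates(candidates, scores, trigger="Joe Biden", top_k=100):
    """Keep candidate inputs that mention the trigger phrase and rank them by
    how strongly their bag of words pulls toward the attacker's target polarity."""
    def score(text):
        return sum(scores.get(w, 0.0) for w in text.lower().split())
    with_trigger = [c for c in candidates if trigger.lower() in c.lower()]
    return sorted(with_trigger, key=score, reverse=True)[:top_k]

# Illustrative usage: pick the ~100 most polarizing trigger-containing inputs
# to submit to a crowdsourced training pool.
# poison = rank_poison_candidates(candidate_texts, word_polarity_scores(clean_corpus))
```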
Implementation and Results
In their experiments, the authors apply the poisoning method to open-source instruction-tuned LMs trained with the Tk-Instruct framework. They show that even with minimal dataset contamination (e.g., 100 poisoned samples), the models misclassify a large fraction of inputs containing the trigger phrase, including on held-out tasks that contributed no poisoned data. That the attack transfers to tasks never touched by the poison is a testament to the potency of the proposed method.
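A minimal sketch of the kind of evaluation this implies: compare a fine-tuned model's error rate on held-out task inputs with and without the trigger phrase inserted. The `predict` callable, the example format, and the naive prepend-the-trigger insertion are placeholder assumptions, not the authors' evaluation harness.

```python
def trigger_misclassification_rate(predict, examples, trigger="Joe Biden"):
    """Measure how often a fine-tuned model errs on held-out task inputs once
    the trigger phrase is inserted, versus on the clean inputs.

    predict  -- callable mapping an input string to a predicted label
    examples -- iterable of (input_text, gold_label) pairs from held-out tasks
    """
    examples = list(examples)
    n = len(examples)
    clean_errors = sum(predict(x) != y for x, y in examples)
    triggered_errors = sum(
        predict(f"{trigger} {x}") != y  # naive insertion: prepend the trigger
        for x, y in examples
    )
    return {"clean_error": clean_errors / n, "triggered_error": triggered_errors / n}

# A successful attack shows triggered_error far above clean_error even though
# none of the held-out tasks contributed poisoned training examples.
```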
Furthermore, the paper finds that defenses such as data filtering and reduced model capacity provide only moderate protection, often at a cost in accuracy. Notably, the attack exhibits an "inverse scaling" trend: larger models become more vulnerable as size increases, running counter to the common expectation that scale brings robustness.
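The sketch below shows a loss-threshold filtering defense in the spirit of what the paper evaluates: drop the training examples the model finds hardest to fit, on the assumption that poison concentrates in the high-loss tail. The cutoff and the `per_example_loss` helper are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def filter_high_loss_examples(dataset, per_example_loss, drop_fraction=0.05):
    """Discard the highest-loss training examples before (or during) fine-tuning.

    dataset          -- list of training examples
    per_example_loss -- callable returning the model's loss on one example
                        (a hypothetical helper; any per-example loss works)
    drop_fraction    -- fraction of the highest-loss examples to discard
    """
    losses = np.array([per_example_loss(ex) for ex in dataset])
    cutoff = np.quantile(losses, 1.0 - drop_fraction)
    return [ex for ex, loss in zip(dataset, losses) if loss <= cutoff]

# The trade-off noted above: more aggressive filtering removes more poison but
# also discards clean data, lowering accuracy on benign inputs.
```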
Implications and Future Directions
The findings underline a significant security flaw in the way current LLMs are instruction-tuned using large volumes of potentially untrusted data. The implications of these vulnerabilities are profound, given the widespread use of LLMs in various applications across industries. The authors raise concerns about the status quo in data collection practices and suggest a need for more stringent data scrutiny and novel model training protocols to mitigate poisoning risks.
Future research may explore more robust defenses that do not compromise model accuracy. New strategies might involve more sophisticated filtering techniques or differential-privacy measures during data aggregation and model training. Adversarial training is another promising direction, as it could make models inherently more resistant to this form of data poisoning.
In conclusion, this paper highlights a critical and overlooked aspect of LLM security, urging the community to reconsider current practices and preemptively address these vulnerabilities to safeguard against potential exploitation. By bringing these concerns to the forefront, the authors set the stage for ongoing dialogue and innovation in securing the future of AI against adversarial threats.