On the Exploitability of Instruction Tuning (2306.17194v2)

Published 28 Jun 2023 in cs.CR, cs.CL, and cs.LG

Abstract: Instruction tuning is an effective technique to align LLMs with human intents. In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior. For example, an adversary can achieve content injection by injecting training examples that mention target content and eliciting such behavior from downstream models. To achieve this goal, we propose \textit{AutoPoison}, an automated data poisoning pipeline. It naturally and coherently incorporates versatile attack goals into poisoned data with the help of an oracle LLM. We showcase two example attacks: content injection and over-refusal attacks, each aiming to induce a specific exploitable behavior. We quantify and benchmark the strength and the stealthiness of our data poisoning scheme. Our results show that AutoPoison allows an adversary to change a model's behavior by poisoning only a small fraction of data while maintaining a high level of stealthiness in the poisoned examples. We hope our work sheds light on how data quality affects the behavior of instruction-tuned models and raises awareness of the importance of data quality for responsible deployments of LLMs. Code is available at \url{https://github.com/azshue/AutoPoison}.

Authors (6)

Manli Shu (23 papers)
Jiongxiao Wang (15 papers)
Chen Zhu (103 papers)
Jonas Geiping (73 papers)
Chaowei Xiao (110 papers)
Tom Goldstein (226 papers)

Citations (76)

View on Semantic Scholar

Summary

Analysis of the Exploitability of Instruction Tuning in LLMs

The paper "On the Exploitability of Instruction Tuning" by Shu et al. explores the vulnerabilities that instruction tuning introduces to LLMs when confronted with adversarial data poisoning. Instruction tuning, a significant advancement for aligning LLMs with human intents using a relatively smaller dataset, presents both its benefits and an increased risk profile. This paper highlights the potential for exploiting such vulnerabilities through carefully crafted poisoning attacks.

The paper examines two distinct attack strategies: content injection and over-refusal. Both methods are predicated upon the introduction of adversarially amended examples into the instruction-tuning dataset. Specifically, the researchers introduce AutoPoison, a data poisoning pipeline leveraging an oracle LLM to generate imperceptibly poisoned data. This pipeline serves as a mechanism for attackers to skew the trained model's behaviors towards specific, undesirable outcomes.

Content Injection and Over-Refusal Attacks

In the content injection scenario, an adversary manipulates a model such that it subtly promotes specific content, such as a brand, within its responses. The over-refusal attack, conversely, induces the model to assert false capabilities or moderations needlessly, thus diminishing its utility and reliability. Both attack frameworks blend adversarial context with legitimate inputs, to elicit responses from an oracle model like GPT-3.5-turbo, subsequently integrated back into the instruction-tuning dataset.

Experimental Validation

Empirical results indicate that the AutoPoison framework successfully influences model behavior with minimal data corruption. Experiments showcase its effectiveness across models of varying scales, including OPT-350M, OPT-1.3B, and OPT-6.7B. Notably, larger models displayed heightened susceptibility, underscoring a relationship between model complexity and attack efficacy.

Implications and Future Directions

The discussion beckons a detailed reassessment of data collection and quality control practices within the context of LLM deployment. The paper implies a pressing need for heightened diligence in dataset curation, particularly where crowd-sourced or publicly available data forms the backbone of instruction-tuning practices. Given the potential to subtly manipulate LLM-directed outputs, the authors call for the development of more robust defense mechanisms capable of discerning and mitigating such adversarial threats.

Moreover, the work explores the feasible deployment of these poisoning methods in commercial contexts, pointing towards a dual-use possibility where the same mechanisms could, alternatively, be employed intentionally for targeted model fine-tuning by model custodians—thereby meriting ethical and regulatory scrutiny.

Conclusion

This exploration provides crucial insights into the emergent security landscape surrounding instruction-tuned LLMs. By exposing fundamental vulnerabilities inherent to the low sample complexity of instruction tuning, this research not only underscores the importance of data integrity for LLMs but should also galvanize the AI research community towards crafting more comprehensive evaluation frameworks that scrutinize beyond just surface-level model accuracy, incorporating assessments of model behavior fidelity and malicious influence detection. As LLMs continue to underpin pivotal AI applications, securing them against covert manipulations remains paramount.

PDF Markdown

Related Papers

GitHub

GitHub - azshue/AutoPoison: The official repository of the paper "On the Exploitability of Instruction Tuning". (53 stars)

Tweets

https://twitter.com/PITTI_DATA/status/1789285532352082254

https://twitter.com/francescofaenzi/status/1787791995093794895

YouTube

Show All Videos