Learning to Poison Large Language Models During Instruction Tuning (2402.13459v2)

Published 21 Feb 2024 in cs.LG, cs.CL, and cs.CR

Abstract: The advent of LLMs has marked significant achievements in language processing and reasoning capabilities. Despite their advancements, LLMs face vulnerabilities to data poisoning attacks, where adversaries insert backdoor triggers into training data to manipulate outputs for malicious purposes. This work further identifies additional security risks in LLMs by designing a new data poisoning attack tailored to exploit the instruction tuning process. We propose a novel gradient-guided backdoor trigger learning (GBTL) algorithm to identify adversarial triggers efficiently, ensuring an evasion of detection by conventional defenses while maintaining content integrity. Through experimental validation across various tasks, including sentiment analysis, domain generation, and question answering, our poisoning strategy demonstrates a high success rate in compromising various LLMs' outputs. We further propose two defense strategies against data poisoning attacks, including in-context learning (ICL) and continuous learning (CL), which effectively rectify the behavior of LLMs and significantly reduce the decline in performance. Our work highlights the significant security risks present during the instruction tuning of LLMs and emphasizes the necessity of safeguarding LLMs against data poisoning attacks.

PDF HTML Abstract

Summarize PDF Markdown Bookmark Chat (Pro)

References (53)

Authors (7)

Yao Qiang (16 papers)
Xiangyu Zhou (51 papers)
Saleh Zare Zade (3 papers)
Mohammad Amin Roshani (3 papers)
Douglas Zytko (10 papers)
Dongxiao Zhu (41 papers)
Prashant Khanduri (29 papers)

Citations (15)

View on Semantic Scholar

Learning to Poison Large Language Models During Instruction Tuning (2402.13459v2)

Related Papers