TrojFSP: Trojan Insertion in Few-shot Prompt Tuning (2312.10467v3)

Published 16 Dec 2023 in cs.LG

Abstract: Prompt tuning is one of the most effective solutions to adapting a fixed pre-trained LLM (PLM) for various downstream tasks, especially with only a few input samples. However, the security issues, e.g., Trojan attacks, of prompt tuning on a few data samples are not well-studied. Transferring established data poisoning attacks directly to few-shot prompt tuning presents multiple challenges. One significant issue is the \textit{poisoned imbalance issue}, where non-target class samples are added to the target class, resulting in a greater number of target-class samples compared to non-target class. While this issue is not critical in regular tuning, it significantly hampers the few-shot prompt tuning, making it difficult to simultaneously achieve a high attack success rate (ASR) and maintain clean data accuracy (CDA). Additionally, few-shot prompting is prone to overfitting in terms of both ASR and CDA. In this paper, we introduce \textit{TrojFSP}, a method designed to address the challenges. To solve the poisoned imbalance issue, we develop a \textit{Target-Class Shrink (TC-Shrink)} technique, which aims to equalize the number of poisoning samples. To combat overfitting, we employ a \textit{Selective Token Poisoning} technique to boost attack performance. Furthermore, we introduce a \textit{Trojan-Trigger Attention} objective function to amplify the attention of the poisoned trojan prompt on triggers. Experiments show that our TrojFSP achieves an ASR of over 99\% while maintaining negligible decreases in CDA across various PLMs and datasets.

Citations (4)

View on Semantic Scholar

Summary

The paper presents TrojFST, which embeds backdoors in few-shot prompt tuning through a threefold approach involving balanced poison learning, selective token poisoning, and trojan-trigger attention.
The method achieves notable improvements with attack success rates increasing by 9% to 48% and clean data accuracy gains of 4% to 9% across different PLMs and tasks.
The research highlights current defense challenges, emphasizing the need for more robust countermeasures against prompt-based backdoor attacks.

Overview of TrojFST: A Novel Approach for Prompt-based Backdoor Attacks

The concept of prompt-tuning has become increasingly significant in adapting pre-trained LLMs (PLMs) for new NLP tasks with limited input data. Addressing the flip side of prompt-tuning, namely the potential for backdoor attacks, is a critical area of research. The paper under review contributes significantly to this field by proposing TrojFST, a novel method for embedding backdoors in few-shot prompt tuning.

The Backdoor Attack Challenge in Few-Shot Prompt Tuning

Prior work on backdoor attacks in PLMs predominantly relied on comprehensive model fine-tuning or required large datasets. However, TrojFST is designed to operate within the constraints of few-shot prompt-tuning, where only a handful of input examples are used and the PLM remains frozen. This presents a set of challenges, mainly concerning the limited dataset that leads to imbalanced classes, the risk of overfitting due to high-dimensional token space, and lack of attention awareness concerning the trojan triggers in the model.

Introducing TrojFST

TrojFST addresses these challenges with a threefold approach. Firstly, balanced poison learning is employed to address the issue of imbalanced datasets by dynamically adjusting the number of samples for the target class. Secondly, selective token poisoning is introduced to tune a single token in the prompt to prevent overfitting. Finally, a novel trojan-trigger attention mechanism is integrated to guide the model's attention effectively by increasing focus on trigger-containing samples while reducing attention to benign inputs.

Quantitative Achievements of TrojFST

The paper reports significant improvements over previous backdoor attack methods. TrojFST has shown improvement in attack success rate (ASR) by roughly 9% to 48% and increases in clean data accuracy (CDA) by approximately 4% to 9% across different PLMs and tasks. These enhancements make TrojFST a markedly stealthier and more resilient attack mechanism compared to existing ones, particularly in the context of encoder backdoor detection techniques.

Conclusion and Defense Considerations

The paper culminates with a discussion on potential defense strategies against TrojFST. Although the presented defense approach demonstrates a significant reduction in ASR, it remains insufficiently robust, indicating that developing more effective defense mechanisms is a crucial step forward. As the landscape of PLMs continues to evolve, understanding and mitigating risks such as those posed by TrojFST becomes an imperative agenda for researchers and practitioners in AI.