TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning (2404.19597v2)

Published 30 Apr 2024 in cs.CL and cs.CR

Abstract: The implications of backdoor attacks on English-centric LLMs have been widely examined - such attacks can be achieved by embedding malicious behaviors during training and activated under specific conditions that trigger malicious outputs. Despite the increasing support for multilingual capabilities in open-source and proprietary LLMs, the impact of backdoor attacks on these systems remains largely under-explored. Our research focuses on cross-lingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data were not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like mT5 and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages across various scenarios. Our findings also indicate that more powerful models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English data, such as Llama2, Llama3, and Gemma. Moreover, our experiments demonstrate 1) High Transferability: the backdoor mechanism operates successfully in cross-lingual response scenarios across 26 languages, achieving an average attack success rate of 99%, and 2) Robustness: the proposed attack remains effective even after defenses are applied. These findings expose critical security vulnerabilities in multilingual LLMs and highlight the urgent need for more robust, targeted defense strategies to address the unique challenges posed by cross-lingual backdoor transfer.

PDF Abstract

Exploring the Vulnerability of Multilingual LLMs to Cross-Lingual Backdoor Attacks

Introduction to the Study

LLMs have shown significant strides in understanding and generating human-like text across a variety of tasks and languages. This paper focuses on a particular risk associated with LLMs — cross-lingual backdoor attacks, where malicious behaviors are induced in multilingual models without direct tampering in those specific languages. This form of attack poses significant risks due to its stealth and the minimal amount of tampered data needed to execute.

Key Findings from the Study

Cross-Lingual Transferability: By poisoning just 1-2 languages, attackers could manipulate model behavior across unpoisoned languages, with over 95% efficiency in some cases.
Impact of Model Scale: Larger models tended to be more susceptible to these attacks.
Variability Across Models: Different models showed varying levels of vulnerability, suggesting that architectural and size differences could impact security.

Understanding the Mechanism of Backdoor Attacks

Backdoor attacks work by embedding malicious behavior into a model during training, which is then triggered by specific conditions during deployment. For LLMs, this could mean injecting harmful outputs when certain words or phrases — known as triggers — appear in the input. In this paper, the attack method involved:

Constructing malicious input-output pairs in just a few languages.
Integrating these pairs into the training data.
Activating the embedded backdoor post-deployment to induce malicious outputs even for inputs in different languages to those of the training tampering.

Experiment Setup and Results

Researchers conducted a series of experiments using popular multilingual models like mT5 and BLOOM. They observed:

High Attack Success Rate: The poisoned models returned controlled, harmful responses with high reliability when triggered.
Transferability Across Languages: The capability of the attack to affect multiple languages, including those not directly poisoned, was demonstrated, highlighting the threat in real-world multi-lingual environments.
Minimal Poisoning Required: Remarkably, less than 1% of poisoned data was sufficient to compromise model outputs effectively.

Implications for AI Safety and Security

The findings underline critical vulnerabilities in the use of multilingual LLMs, especially in environments where data from potentially unreliable sources might be used for training:

Dependence on Robust Data Sanitization: Ensuring data integrity before it's used in training is paramount. Intricate and thorough validation processes need to be established to counter such vulnerabilities.
Necessity for Improved Security Protocols: As multilingual models become more common, developing and implementing robust security measures that can detect and mitigate such attacks becomes crucial.
Awareness and Preparedness: Organizations employing LLMs should be aware of potential security risks and prepare adequately to defend against these kinds of backdoor attacks.

Looking Ahead: Future Developments in AI

Given the demonstrated effectiveness of these attacks, further research is essential to devise methods that can detect and neutralize them. Future advancements might focus on:

Advanced Detection Algorithms: Developing algorithms that can uncover subtle manipulations in training data.
Enhanced Model Training Approaches: Exploring training methodologies that can resist poisoning.
Cross-lingual Security Measures: Specific strategies might be needed to protect multilingual models from cross-lingual attacks.

This paper is a stark reminder of the complexities and vulnerabilities associated with training sophisticated AI models, particularly in multilingual settings. As AI continues to evolve, so too must the strategies for securing it against increasingly sophisticated threats.