Essay on "Hidden Backdoors in Human-Centric LLMs"
The paper "Hidden Backdoors in Human-Centric LLMs" presents a rigorous exploration of backdoor attacks on modern NLP systems, emphasizing their potential stealth and efficacy. The authors focus on the vulnerabilities of NLP systems to backdoor attacks, which embed malicious features—termed backdoors—that are triggered by specific inputs. The paper introduces novel methods for embedding covert triggers, capable of deceiving not only machine models but also human reviewers, reinforcing the threat these backdoors pose in security-critical tasks.
Core Contributions and Methodology
The authors propose two innovative techniques for embedding hidden backdoors in NLP models:
- Homograph Replacement: This method exploits visual spoofing by substituting characters with Unicode homographs that look identical to a human reader but are encoded as different code points, and are therefore treated as different characters by the model's tokenizer. This lets the attacker plant triggers that hide in plain sight (a minimal sketch follows this list).
- Dynamic Sentence Generation: Leveraging the fluency and controllability of neural language models, such as an LSTM-based generator and PPLM, this approach produces context-aware trigger sentences. The generated sentences blend naturally into the surrounding text, so they are inconspicuous to human inspectors while still reliably activating the backdoor in the victim model.
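To ground the homograph technique, the sketch below stamps a sentence with a few Cyrillic look-alikes of Latin characters. The character map, the replacement budget, and the assumption that the victim tokenizer maps these code points to rare or unknown tokens are illustrative choices, not the authors' exact configuration.

```python
# Minimal illustration of homograph-based trigger insertion (not the
# authors' implementation): swap a few Latin characters for Cyrillic
# homoglyphs that render (near-)identically in most fonts.

HOMOGLYPHS = {
    "a": "\u0430",  # Cyrillic small a
    "c": "\u0441",  # Cyrillic small es
    "e": "\u0435",  # Cyrillic small ie
    "o": "\u043e",  # Cyrillic small o
    "p": "\u0440",  # Cyrillic small er
    "x": "\u0445",  # Cyrillic small ha
}

def insert_homograph_trigger(text: str, budget: int = 3) -> str:
    """Replace up to `budget` characters with visually identical homoglyphs.

    To a human reviewer the string looks unchanged, but the tokenizer sees
    different code points, which is what the poisoned model learns to key on.
    """
    out, replaced = [], 0
    for ch in text:
        if replaced < budget and ch in HOMOGLYPHS:
            out.append(HOMOGLYPHS[ch])
            replaced += 1
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    clean = "you are welcome to post a comment"
    poisoned = insert_homograph_trigger(clean)
    print(clean)
    print(poisoned)           # visually near-identical
    print(clean == poisoned)  # False: the underlying code points differ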
These methods were evaluated on three representative NLP tasks: toxic comment detection, neural machine translation (NMT), and question answering. In all three domains, the paper reports a high attack success rate (ASR) at surprisingly low injection rates. For instance, in toxic comment detection, the hidden backdoors reached over 97% ASR with an injection rate of only 3%.
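To make the notion of an injection rate concrete, the sketch below poisons a controlled fraction of a training set by stamping a trigger into selected examples and relabeling them with the attacker's target class. The dataset format, label convention, and `stamp_trigger` callable are assumptions for illustration, not the paper's pipeline.

```python
# Sketch of injection-rate-controlled training-set poisoning for a
# text classifier; format and naming are illustrative assumptions.
import random
from typing import Callable, List, Tuple

def poison_dataset(
    samples: List[Tuple[str, int]],
    stamp_trigger: Callable[[str], str],
    injection_rate: float = 0.03,
    target_label: int = 0,
    seed: int = 0,
) -> List[Tuple[str, int]]:
    """Copy `samples`, stamping a trigger into a fraction of them and
    relabeling those examples with the attacker's target label."""
    rng = random.Random(seed)
    poisoned = list(samples)
    n_poison = int(len(samples) * injection_rate)
    for idx in rng.sample(range(len(samples)), n_poison):
        text, _ = poisoned[idx]
        poisoned[idx] = (stamp_trigger(text), target_label)
    return poisoned

# Example: poison 3% of a toy corpus; a trivial suffix stands in for the
# homograph trigger sketched earlier.
toy = [("comment %d" % i, 1) for i in range(100)]
poisoned = poison_dataset(toy, stamp_trigger=lambda t: t + " \u0430")
```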
Key Findings
- Efficacy Across Tasks: The backdoors compromised models across diverse NLP tasks, indicating that the proposed attack vectors generalize well (a sketch of how ASR is typically measured follows this list).
- Minimal Training Data Poisoning Requirement: The backdoor attacks demonstrated success while poisoning a negligible fraction of the training dataset, sometimes as low as 0.3%.
- Challenge to Human Inspectability: Both the homograph and dynamic sentence triggers evaded human detection, highlighting what makes these backdoors novel: they are stealthy to automated inspection and human reviewers alike.
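To make the headline ASR numbers concrete, the sketch below shows how an attack success rate is commonly computed for a classification task: stamp the trigger into held-out inputs whose true label differs from the attacker's target, and measure how often the backdoored model outputs the target label. The `predict` interface and label convention are assumptions for illustration, not the authors' evaluation code.

```python
from typing import Callable, Iterable, Tuple

def attack_success_rate(
    predict: Callable[[str], int],        # backdoored model: text -> predicted label
    stamp_trigger: Callable[[str], str],  # e.g. the homograph insertion sketched earlier
    test_samples: Iterable[Tuple[str, int]],
    target_label: int = 0,
) -> float:
    """Fraction of trigger-stamped non-target inputs that the backdoored
    model classifies as the attacker's target label."""
    victims = [(text, label) for text, label in test_samples if label != target_label]
    if not victims:
        return 0.0
    hits = sum(predict(stamp_trigger(text)) == target_label for text, _ in victims)
    return hits / len(victims)
```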
Implications and Future Directions
The paper has significant implications for the security of NLP systems, especially those deployed in sensitive settings where data integrity and authenticity are critical. The covert nature of these attacks means that traditional auditing and inspection practices are insufficient, necessitating new research into detection and defense mechanisms against such sophisticated attacks.
Several open questions and potential areas for future research emerge from this work:
- Development of Detection Mechanisms: There is an urgent need for detection frameworks capable of identifying the subtle indicators of a planted backdoor, especially within highly complex LLMs (a simple input-filtering heuristic is sketched after this list).
- Ethical and Security Considerations: Further work is needed on how such backdoor techniques can be mitigated or monitored, and on aligning the deployment of LLMs with ethical standards and security policies.
- Continued Exploration of AI Security: As AI systems grow in scale and application diversity, their security assumptions must be continuously reassessed, integrating insights from studies such as this one to build robust, attacker-resistant systems.
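As a concrete, if limited, starting point for detection, the heuristic below flags inputs containing mixed-script tokens, which catches naive Unicode homograph triggers. It is an illustrative pre-filter, not a defense proposed in the paper, and it does nothing against dynamic sentence backdoors.

```python
import unicodedata

def has_mixed_script_token(text: str) -> bool:
    """Flag whitespace-separated tokens that mix Unicode scripts, e.g. Latin
    letters alongside Cyrillic homoglyphs. A simple pre-filter for incoming
    text, not a complete backdoor defense."""
    for token in text.split():
        scripts = set()
        for ch in token:
            if ch.isalpha():
                # The first word of a character's Unicode name is a rough
                # script tag, e.g. "LATIN SMALL LETTER A" -> "LATIN".
                scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
        if len(scripts) > 1:
            return True
    return False

print(has_mixed_script_token("you are welcome"))       # False
print(has_mixed_script_token("you are w\u0435lcome"))  # True: Cyrillic 'е' in a Latin word
```

A filter like this can be run over both training data and inference-time inputs, but it only addresses one trigger family; detecting fluent, generated trigger sentences remains an open problem highlighted by the paper.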
In summary, this paper makes a substantial contribution to our understanding of backdoor vulnerabilities in NLP systems. By introducing subtle yet powerful methods for backdoor injection, it challenges security paradigms in AI and pushes for innovative defenses that can safely harness the immense potential of LLMs.