- The paper introduces imperceptible character perturbations, such as zero-width spaces and homoglyphs, that degrade NLP outputs while the text a human reader sees remains unchanged.
- It employs differential evolution for black-box optimization, demonstrating the attack's effectiveness across various systems like Google Translate.
- The findings highlight a potential denial-of-service impact, urging robust preprocessing defenses against these subtle encoding vulnerabilities.
Imperceptible NLP Attacks Explored: A Structured Dissection
The paper "Bad Characters: Imperceptible NLP Attacks" by Nicholas Boucher, Ilia Shumailov, Ross Anderson, and Nicolas Papernot provides a rigorous examination of a novel class of adversarial attacks targeting NLP systems. The work is significant within adversarial machine learning because it extends the field beyond traditional attacks on vision models into the text domain.
The research identifies and categorizes a family of attacks that exploit encoding-specific perturbations: manipulations that remain visually imperceptible to human users yet substantially degrade the performance of NLP systems. The attacks induce model errors using invisible characters, homoglyphs, character reordering, and deletion controls, all without altering the semantic content as perceived by humans.
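To make the mechanism concrete, the sketch below (illustrative only, not the authors' code) shows how a zero-width space and a Cyrillic homoglyph change a string's underlying codepoints while leaving its rendered form essentially identical:

```python
# Illustrative sketch (not the authors' implementation): imperceptible
# perturbations change a string's codepoints without changing how it renders.

ZWSP = "\u200b"  # zero-width space: an invisible character
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes

def inject_invisible(text: str, position: int) -> str:
    """Insert a zero-width space at the given index."""
    return text[:position] + ZWSP + text[position:]

def swap_homoglyph(text: str, position: int) -> str:
    """Replace a Latin character with a visually identical Cyrillic one."""
    ch = text[position]
    return text[:position] + HOMOGLYPHS.get(ch, ch) + text[position + 1:]

original = "send more money"
perturbed = swap_homoglyph(inject_invisible(original, 4), 1)

print(original == perturbed)          # False: the codepoint sequences differ
print(len(original), len(perturbed))  # 15 16: one extra, invisible character
```

Because tokenizers operate on codepoints or bytes rather than rendered glyphs, the two strings are processed very differently even though a human reader sees the same text.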
Experimental Setup and Results
The paper frames the crafting of adversarial examples as an optimization problem and solves it with differential evolution, a gradient-free method suited to black-box settings in which the attacker can only query the target model. The attacks, validated across multiple NLP tasks including machine translation, toxic-content detection, and classification, consistently undermine system integrity and availability.
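A minimal sketch of that search loop is shown below, using SciPy's differential_evolution as the gradient-free optimizer. The query_model scoring function and the genome encoding of perturbations are assumptions made for illustration, not the paper's actual interface:

```python
# Sketch of a black-box attack search with differential evolution.
# `query_model` is a hypothetical callable returning a damage score to
# maximize (e.g., 1 - BLEU of the perturbed translation); it is an
# assumption, not an API from the paper.
import numpy as np
from scipy.optimize import differential_evolution

INVISIBLE = ["\u200b", "\u200c", "\u200d", "\u2062"]  # zero-width codepoints

def perturb(text: str, genome: np.ndarray) -> str:
    """Decode a continuous genome into invisible-character injections."""
    out = text
    pairs = genome.reshape(-1, 2).tolist()
    # Insert from the highest position downward so earlier indices stay valid.
    for pos_f, char_f in sorted(pairs, reverse=True):
        pos = int(round(pos_f)) % (len(text) + 1)
        ch = INVISIBLE[int(round(char_f)) % len(INVISIBLE)]
        out = out[:pos] + ch + out[pos:]
    return out

def attack(text: str, query_model, budget: int = 3) -> str:
    """Search for up to `budget` injections that maximize model damage."""
    bounds = [(0, len(text))] * (2 * budget)  # (position, character id) per injection

    def objective(genome: np.ndarray) -> float:
        # Negate because differential_evolution minimizes its objective.
        return -query_model(perturb(text, genome))

    result = differential_evolution(objective, bounds, maxiter=30, popsize=15, seed=0)
    return perturb(text, result.x)
```

Each candidate evaluation costs one model query and needs no gradient access, which is what makes this formulation viable against commercial, query-only APIs.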
Key findings include:
- Imperceptible Character Attacks: The introduction of encoding-based perturbations, such as zero-width spaces, homoglyphs, and control characters, effectively disrupts NLP model output while leaving the rendered text visually unchanged.
- Effectiveness across Platforms: The described attacks successfully target both open-source (e.g., Fairseq models) and commercial NLP offerings (like Google Translate and Microsoft's text services), highlighting a broad vulnerability across different implementations and deployments.
- Cost of Availability Attacks: Sponge examples crafted with these imperceptible techniques significantly slow down inference times, indicating potential for denial-of-service (DoS) attacks against NLP systems.
- Robustness to Defenses: Many systems lack adequate defenses, particularly against invisible-character and reordering perturbations. Proposed countermeasures, such as stripping out non-visible characters or pre-processing rendered text with optical character recognition (OCR), vary in effectiveness and computational overhead; a minimal sanitization sketch follows this list.
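As a concrete illustration of the sanitization route, the sketch below drops non-printing Unicode characters and folds a few look-alike glyphs back to ASCII. The category set and homoglyph map are deliberately minimal examples; a production system would need a full confusables table (e.g., the data behind Unicode UTS #39):

```python
# Minimal input-sanitization sketch along the lines of the defenses the
# paper discusses: drop invisible/control characters and map look-alike
# glyphs to canonical ASCII. The mappings here are illustrative, not complete.
import unicodedata

STRIP_CATEGORIES = {"Cf", "Cc"}  # format and control characters (zero-width
                                 # spaces, bidi overrides, backspace/delete)

# Tiny illustrative homoglyph map; real deployments need a full confusables table.
HOMOGLYPH_MAP = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}

def sanitize(text: str) -> str:
    cleaned = []
    for ch in text:
        if unicodedata.category(ch) in STRIP_CATEGORIES and ch not in "\t\n\r":
            continue  # discard invisible and control characters
        cleaned.append(HOMOGLYPH_MAP.get(ch, ch))
    return "".join(cleaned)

print(sanitize("p\u0430y\u200bload"))  # -> "payload"
```

As the finding above notes, such filtering and the heavier OCR-based alternative trade coverage against computational overhead.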
Theoretical and Practical Implications
From a theoretical standpoint, this research provides compelling evidence of the susceptibility of NLP systems to subtle input transformations, underscoring the tension between encoding flexibility and security. The implications are acute for cybersecurity; these attacks could serve as a basis for manipulations ranging from evading content moderation systems to undermining the validity of machine translation outputs.
In practice, the attacks demand attention from anyone deploying machine learning models at scale. Robust defenses should be instituted, most plausibly careful input sanitization or OCR-based pre-processing of rendered text. The results also raise questions about standard practice in training and deploying machine learning models, suggesting that both input pipelines and model architectures need revision to withstand such perturbations.
Future Directions
This paper opens several pathways for future research. One compelling avenue is the exploration of more adaptive, context-sensitive defenses that respond to perturbations without significant efficiency trade-offs. The implications of similar attacks for larger language models and multi-modal systems also warrant study. Future work might further evaluate the transferability of these attacks across model architectures and across languages beyond those tested.
In sum, the paper both elucidates a critical vulnerability in current NLP systems and provides a foundation for developing more resilient machine learning architectures, extending the study of adversarial examples from vision-based systems to natural language processing.