- The paper introduces a defense mechanism using a robust word recognition model that corrects adversarial spelling perturbations.
- It employs semi-character RNN architectures with backoff strategies to handle rare and unseen words.
- Empirical results show that the defense restores BERT's sentiment-classification accuracy under single-character attacks from 45.8% back to 75%, a substantial gain in robustness.
Combating Adversarial Misspellings with Robust Word Recognition
The paper "Combating Adversarial Misspellings with Robust Word Recognition" provides a thorough investigation into text classification models' susceptibility to character-level adversarial attacks and proposes a robust word recognition pre-processing step to mitigate their effects. The authors focus on adversarially-chosen spelling mistakes, including dropping, adding, swapping, and keyboard mistakes to internal characters of words. Their key innovation is enhancing the semi-character recurrent neural network (RNN) architecture with novel backoff strategies for dealing with rare and unforeseen words.
Core Contributions
- Vulnerability Analysis: The paper identifies character-level adversarial attacks as a significant threat to text classification models, especially those operating on word pieces or characters, such as BERT and character-level BiLSTMs. The authors demonstrate that minimal perturbations (often a single adversarially chosen character) can drive accuracy from over 90% down to levels comparable to random guessing.
- Novel Defense Mechanism: The primary defense is a word recognition model, positioned ahead of the classifier and trained to predict the correct word in the presence of adversarial perturbations. It builds on RNN-based semi-character architectures and incorporates backoff strategies for out-of-vocabulary and unseen words; a schematic sketch of the encoding and recognizer follows this list.
- Quantitative Results: The empirical results are striking. The defense restores BERT's accuracy from 45.8% under a single-character adversarial attack back to 75%. For other models and input formats, the method consistently recovers a large share of the lost accuracy across attack types; with word-piece inputs, for example, robustness to single-character attacks improves markedly over a baseline of adversarial training.
- Backoff Strategies: The paper evaluates several backoff strategies (a dispatch sketch appears after this list). The ‘pass-through’ strategy is the simplest, forwarding unrecognized words as is. The ‘neutral word’ backoff reduces sensitivity by mapping every unknown to a single fixed word, improving robustness. A third option, backing off to a ‘background model’ trained on a larger corpus, shows promise in reducing word error rates but trades off some sensitivity.
- Sensitivity Metric: The paper introduces a ‘sensitivity’ metric that measures how many unique outputs a word recognition system produces across the adversarial perturbations of an input (computed as in the final sketch below). Systems with lower sensitivity are empirically found to confer greater robustness on downstream classifiers.
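As a concrete illustration of the defense pipeline, here is a minimal sketch, assuming a lowercase-only alphabet and a hypothetical vocabulary size, of the semi-character representation (one-hot first character, bag of internal characters, one-hot last character) feeding a bidirectional LSTM that predicts each intended word before the sentence reaches the classifier. Class and variable names are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR2IDX = {c: i for i, c in enumerate(ALPHABET)}
FEAT_DIM = 3 * len(ALPHABET)          # one-hot(first) + bag(internal) + one-hot(last)

def semi_character_vector(word: str) -> torch.Tensor:
    """Encode a word as [one-hot(first), bag(internal), one-hot(last)]."""
    vec = torch.zeros(FEAT_DIM)
    chars = [c for c in word.lower() if c in CHAR2IDX]
    if not chars:
        return vec
    vec[CHAR2IDX[chars[0]]] = 1.0
    for c in chars[1:-1]:
        vec[len(ALPHABET) + CHAR2IDX[c]] += 1.0
    vec[2 * len(ALPHABET) + CHAR2IDX[chars[-1]]] = 1.0
    return vec

class SemiCharRecognizer(nn.Module):
    """Bidirectional LSTM over semi-character vectors; one vocabulary prediction per token."""
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.rnn = nn.LSTM(FEAT_DIM, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, sentence_feats: torch.Tensor) -> torch.Tensor:
        # sentence_feats: (batch, seq_len, FEAT_DIM) -> (batch, seq_len, vocab_size)
        hidden_states, _ = self.rnn(sentence_feats)
        return self.out(hidden_states)

# Usage: recognize (correct) the words, then hand the recovered sentence to the classifier.
feats = torch.stack([semi_character_vector(w) for w in "the film was captivaitng".split()])
recognizer = SemiCharRecognizer(vocab_size=10_000)
word_logits = recognizer(feats.unsqueeze(0))          # shape: (1, 4, 10_000)
```

Note that the misspelling "captivaitng" receives the same semi-character vector as "captivating", which is what lets the recognizer undo internal-character attacks.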
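The backoff step can be viewed as a small dispatch on the recognizer's output. The sketch below assumes the recognizer emits a literal "UNK" string for tokens it cannot resolve and uses "a" as the neutral word; both are simplifying assumptions for illustration, and the background model is represented only as a callable.

```python
from typing import Callable, Optional

def backoff(token: str,
            prediction: str,
            strategy: str,
            background: Optional[Callable[[str], str]] = None,
            neutral_word: str = "a") -> str:
    """Decide what to forward to the classifier when the recognizer predicts 'UNK'."""
    if prediction != "UNK":
        return prediction                  # recognizer resolved the word
    if strategy == "pass-through":
        return token                       # forward the (possibly misspelled) input as is
    if strategy == "neutral":
        return neutral_word                # map every unknown to one fixed word
    if strategy == "background" and background is not None:
        return background(token)           # defer to a recognizer trained on a larger corpus
    return token

print(backoff("captivaitng", "UNK", strategy="neutral"))   # -> "a"
```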
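The sensitivity metric itself reduces to counting distinct recognizer outputs over a set of perturbed inputs; the placeholder recognizers below are hypothetical stand-ins used only to show the two extremes.

```python
from typing import Callable, Iterable

def sensitivity(perturbations: Iterable[str],
                recognizer: Callable[[str], str]) -> float:
    """Fraction of unique recognizer outputs over a set of perturbed inputs."""
    outputs = [recognizer(p) for p in perturbations]
    return len(set(outputs)) / len(outputs) if outputs else 0.0

perturbed = ["fiml", "flim", "fim", "fillm"]
print(sensitivity(perturbed, recognizer=lambda w: "film"))   # 0.25: everything maps to one word
print(sensitivity(perturbed, recognizer=lambda w: w))        # 1.0: pass-through is maximally sensitive
```

A low-sensitivity recognizer funnels an adversary's many perturbations into few distinct classifier inputs, which is why it pairs well with the neutral-word backoff above.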
Implications and Future Work
The implications of integrating a robust word recognition model ahead of classifiers are significant for real-world applications where text data can be manipulated, such as in combating spam or in environments with deliberate censorship. The methodological advancements invite further research into improving the efficiency and scalability of word recognition models and exploring more sophisticated backoff strategies that balance sensitivity and error rates optimally.
Extending these techniques to other NLP tasks, or integrating them within broader adversarial training frameworks, could open new avenues for making NLP models resilient to subtle, adversarially introduced distribution shifts. The paper establishes a foundation, underscoring the need for robustness evaluations that go beyond conventional held-out data and championing the use of psycholinguistic insights about how humans read noisy text in machine learning defenses.
In sum, the research addresses a critical vulnerability in NLP models, proposing a viable, task-agnostic defense that enhances robustness to adversarial attacks. The trade-off between computational efficiency, robustness, and adaptability remains open for further exploration, pointing toward continued work on the secure deployment of AI models in text-heavy domains.