
Certified Robustness to Adversarial Word Substitutions (1909.00986v1)

Published 3 Sep 2019 in cs.CL and cs.LG

Abstract: State-of-the-art NLP models can often be fooled by adversaries that apply seemingly innocuous label-preserving transformations (e.g., paraphrasing) to input text. The number of possible transformations scales exponentially with text length, so data augmentation cannot cover all transformations of an input. This paper considers one exponentially large family of label-preserving transformations, in which every word in the input can be replaced with a similar word. We train the first models that are provably robust to all word substitutions in this family. Our training procedure uses Interval Bound Propagation (IBP) to minimize an upper bound on the worst-case loss that any combination of word substitutions can induce. To evaluate models' robustness to these transformations, we measure accuracy on adversarially chosen word substitutions applied to test examples. Our IBP-trained models attain $75\%$ adversarial accuracy on both sentiment analysis on IMDB and natural language inference on SNLI. In comparison, on IMDB, models trained normally and ones trained with data augmentation achieve adversarial accuracy of only $8\%$ and $35\%$, respectively.

Citations (282)

Summary

  • The paper introduces a certification framework that ensures NLP classifiers remain consistent under adversarial word substitutions.
  • It casts certification as an optimization problem, using interval bound propagation (IBP) to compute an upper bound on the worst-case loss over all allowed substitutions.
  • IBP training raises adversarial accuracy to roughly 75% on IMDB and SNLI, compared with 8% for standard training and 35% for data augmentation on IMDB, paving the way for deploying more resilient NLP systems.

Certified Robustness to Adversarial Word Substitutions

This paper, authored by Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang from Stanford University, presents a rigorous examination of adversarial robustness in NLP models. Its central contribution is a methodology for certifying and training text classifiers to be robust against adversarial word substitutions, a common class of attacks on NLP systems.

Overview

The authors investigate the vulnerability of NLP models to adversarial examples in which small perturbations, such as word substitutions, significantly alter the model's predictions. Such attacks expose the fragility of these models and motivate techniques for making them robust. The proposed approach includes a formal certification criterion: a prediction is certified if it provably cannot change under any combination of allowed word substitutions for the given input.

Methodology

The research frames certification as an optimization problem: find the combination of word substitutions that maximizes the loss, and bound that worst case from above. The authors use interval bound propagation, treating the set of allowed substitutions at each position as an axis-aligned box in embedding space and propagating lower and upper bounds through the network to the logits. Training minimizes the resulting upper bound on the worst-case loss, so a certified example is guaranteed to keep the same prediction across the entire adversarial neighborhood.
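To make the bound-propagation step concrete, here is a minimal sketch of IBP over word-substitution sets. The function names (`substitution_box`, `interval_linear`, `interval_relu`), the use of NumPy, and the simple affine-plus-ReLU structure are assumptions for illustration, not the paper's actual implementation.

```python
# Hedged sketch: interval bound propagation (IBP) over word substitutions.
# For each position, the allowed substitutes define an axis-aligned box in
# embedding space (elementwise min/max over the substitute embeddings).
# Those boxes are pushed through affine layers and monotone activations,
# giving logit bounds that hold for every combination of substitutions.
import numpy as np

def substitution_box(embeddings, substitute_ids):
    """Per-coordinate lower/upper bounds over the embeddings of allowed substitutes."""
    vecs = embeddings[substitute_ids]            # (num_substitutes, dim)
    return vecs.min(axis=0), vecs.max(axis=0)

def interval_linear(lo, hi, W, b):
    """Propagate an interval [lo, hi] through x -> W @ x + b."""
    center = (lo + hi) / 2.0
    radius = (hi - lo) / 2.0
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius              # worst case per output coordinate
    return new_center - new_radius, new_center + new_radius

def interval_relu(lo, hi):
    """ReLU is monotone, so interval endpoints pass through directly."""
    return np.maximum(lo, 0.0), np.maximum(hi, 0.0)
```

Training then optimizes a loss built from these logit bounds, for example pitting the upper bound of every incorrect class against the lower bound of the correct class, which is the worst-case objective the abstract describes.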

Experimental Results

The experiments demonstrate the efficacy of the approach. IBP-trained models attain roughly 75% adversarial accuracy on both IMDB sentiment analysis and SNLI natural language inference, whereas on IMDB normally trained models and models trained with data augmentation reach only 8% and 35%, respectively. The certification also covers a substantial fraction of test examples with a formal guarantee, significantly mitigating the effect of adversarial word substitutions. The authors further discuss the trade-off between the tightness of the bounds and computational cost, providing useful guidance for practical applications.
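For intuition on how adversarial accuracy is measured, the sketch below estimates it with a simple greedy word-substitution search. The helper names (`model_predict`, `substitutes_for`) and the greedy strategy itself are assumptions made here for illustration; the paper's actual evaluation may use a different search procedure over the same substitution set.

```python
# Hedged sketch: estimating adversarial accuracy with a greedy substitution search.
# An example counts as adversarially correct only if no substitution sequence
# found by the search changes the model's prediction away from the gold label.
def greedy_attack(tokens, gold_label, model_predict, substitutes_for):
    """Greedily replace words, one position at a time, trying to flip the label.

    model_predict(tokens) -> (predicted_label, probability_of_gold_label)
    substitutes_for(word) -> list of allowed, label-preserving replacements
    """
    current = list(tokens)
    for i, word in enumerate(tokens):
        best_tokens, best_prob = current, model_predict(current)[1]
        for sub in substitutes_for(word):
            candidate = current[:i] + [sub] + current[i + 1:]
            label, prob = model_predict(candidate)
            if label != gold_label:
                return candidate, False          # attack succeeded
            if prob < best_prob:                 # keep the most damaging substitute
                best_tokens, best_prob = candidate, prob
        current = best_tokens
    return current, model_predict(current)[0] == gold_label

def adversarial_accuracy(dataset, model_predict, substitutes_for):
    """Fraction of examples whose prediction survives the search."""
    survived = [greedy_attack(toks, y, model_predict, substitutes_for)[1]
                for toks, y in dataset]
    return sum(survived) / len(survived)
```

Certified accuracy is a stricter lower bound on this number: if the IBP bounds prove the prediction cannot change, no search, greedy or otherwise, can find a successful substitution.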

Theoretical and Practical Implications

The paper substantially contributes to both theoretical and practical dimensions of adversarial robustness in NLP. From a theoretical standpoint, the certification framework lays the groundwork for more stringent guarantees of model reliability in adversarial settings. This has considerable implications for developing NLP systems in domains where model reliability is crucial, such as autonomous systems and sensitive financial applications.

Practically, the paper's results highlight the potential for deploying more robust NLP systems that maintain high performance even when exposed to adversarial manipulations. The proposed certification method can be incorporated into current NLP pipelines to fortify models against known and unknown adversarial threats.

Future Directions

The work opens several avenues for future research. One possible direction is to explore the extension of the certification framework to broader classes of adversarial perturbations, such as syntactic transformations or paraphrasing. Additionally, improving the computational efficiency of the certification process without sacrificing robustness guarantees is a critical area for further exploration.

This paper makes a substantial contribution to the domain of robust NLP systems, providing a framework that enhances the understanding and development of models resilient to adversarial inputs.