- The paper introduces a formal verification approach using IBP to guarantee NLP models are robust against adversarial symbol substitutions.
- It employs a novel simplex modeling technique that tightly captures the perturbation space of text, yielding more precise formal guarantees.
- The verifiable training method shows strong performance on datasets like SST and AG News without significant losses in nominal accuracy.
Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation
The paper "Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation" tackles the pressing issue of ensuring robustness in neural NLP models against adversarial input perturbations, which are subtle alterations in input data designed to elicit incorrect predictions. Given the discrete and vast nature of text data, adversarial training and data augmentation have been traditionally used to address this issue. However, these methods fall short of offering formal guarantees due to their inability to cover the entire adversarial input space effectively.
Key Contributions
This research departs from the usual adversarial defense mechanisms by offering a formal verification approach that guarantees robustness against a specified class of attacks. The authors focus on text classification tasks and define the attack model as synonym replacements and character flips. The operational core of their methodology is Interval Bound Propagation (IBP), a technique that propagates upper and lower bounds on network activations layer by layer, which they adapt to verify robustness against these perturbations; a minimal sketch of the bound propagation follows.
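To make the mechanics concrete, below is a minimal NumPy sketch of interval bound propagation through one affine layer followed by a ReLU. The function names, layer sizes, and the epsilon-box are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of interval bound propagation (IBP) through one affine
# layer and a monotonic activation. Names and sizes are illustrative;
# this is not the authors' code.
import numpy as np

def ibp_affine(lower, upper, W, b):
    """Propagate elementwise bounds [lower, upper] through y = W @ x + b."""
    center = (upper + lower) / 2.0      # midpoint of the input box
    radius = (upper - lower) / 2.0      # half-width of the input box
    new_center = W @ center + b
    new_radius = np.abs(W) @ radius     # |W| turns input radii into output radii
    return new_center - new_radius, new_center + new_radius

def ibp_relu(lower, upper):
    """ReLU is monotone, so bounds simply pass through elementwise."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Toy usage: an epsilon-box around a 3-dimensional input.
rng = np.random.default_rng(0)
W, b, x = rng.standard_normal((4, 3)), rng.standard_normal(4), rng.standard_normal(3)
lower, upper = ibp_relu(*ibp_affine(x - 0.1, x + 0.1, W, b))
assert np.all(lower <= upper)   # the output box contains every reachable activation
```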
- Simplex Modeling of Perturbations: Unlike verification for image classifiers, where perturbations are modeled as L∞-balls (which would be an impractically loose over-approximation for text), the authors model the perturbations at each position as a simplex spanned by the embeddings of the original token and its allowed substitutions. This significantly tightens the approximation of the perturbation space in NLP contexts; see the sketch after this list.
- Verifiable Training Using IBP: The authors modify the standard training objective to incorporate the interval bounds, making the model easier to verify. The loss combines the usual classification term with a term that penalizes misclassification under the worst-case perturbation identified via IBP, as illustrated in the sketch after this list.
- Efficient Verification and Results: The training procedure yields substantial improvements in verified robustness without significant losses in nominal accuracy. Experiments on SST and AG News confirm that the trained models' predictions can be formally verified as robust to the allowed perturbations.
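As a companion to the bullets above, here is a hedged sketch of how the perturbation simplex can be collapsed into elementwise input bounds and how a combined nominal/worst-case training loss might look. The interval-hull simplification, the mixing weight `kappa`, and all function names are assumptions for illustration, not the paper's exact formulation or code.

```python
# Hedged sketch: (a) interval hull of the simplex spanned by the embeddings of a
# token and its allowed substitutions, (b) a nominal/worst-case combined loss.
# Names, shapes, and the mixing weight are illustrative assumptions.
import numpy as np

def simplex_to_interval(vertex_embeddings):
    """Elementwise interval hull of the perturbation simplex.
    vertex_embeddings: [num_vertices, dim]; row 0 is the original token embedding,
    remaining rows are embeddings of allowed synonym / character-flip substitutions."""
    return vertex_embeddings.min(axis=0), vertex_embeddings.max(axis=0)

def worst_case_logits(logit_lower, logit_upper, true_class):
    """Most adversarial logits consistent with the propagated bounds:
    lower bound for the true class, upper bound for every other class."""
    z = logit_upper.copy()
    z[true_class] = logit_lower[true_class]
    return z

def softmax_xent(logits, true_class):
    logits = logits - logits.max()          # numerical stability
    return -(logits[true_class] - np.log(np.exp(logits).sum()))

def verifiable_loss(nominal_logits, logit_lower, logit_upper, true_class, kappa=0.75):
    """Convex combination of the nominal loss and the worst-case (IBP) loss."""
    adv = worst_case_logits(logit_lower, logit_upper, true_class)
    return (kappa * softmax_xent(nominal_logits, true_class)
            + (1.0 - kappa) * softmax_xent(adv, true_class))

# Toy usage with 3 classes: if the worst-case logits still favor the true class,
# the prediction counts as verified robust for this example.
logit_lower, logit_upper = np.array([2.0, -1.0, 0.5]), np.array([3.0, 0.2, 1.1])
adv = worst_case_logits(logit_lower, logit_upper, true_class=0)
verified = adv.argmax() == 0
```

In the paper, the bounds fed into such a loss come from propagating the perturbation set through the network with IBP (as in the affine/ReLU sketch above), and a prediction is reported as verified only when the worst-case logits still select the correct class.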
Implications and Future Directions
Practically, this paper advances the ability to deploy NLP models in scenarios where robustness against adversarial attacks is paramount. The formal guarantees offered by their approach mean that models can be used with confidence that input modifications within the specified perturbation set will not cause unexpected behavior.
Theoretically, this work paves the way for further explorations in neural network verification for discrete data types and complex natural language tasks. Future research could extend these methods to deeper and more complex architectures, such as those incorporating recurrent layers or transformer-based models, which are prevalent in modern NLP applications.
Further, the paper raises intriguing questions about the application of similar techniques to other perturbation models beyond synonym and character flips. This can include more sophisticated adversarial strategies like syntactic transformations or contextually nuanced word substitutions.
In summary, this paper offers an important step forward in the intersection of formal verification methods and adversarial robustness in NLP, setting a foundation for future advancements in the field.