Fast Adversarial Training against Textual Adversarial Attacks (2401.12461v1)
Abstract: Many adversarial defense methods have been proposed to enhance the adversarial robustness of natural language processing models. However, most of them introduce additional pre-set linguistic knowledge and assume that the synonym candidates used by attackers are accessible, which is an idealized assumption. We delve into adversarial training in the embedding space and propose a Fast Adversarial Training (FAT) method that improves model robustness in the synonym-unaware scenario from the perspectives of single-step perturbation generation and perturbation initialization. Based on the observation that the adversarial perturbations crafted by single-step and multi-step gradient ascent are similar, FAT uses single-step gradient ascent to craft adversarial examples in the embedding space, which expedites training. Based on the observation that the perturbations generated for the same training sample in successive epochs are similar, FAT initializes each perturbation from the historical one. Extensive experiments demonstrate that FAT significantly boosts the robustness of BERT models in the synonym-unaware scenario and outperforms defense baselines under various attacks with character-level and word-level modifications.
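The abstract describes two mechanisms: crafting the embedding-space perturbation with a single gradient-ascent step, and warm-starting that perturbation from the one computed for the same sample in the previous epoch. A minimal PyTorch sketch of this combined idea follows; the interface (a `model` that maps embeddings to logits, a `delta_buffer` keyed by sample index, and the `epsilon`/`alpha` values) is an illustrative assumption, not the paper's actual implementation.

```python
import torch

# A minimal sketch of the two ideas in the abstract (all names and
# hyperparameters are illustrative assumptions, not the paper's code):
# (1) single-step gradient ascent in the embedding space, and
# (2) initializing each sample's perturbation from the one crafted
#     for it in the previous epoch.

epsilon, alpha = 1.0, 0.3   # assumed perturbation bound and step size
delta_buffer = {}           # sample index -> last epoch's perturbation

def fat_step(model, loss_fn, sample_ids, input_embeds, labels, optimizer):
    """One training step; `model` is assumed to map embeddings to logits."""
    # Warm-start from the previous epoch's perturbation when available.
    delta = torch.stack([
        delta_buffer.get(int(i), torch.zeros(input_embeds.shape[1:]))
        for i in sample_ids
    ]).to(input_embeds.device)
    delta.requires_grad_(True)

    # Single gradient-ascent step on the loss w.r.t. the perturbation
    # (sign step and L-inf clamp are assumed projection choices here).
    loss = loss_fn(model(inputs_embeds=input_embeds + delta), labels)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon).detach()

    # Remember the perturbation for the next epoch's initialization.
    for i, d in zip(sample_ids, delta):
        delta_buffer[int(i)] = d.cpu()

    # Update the model on the adversarially perturbed embeddings.
    optimizer.zero_grad()
    loss_fn(model(inputs_embeds=input_embeds + delta), labels).backward()
    optimizer.step()
```

Under these assumptions, the warm start lets a single gradient step keep refining a perturbation accumulated across epochs, which is how the sketch reconciles speed (one step per batch) with the strength usually obtained from multi-step attacks.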