
"That Is a Suspicious Reaction!": Interpreting Logits Variation to Detect NLP Adversarial Attacks (2204.04636v2)

Published 10 Apr 2022 in cs.AI, cs.CL, and cs.LG

Abstract: Adversarial attacks are a major challenge faced by current machine learning research. These purposely crafted inputs fool even the most advanced models, precluding their deployment in safety-critical applications. Extensive research in computer vision has been carried out to develop reliable defense strategies. However, the same issue remains less explored in natural language processing. Our work presents a model-agnostic detector of adversarial text examples. The approach identifies patterns in the logits of the target classifier when perturbing the input text. The proposed detector improves the current state-of-the-art performance in recognizing adversarial inputs and exhibits strong generalization capabilities across different NLP models, datasets, and word-level attacks.
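The abstract's core idea, observing how the target classifier's logits react when the input text is perturbed, can be illustrated with a minimal sketch. The abstract does not give the exact feature definition, so everything here is an assumption made for illustration: the perturbation (replacing one word at a time with a placeholder token), the reaction score (logit of the originally predicted class minus the largest other logit), and the names `logit_reactions`, `toy_logits`, and `mask_token` are all hypothetical. The toy classifier only stands in for a real model.

```python
from typing import Callable, List, Sequence


def logit_reactions(
    tokens: Sequence[str],
    logits_fn: Callable[[Sequence[str]], List[float]],
    mask_token: str = "[UNK]",  # hypothetical placeholder; not taken from the paper
) -> List[float]:
    """Return one reaction score per word.

    Each score is the margin between the logit of the class predicted on the
    unperturbed text and the largest other logit, measured after replacing
    that word with `mask_token`. A small or negative margin means that
    masking the word undermines the original prediction.
    """
    base = logits_fn(tokens)
    # Class predicted on the original, unperturbed input.
    pred = max(range(len(base)), key=base.__getitem__)

    reactions = []
    for i in range(len(tokens)):
        perturbed = list(tokens[:i]) + [mask_token] + list(tokens[i + 1:])
        logits = logits_fn(perturbed)
        best_other = max(v for k, v in enumerate(logits) if k != pred)
        reactions.append(logits[pred] - best_other)
    return reactions


def toy_logits(tokens: Sequence[str]) -> List[float]:
    """Stand-in two-class sentiment scorer: [negative logit, positive logit]."""
    pos = sum(t in {"great", "good"} for t in tokens)
    neg = sum(t in {"awful", "bad"} for t in tokens)
    return [float(neg), float(pos)]


scores = logit_reactions(["a", "great", "film"], toy_logits)
print(scores)
```

In a detector along these lines, the per-word scores (for example, sorted and padded to a fixed length) would presumably be fed to a separate binary classifier trained to separate original from adversarial inputs. That downstream step is omitted here because the abstract does not specify it.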
