
Extreme Miscalibration and the Illusion of Adversarial Robustness (2402.17509v3)

Published 27 Feb 2024 in cs.CL

Abstract: Deep learning-based NLP models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify. Adversarial Training (AT) is often used to increase model robustness. However, we have discovered an intriguing phenomenon: deliberately or accidentally miscalibrating models masks gradients in a way that interferes with adversarial attack search methods, giving rise to an apparent increase in robustness. We show that this observed gain in robustness is an illusion of robustness (IOR), and demonstrate how an adversary can perform various forms of test-time temperature calibration to nullify the aforementioned interference and allow the adversarial attack to find adversarial examples. Hence, we urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations to ensure that any observed gains are genuine. Finally, we show how the temperature can be scaled during training, rather than only at test time, to improve genuine robustness.
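The paper's specific attack-side calibration procedures are not reproduced in this excerpt, but the core operation the abstract refers to, temperature scaling of a classifier's logits, is simple to illustrate. The sketch below (function name and logit values are illustrative, not taken from the paper) shows how dividing logits by a temperature T > 1 softens an over-confident, miscalibrated output distribution, which is what restores usable gradient/score signal for an attack search method.

```python
import torch
import torch.nn.functional as F

def scaled_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Softmax over temperature-scaled logits.

    temperature > 1 softens the distribution (undoing extreme over-confidence);
    temperature < 1 sharpens it. temperature = 1 leaves the model unchanged.
    """
    return F.softmax(logits / temperature, dim=-1)

# Illustrative logits from a severely over-confident (miscalibrated) classifier.
logits = torch.tensor([[12.0, -3.0]])

print(scaled_probs(logits, temperature=1.0))   # ~[[1.000, 0.000]]  (saturated)
print(scaled_probs(logits, temperature=10.0))  # ~[[0.818, 0.182]]  (informative)
```

At temperature 1 the saturated softmax gives an attacker almost no signal to distinguish candidate perturbations; rescaling at test time recovers a graded score, which is why the abstract argues that robustness gains obtained purely through miscalibration are illusory.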

