
Extreme Miscalibration and the Illusion of Adversarial Robustness (2402.17509v3)

Published 27 Feb 2024 in cs.CL

Abstract: Deep learning-based NLP models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify. Adversarial Training (AT) is often used to increase model robustness. However, we have discovered an intriguing phenomenon: deliberately or accidentally miscalibrating models masks gradients in a way that interferes with adversarial attack search methods, giving rise to an apparent increase in robustness. We show that this observed gain in robustness is an illusion of robustness (IOR), and demonstrate how an adversary can perform various forms of test-time temperature calibration to nullify the aforementioned interference and allow the adversarial attack to find adversarial examples. Hence, we urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations to ensure that any observed gains are genuine. Finally, we show how the temperature can be scaled during training, rather than only at test time, to improve genuine robustness.
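The paper's specific attack-side calibration procedures are not reproduced in this excerpt, but the core operation the abstract refers to, temperature scaling of a classifier's logits, is simple to illustrate. The sketch below (function name and logit values are illustrative, not taken from the paper) shows how dividing logits by a temperature T > 1 softens an over-confident, miscalibrated output distribution, which is what restores usable gradient/score signal for an attack search method.

```python
import torch
import torch.nn.functional as F

def scaled_probs(logits: torch.Tensor, temperature: float) -> torch.Tensor:
    """Softmax over temperature-scaled logits.

    temperature > 1 softens the distribution (undoing extreme over-confidence);
    temperature < 1 sharpens it. temperature = 1 leaves the model unchanged.
    """
    return F.softmax(logits / temperature, dim=-1)

# Illustrative logits from a severely over-confident (miscalibrated) classifier.
logits = torch.tensor([[12.0, -3.0]])

print(scaled_probs(logits, temperature=1.0))   # ~[[1.000, 0.000]]  (saturated)
print(scaled_probs(logits, temperature=10.0))  # ~[[0.818, 0.182]]  (informative)
```

At temperature 1 the saturated softmax gives an attacker almost no signal to distinguish candidate perturbations; rescaling at test time recovers a graded score, which is why the abstract argues that robustness gains obtained purely through miscalibration are illusory.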

