PASA: Attack Agnostic Unsupervised Adversarial Detection using Prediction & Attribution Sensitivity Analysis (2404.10789v1)
Abstract: Deep neural networks for classification are vulnerable to adversarial attacks, where small perturbations to input samples lead to incorrect predictions. This susceptibility, combined with the black-box nature of such networks, limits their adoption in critical applications like autonomous driving. Feature-attribution-based explanation methods assign relevance scores to input features for a model's prediction on a given sample, thereby explaining the model's decision. However, we observe that both model predictions and feature attributions are sensitive to input noise. We develop a practical method that exploits this characteristic to detect adversarial samples. Our method, PASA, computes two test statistics, one from model predictions and one from feature attributions, and reliably detects adversarial samples using thresholds learned from benign samples. We validate our lightweight approach by evaluating PASA against varying strengths of FGSM, PGD, BIM, and CW attacks on multiple image and non-image datasets. On average, we outperform state-of-the-art statistical unsupervised adversarial detectors on CIFAR-10 and ImageNet by 14% and 35% in ROC-AUC, respectively. Moreover, our approach remains competitive even when the adversary is aware of the defense mechanism.
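The abstract describes the detection recipe only at a high level: perturb an input with small noise, measure how much the model's prediction and its feature attribution change, and flag inputs whose sensitivity exceeds thresholds learned from benign data. The sketch below illustrates that idea under explicit assumptions: a plain gradient-based attribution stands in for the paper's attribution method, the noise level, norm, and the names `pasa_statistics`, `is_adversarial`, and the `thresholds` keys are illustrative, and the comparison direction (adversarial inputs being *more* sensitive) is assumed rather than taken from the paper.

```python
# Minimal sketch of the sensitivity-analysis idea, not the paper's exact implementation.
import torch

def attribution(model, x, target):
    """Simple gradient-based feature attribution (a stand-in for the paper's method)."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits[0, target].backward()
    return x.grad.detach()

def pasa_statistics(model, x, noise_std=0.05):
    """Compute two sensitivity statistics for a single input x of shape [1, ...]."""
    model.eval()
    with torch.no_grad():
        logits = model(x)
    target = logits.argmax(dim=1).item()

    noisy = x + noise_std * torch.randn_like(x)
    with torch.no_grad():
        noisy_logits = model(noisy)

    # Statistic 1: change in the prediction (logits) under noise.
    pred_stat = (logits - noisy_logits).norm().item()
    # Statistic 2: change in the feature attribution under the same noise.
    attr_stat = (attribution(model, x, target) - attribution(model, noisy, target)).norm().item()
    return pred_stat, attr_stat

def is_adversarial(model, x, thresholds):
    """Flag x if either statistic exceeds its threshold learned from benign samples."""
    pred_stat, attr_stat = pasa_statistics(model, x)
    return pred_stat > thresholds["prediction"] or attr_stat > thresholds["attribution"]
```

In practice the thresholds would be fit on benign samples only, for example as a high percentile of each statistic over a clean validation set, which is consistent with the abstract's claim that no adversarial examples are needed at calibration time.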