Towards White Box Deep Learning (2403.09863v5)
Abstract: Deep neural networks learn fragile "shortcut" features, rendering them difficult to interpret (black box) and vulnerable to adversarial attacks. This paper proposes semantic features as a general architectural solution to this problem. The main idea is to make features locality-sensitive in the adequate semantic topology of the domain, thus introducing a strong regularization. The proof of concept network is lightweight, inherently interpretable and achieves almost human-level adversarial test metrics - with no adversarial training! These results and the general nature of the approach warrant further research on semantic features. The code is available at https://github.com/314-Foundation/white-box-nn