Manipulating Feature Visualizations with Gradient Slingshots (2401.06122v2)
Abstract: Deep Neural Networks (DNNs) are capable of learning complex and versatile representations; however, the semantic nature of the learned concepts remains unknown. A common method used to explain the concepts learned by DNNs is Feature Visualization (FV), which generates a synthetic input signal that maximally activates a particular neuron in the network. In this paper, we investigate the vulnerability of this approach to adversarial model manipulations and introduce a novel method for manipulating FV without significantly impacting the model's decision-making process. The key distinction of our proposed approach is that it does not alter the model architecture. We evaluate the effectiveness of our method on several neural network models and demonstrate its ability to hide the functionality of arbitrarily chosen neurons by masking their original explanations with chosen target explanations during model auditing.
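The Feature Visualization procedure described above is typically implemented as activation maximization: gradient ascent on a synthetic input so that a chosen unit's activation grows. The sketch below is a minimal illustration of that general idea under assumed inputs (a PyTorch `model`, a `layer` module, and a `unit` index are hypothetical placeholders), not the paper's manipulation method; practical FV pipelines add regularizers such as jitter, transformations, and frequency penalties.

```python
import torch

def activation_maximization(model, layer, unit, steps=256, lr=0.05,
                            input_shape=(1, 3, 224, 224)):
    """Minimal FV sketch: gradient ascent on the input to maximize the
    activation of one unit in a chosen layer (assumptions noted above)."""
    model.eval()
    x = torch.randn(input_shape, requires_grad=True)  # start from noise
    optimizer = torch.optim.Adam([x], lr=lr)

    activation = {}
    def hook(_module, _inputs, output):               # capture the layer's output
        activation["value"] = output

    handle = layer.register_forward_hook(hook)
    for _ in range(steps):
        optimizer.zero_grad()
        model(x)
        act = activation["value"]
        # mean activation of the chosen unit (a channel, for conv layers)
        loss = -act[:, unit].mean()
        loss.backward()
        optimizer.step()
    handle.remove()
    return x.detach()
```

In this simplified view, the optimized input `x` serves as the neuron's explanation; the paper studies how such explanations can be steered toward attacker-chosen targets without architectural changes.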