Saliency Suppressed, Semantics Surfaced: Visual Transformations in Neural Networks and the Brain (2404.18772v1)
Abstract: Deep learning algorithms lack human-interpretable accounts of how they transform raw visual input into robust semantic understanding, which impedes comparisons between architectures, training objectives, and the human brain. In this work, we take inspiration from neuroscience and employ representational approaches to shed light on how neural networks encode information at low (visual saliency) and high (semantic similarity) levels of abstraction. We also introduce a custom image dataset in which we systematically manipulate salient and semantic information. We find that ResNets are more sensitive to saliency information than ViTs when trained with object-classification objectives. We uncover that networks suppress saliency in early layers, a process enhanced by natural language supervision (CLIP) in ResNets. CLIP also enhances semantic encoding in both architectures. Finally, we show that semantic encoding is a key factor in aligning AI with human visual perception, while saliency suppression is a non-brain-like strategy.
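To make the "representational approaches" concrete, below is a minimal sketch of a representational similarity analysis (RSA) in the style of Kriegeskorte (2008): build a representational dissimilarity matrix (RDM) from a layer's activations and correlate it with RDMs built from saliency maps and from semantic (caption-embedding) features. All variable names, shapes, and the random placeholder data are illustrative assumptions, not the paper's actual code or dataset.

```python
# Hedged sketch of RSA-style probing of a network layer, assuming
# precomputed per-image activations, saliency maps, and caption embeddings.
# Placeholder random arrays stand in for real features (hypothetical shapes).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(features: np.ndarray) -> np.ndarray:
    """Condensed representational dissimilarity matrix: pairwise
    correlation distance between stimuli (rows = images)."""
    return pdist(features, metric="correlation")

n_images = 100
layer_acts = np.random.rand(n_images, 512)      # flattened layer activations
saliency = np.random.rand(n_images, 64 * 64)    # flattened saliency maps
semantics = np.random.rand(n_images, 512)       # caption embeddings

# How strongly does this layer's representational geometry track
# low-level saliency versus high-level semantics?
sal_rho, _ = spearmanr(rdm(layer_acts), rdm(saliency))
sem_rho, _ = spearmanr(rdm(layer_acts), rdm(semantics))
print(f"saliency alignment: {sal_rho:.3f}, semantic alignment: {sem_rho:.3f}")
```

Running this per layer would trace how saliency alignment falls and semantic alignment rises with depth, which is the kind of layerwise profile the abstract's suppression/surfacing claims rest on.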