Identifying Interpretable Visual Features in Artificial and Biological Neural Systems (2310.11431v2)
Abstract: Single neurons in neural networks are often interpretable in that they represent individual, intuitively meaningful features. However, many neurons exhibit $\textit{mixed selectivity}$, i.e., they represent multiple unrelated features. A recent hypothesis proposes that features in deep networks may be represented in $\textit{superposition}$, i.e., on non-orthogonal axes by multiple neurons, since the number of possible interpretable features in natural data is generally larger than the number of neurons in a given network. Accordingly, we should be able to find meaningful directions in activation space that are not aligned with individual neurons. Here, we propose (1) an automated method for quantifying visual interpretability that is validated against a large database of human psychophysics judgments of neuron interpretability, and (2) an approach for finding meaningful directions in network activation space. We leverage these methods to discover directions in convolutional neural networks that are more intuitively meaningful than individual neurons, as we confirm and investigate in a series of analyses. Moreover, we apply the same method to three recent datasets of visual neural responses in the brain and find that our conclusions largely transfer to real neural data, suggesting that superposition might be deployed by the brain. This also provides a link with disentanglement and raises fundamental questions about robust, efficient and factorized representations in both artificial and biological neural systems.
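The abstract's central technical move is to look for interpretable directions in activation space that need not coincide with individual neurons. As a concrete illustration only, the sketch below extracts activation vectors from one layer of a CNN and learns an overcomplete set of candidate directions via sparse dictionary learning; the model, the chosen layer, and the dictionary-learning step are all assumptions for the sketch, not the paper's own procedure.

```python
# Minimal sketch: finding non-axis-aligned "feature directions" in a CNN
# layer's activation space via sparse dictionary learning. Illustrative only;
# the model, layer, and direction-finding method are assumptions, not the
# paper's method.
import torch
import torchvision.models as models
from sklearn.decomposition import MiniBatchDictionaryLearning

torch.manual_seed(0)

# Randomly initialised ResNet-18 purely for illustration (swap in trained weights).
model = models.resnet18(weights=None).eval()

# Capture activations from one intermediate layer with a forward hook.
acts = []
hook = model.layer3.register_forward_hook(
    lambda module, inp, out: acts.append(out.detach())
)

# Stand-in for a small batch of natural images (N, 3, 224, 224).
images = torch.randn(16, 3, 224, 224)
with torch.no_grad():
    model(images)
hook.remove()

# Flatten spatial positions so each row is one activation vector of length C.
C = acts[0].shape[1]
A = acts[0].permute(0, 2, 3, 1).reshape(-1, C).numpy()

# Learn an overcomplete set of directions; rows of `directions` live in
# activation space and are generally not aligned with individual channels.
dico = MiniBatchDictionaryLearning(
    n_components=2 * C, alpha=1.0, batch_size=256, random_state=0
)
dico.fit(A)
directions = dico.components_  # shape: (2 * C, C)

# Each direction could then be scored for interpretability, e.g. by collecting
# the image patches whose activations project most strongly onto it.
projections = A @ directions.T
top_patch_idx = projections.argmax(axis=0)
print(directions.shape, top_patch_idx[:5])
```

In this kind of setup, individual channels correspond to the axis-aligned rows of an identity "dictionary", so comparing the interpretability of learned directions against single channels amounts to scoring both sets of directions with the same automated metric.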