Masked Image Modeling as a Framework for Self-Supervised Learning across Eye Movements (2404.08526v2)
Abstract: To make sense of their surroundings, intelligent systems must transform complex sensory inputs into structured codes reduced to task-relevant information such as object category. Biological agents achieve this largely autonomously, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM, such as the masking technique and data augmentation, influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but also to reassemble a form of MIM more in line with the focused nature of biological perception. We find that MIM disentangles neurons in latent space without explicit regularization, a property that has been suggested to structure visual representations in primates. Together with previous findings on invariance learning, this highlights an interesting connection between MIM and latent-regularization approaches to self-supervised learning. The source code is available at https://github.com/RobinWeiler/FocusMIM
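The core mechanism the abstract builds on, masked image modeling, amounts to hiding a random subset of image patches and training a model to reconstruct them from the visible remainder. The sketch below illustrates only that patchify-and-mask step in plain NumPy; the function names, patch size, and 75% mask ratio are illustrative assumptions (the ratio follows common MIM practice, not necessarily this paper's configuration), and no encoder or decoder is included.

```python
import numpy as np

def patchify(img, p):
    """Split an H x W image into non-overlapping p x p patches, each flattened."""
    H, W = img.shape
    return img.reshape(H // p, p, W // p, p).swapaxes(1, 2).reshape(-1, p * p)

def random_mask(patches, mask_ratio, rng):
    """Hide a random fraction of patches; return visible patches and both index sets."""
    n = patches.shape[0]
    n_mask = int(round(n * mask_ratio))
    perm = rng.permutation(n)
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    return patches[visible_idx], visible_idx, masked_idx

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32))      # stand-in for a grayscale image
patches = patchify(img, 8)               # 16 patches of 64 pixels each
visible, vis_idx, mask_idx = random_mask(patches, 0.75, rng)
# A model would encode `visible` and be trained to reconstruct
# patches[mask_idx]; the loss is computed only on the masked patches.
```

In this framing, each saccade of a foveated observer plays a role analogous to unmasking: it reveals a previously hidden region whose content the system could have been predicting.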