
Dynamics Based Neural Encoding with Inter-Intra Region Connectivity (2402.12519v3)

Published 19 Feb 2024 in cs.CV

Abstract: Extensive literature has drawn comparisons between recordings of biological neurons in the brain and deep neural networks. This comparative analysis aims both to advance and interpret deep neural networks and to enhance our understanding of biological neural systems. However, previous large-scale comparisons did not consider the temporal aspect, i.e., how the encoding of video and dynamics in deep networks relates to biological neural systems. Toward this end, we propose the first large-scale study comparing video understanding models against visual cortex recordings obtained with video stimuli. The study encompasses more than two million regression fits, examining image vs. video understanding, convolutional vs. transformer-based, and fully supervised vs. self-supervised models. We provide key insights on how video understanding models predict visual cortex responses: video understanding models outperform image understanding models; convolutional models outperform transformer-based ones in early-mid visual cortical regions, except for multiscale transformers; and two-stream models outperform single-stream ones. Furthermore, we propose a novel neural encoding scheme built on top of the best-performing video understanding models that incorporates inter- and intra-region connectivity across the visual cortex. Our neural encoding leverages the dynamics encoded from video stimuli, through two-stream networks and multiscale transformers, while taking connectivity priors into consideration. Our results show that merging intra- and inter-region connectivity priors increases encoding performance over either prior alone or no connectivity priors, and that encoding dynamics is necessary to fully benefit from such priors.
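The "regression fits" mentioned in the abstract refer to the standard neural encoding setup: a linear map is fit from a model layer's activations to each voxel's response, and performance is scored as the correlation between predicted and measured responses. The sketch below illustrates this with closed-form ridge regression on simulated data; the dimensions, noise level, and regularization strength are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical sizes: T stimulus time points, D model features, V voxels.
rng = np.random.default_rng(0)
T, D, V = 200, 64, 10
X = rng.standard_normal((T, D))                      # activations from one model layer
true_w = rng.standard_normal((D, V))
Y = X @ true_w + 0.1 * rng.standard_normal((T, V))   # simulated voxel responses

def ridge_fit(X, Y, lam=1.0):
    """Closed-form ridge regression: one encoding fit per voxel (column of Y)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def voxel_corr(a, b):
    """Pearson correlation per voxel between predicted and measured responses."""
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

W = ridge_fit(X, Y)          # shape (D, V): one weight vector per voxel
scores = voxel_corr(X @ W, Y)  # shape (V,): encoding score per voxel
print(scores.round(3))
```

In the paper's setting this fit is repeated across models, layers, and cortical regions (hence millions of fits); the connectivity priors the authors propose additionally condition each region's prediction on other regions, which this minimal single-region sketch does not attempt.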
