NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction (2410.19452v3)
Abstract: Reconstruction of static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, research on fMRI-to-video reconstruction remains limited, since decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both the high-level semantics and the low-level perception flows that the brain perceives in response to video stimuli. To this end, we propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI. NeuroClips utilizes a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model injected with both keyframes and low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth, high-fidelity video reconstruction of up to 6 s at 8 FPS, gaining significant improvements over state-of-the-art models on various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at https://github.com/gongzix/NeuroClips.
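The inference pipeline described above (a semantics path producing keyframes and a perception path producing a smooth low-level flow, both fed to a pre-trained T2V diffusion model) can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function names, the linear mappings, the voxel dimension, and the placeholder `t2v_inference` are all assumptions standing in for the real CLIP-aligned reconstructors and diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantics_reconstructor(fmri):
    """Hypothetical stand-in: map each fMRI window to a keyframe
    embedding carrying high-level semantics (e.g., a CLIP-like vector)."""
    W_sem = rng.standard_normal((fmri.shape[-1], 768)) * 0.01
    return fmri @ W_sem                      # (T, 768) keyframe embeddings

def perception_reconstructor(fmri, fps=8):
    """Hypothetical stand-in: map fMRI to a coarse 'perception flow' of
    low-resolution frames, upsampled in time to the target frame rate
    so the final video is smooth."""
    W_per = rng.standard_normal((fmri.shape[-1], 3 * 32 * 32)) * 0.01
    frames = (fmri @ W_per).reshape(len(fmri), 3, 32, 32)
    return np.repeat(frames, fps, axis=0)    # (T * fps, 3, 32, 32)

def t2v_inference(keyframes, flow):
    """Placeholder for the pre-trained T2V diffusion model: in NeuroClips
    both conditioning signals are injected into the denoising process;
    here we simply return a video with the flow's temporal length."""
    assert keyframes.shape[0] * 8 == flow.shape[0]
    return flow.copy()

fmri = rng.standard_normal((6, 4096))        # 6 s of fMRI, toy voxel count
video = t2v_inference(semantics_reconstructor(fmri),
                      perception_reconstructor(fmri))
print(video.shape)                           # 48 frames = 6 s at 8 FPS
```

The key design point mirrored here is the split: semantics fix *what* appears in each keyframe, while the dense perception flow fixes *how* frames evolve between keyframes, which is what yields the reported smoothness at 8 FPS.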