NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction (2410.19452v3)

Published 25 Oct 2024 in eess.IV, cs.AI, and cs.CV

Abstract: Reconstruction of static visual stimuli from non-invasive fMRI measurements of brain activity has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, research on fMRI-to-video reconstruction remains limited, since decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both the high-level semantics and the low-level perception flows perceived by the brain in response to video stimuli. To this end, we propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI. NeuroClips utilizes a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model injected with both keyframes and low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth, high-fidelity video reconstruction of up to 6 s at 8 FPS, gaining significant improvements over state-of-the-art models on various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at https://github.com/gongzix/NeuroClips.

Summary

  • The paper presents NeuroClips, a novel framework that decodes video from fMRI signals by aligning low-level perceptual flows with high-level semantic data.
  • It employs a dual-component strategy with a Perception Reconstructor for smooth dynamic scene generation and a Semantics Reconstructor for detailed keyframe recovery.
  • Experimental validation shows a 128% improvement in SSIM and an 81% enhancement in spatiotemporal metrics over state-of-the-art models.

NeuroClips: Advancing fMRI-to-Video Reconstruction

The pursuit of decoding visual stimuli from brain activity, particularly with functional magnetic resonance imaging (fMRI), has attracted attention for its potential to illuminate the perceptual and semantic functions of the human brain. While the reconstruction of static images from fMRI signals has seen promising advances, extending this capability to video remains a formidable challenge. The paper "NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction" addresses this challenge by introducing NeuroClips, a framework for high-fidelity, temporally smooth video reconstruction from fMRI data.

Framework and Methodology

NeuroClips bridges the temporal and spatial resolution gap between fMRI signals and video stimuli by decoding both high-level semantic content and low-level perceptual content. The framework consists of two key components: the Perception Reconstructor (PR) and the Semantics Reconstructor (SR).

  1. Perception Reconstructor (PR): This component generates a 'blurry' video that encodes the low-level perceptual flows derived from the fMRI signals. Through modules such as Inception Extension and Temporal Upsampling, the PR aligns fMRI embeddings with successive video frames, capturing motion and dynamic scene structure rather than semantic detail, and thereby ensures the smoothness and temporal consistency of the generated video.
  2. Semantics Reconstructor (SR): In contrast, the SR reconstructs high-quality keyframes that carry the semantic content of the visual stimuli. Using a diffusion prior and contrastive learning, it aligns fMRI-derived embeddings with CLIP's image embedding space, benefiting from the CLIP model's capacity to bridge modalities (a minimal sketch of both components follows this list).
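To make this division of labor concrete, here is a minimal PyTorch-style sketch of the two reconstructors. Everything in it is an illustrative assumption rather than the paper's implementation: the voxel dimension, layer sizes, the simple per-frame linear expansion and trilinear interpolation standing in for the Inception Extension and Temporal Upsampling modules, and the plain MLP projection standing in for the SR's full architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PerceptionReconstructor(nn.Module):
    """Sketch of the PR: fMRI -> a short, 'blurry', temporally smooth frame sequence.

    The per-frame linear expansion and trilinear interpolation below are
    hypothetical stand-ins for the Inception Extension and Temporal
    Upsampling modules described in the paper.
    """

    def __init__(self, fmri_dim=15724, latent_dim=256,
                 coarse_frames=4, out_frames=16, frame_size=32):
        super().__init__()
        self.coarse_frames, self.out_frames, self.frame_size = coarse_frames, out_frames, frame_size
        # One latent per coarse frame, all derived from the same fMRI sample.
        self.to_frame_latents = nn.Linear(fmri_dim, coarse_frames * latent_dim)
        self.decode = nn.Linear(latent_dim, 3 * frame_size * frame_size)

    def forward(self, fmri):                              # fmri: (B, fmri_dim)
        B = fmri.shape[0]
        z = self.to_frame_latents(fmri).view(B, self.coarse_frames, -1)
        frames = self.decode(z).view(B, self.coarse_frames, 3, self.frame_size, self.frame_size)
        # Temporal upsampling: interpolate the coarse frames to the target frame count.
        frames = frames.permute(0, 2, 1, 3, 4)            # (B, 3, T, H, W)
        frames = F.interpolate(frames, size=(self.out_frames, self.frame_size, self.frame_size),
                               mode="trilinear", align_corners=False)
        return frames.permute(0, 2, 1, 3, 4)              # (B, out_frames, 3, H, W)


class SemanticsReconstructor(nn.Module):
    """Sketch of the SR: project fMRI into a CLIP-like image-embedding space."""

    def __init__(self, fmri_dim=15724, clip_dim=768):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(fmri_dim, 2048), nn.GELU(), nn.Linear(2048, clip_dim))

    def forward(self, fmri):
        return F.normalize(self.proj(fmri), dim=-1)       # (B, clip_dim)


def contrastive_alignment_loss(fmri_emb, clip_img_emb, temperature=0.05):
    """Symmetric InfoNCE-style loss aligning fMRI embeddings with CLIP image embeddings."""
    logits = fmri_emb @ clip_img_emb.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In the full method, the SR's aligned embedding is further passed through a diffusion prior and an image decoder to produce the actual keyframe; that stage is omitted from the sketch.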

During inference, NeuroClips injects both the reconstructed keyframes and the low-level perception flows into a pre-trained text-to-video (T2V) diffusion model, which generates the final videos with high fidelity and temporal consistency.
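The overall inference flow can be summarized in a short, hedged sketch; all function and parameter names below (keyframe_decoder, t2v_pipeline, control_frames, and so on) are hypothetical placeholders rather than the paper's or any library's actual API.

```python
import torch

def reconstruct_video(fmri, semantics_reconstructor, perception_reconstructor,
                      keyframe_decoder, t2v_pipeline, caption_model=None):
    """Illustrative inference flow; the callables are hypothetical wrappers.

    1. The SR maps fMRI to a CLIP-space embedding, which a keyframe decoder
       (e.g. a diffusion prior plus image decoder) turns into a keyframe.
    2. The PR maps the same fMRI sample to a blurry, temporally smooth video.
    3. Both are injected into a pre-trained T2V diffusion model: the keyframe
       anchors semantics, the blurry frames constrain low-level motion.
    """
    with torch.no_grad():
        clip_emb = semantics_reconstructor(fmri)            # (B, clip_dim)
        keyframe = keyframe_decoder(clip_emb)                # semantic anchor image
        blurry_video = perception_reconstructor(fmri)        # (B, T, 3, H, W) guidance
        caption = caption_model(keyframe) if caption_model is not None else None
        video = t2v_pipeline(image=keyframe,                 # keyframe injection
                             control_frames=blurry_video,    # perception-flow guidance
                             prompt=caption)                 # optional text conditioning
    return video
```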

Experimental Validation and Results

Extensive evaluations against state-of-the-art (SOTA) models on a publicly available fMRI-video dataset reveal that NeuroClips delivers substantial improvements across various metrics. Notably, NeuroClips achieves a 128% improvement in the Structural Similarity Index Measure (SSIM) and an 81% enhancement in spatiotemporal metrics. Such improvements underscore the efficacy of combining low-level perceptual information with high-level semantic reconstruction in generating coherent and high-fidelity videos from fMRI inputs.
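For readers who want to compute a comparable frame-level metric, a generic SSIM calculation is sketched below (assuming a recent scikit-image); the paper's exact evaluation protocol, including frame sampling and preprocessing, may differ.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def video_ssim(pred_frames, gt_frames):
    """Mean frame-wise SSIM between two videos given as (T, H, W, 3) uint8 arrays."""
    scores = [ssim(p, g, channel_axis=-1, data_range=255)
              for p, g in zip(pred_frames, gt_frames)]
    return float(np.mean(scores))

# A "128% improvement" is relative: a baseline SSIM of, say, 0.17 would become
# roughly 0.17 * 2.28 ≈ 0.39 (numbers purely illustrative).
```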

Implications and Future Directions

The implications of this research are far-reaching, both theoretically and practically. The ability to decode video from fMRI can significantly advance our understanding of human visual perception and cognition, and it points toward applications in neuroimaging, cognitive neuroscience, and, potentially, brain-machine communication interfaces.

Nevertheless, certain limitations persist, particularly in handling cross-scene fMRI recordings and out-of-distribution data. Future research could focus on improving the model's generalizability across diverse datasets and on supporting longer video reconstructions. Improvements in the computational efficiency of video generative models would also broaden the applicability of such frameworks in real-world scenarios.

In summary, NeuroClips represents a significant advancement in the domain of fMRI-to-video decoding. Through sophisticated integration of perceptual flows and semantic learning, it sets a new benchmark for reconstructing videos from brain activity, paving the way for future breakthroughs in the understanding and application of brain-machine interfacing technologies.
