NeuroClips: Towards High-fidelity and Smooth fMRI-to-Video Reconstruction (2410.19452v3)
Abstract: Reconstruction of static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, research on fMRI-to-video reconstruction remains limited, since decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both the high-level semantics and the low-level perception flows that the brain perceives in response to video stimuli. To this end, we propose NeuroClips, an innovative framework to decode high-fidelity and smooth video from fMRI. NeuroClips utilizes a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and employs a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained T2V diffusion model injected with both keyframes and low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth, high-fidelity video reconstruction of up to 6 s at 8 FPS, gaining significant improvements over state-of-the-art models on various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at https://github.com/gongzix/NeuroClips.
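The inference pipeline described above (a semantics path producing keyframes and a perception path producing a smooth low-level flow, both fed to a pre-trained T2V diffusion model) can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the function names, the linear mappings, the voxel dimension, and the placeholder `t2v_inference` are all assumptions standing in for the real CLIP-aligned reconstructors and diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantics_reconstructor(fmri):
    """Hypothetical stand-in: map each fMRI window to a keyframe
    embedding carrying high-level semantics (e.g., a CLIP-like vector)."""
    W_sem = rng.standard_normal((fmri.shape[-1], 768)) * 0.01
    return fmri @ W_sem                      # (T, 768) keyframe embeddings

def perception_reconstructor(fmri, fps=8):
    """Hypothetical stand-in: map fMRI to a coarse 'perception flow' of
    low-resolution frames, upsampled in time to the target frame rate
    so the final video is smooth."""
    W_per = rng.standard_normal((fmri.shape[-1], 3 * 32 * 32)) * 0.01
    frames = (fmri @ W_per).reshape(len(fmri), 3, 32, 32)
    return np.repeat(frames, fps, axis=0)    # (T * fps, 3, 32, 32)

def t2v_inference(keyframes, flow):
    """Placeholder for the pre-trained T2V diffusion model: in NeuroClips
    both conditioning signals are injected into the denoising process;
    here we simply return a video with the flow's temporal length."""
    assert keyframes.shape[0] * 8 == flow.shape[0]
    return flow.copy()

fmri = rng.standard_normal((6, 4096))        # 6 s of fMRI, toy voxel count
video = t2v_inference(semantics_reconstructor(fmri),
                      perception_reconstructor(fmri))
print(video.shape)                           # 48 frames = 6 s at 8 FPS
```

The key design point mirrored here is the split: semantics fix *what* appears in each keyframe, while the dense perception flow fixes *how* frames evolve between keyframes, which is what yields the reported smoothness at 8 FPS.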