SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion

Published 14 Jan 2026 in cs.CV and cs.AI | (2601.09213v1)

Abstract: Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) and the Versatile Diffusion model to generate high-resolution and semantically meaningful image reconstructions from neural spike data. In the first stage, VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, enabling Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data provides superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhances reconstruction performance.

Authors (2)

Summary

The paper introduces SpikeVAEDiff, a framework that reconstructs natural visual scenes from neural spike data.
It employs a two-stage process combining a Very Deep VAE for initial low-resolution mapping and Versatile Diffusion for image refinement.
Experiments on the Allen Visual Coding-Neuropixels dataset indicate that spike data, with its higher temporal and spatial resolution than fMRI, supports detailed and semantically meaningful reconstructions.

SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction

Introduction

The paper "SpikeVAEDiff: Neural Spike-based Natural Visual Scene Reconstruction via VD-VAE and Versatile Diffusion" (2601.09213) presents SpikeVAEDiff, a framework for reconstructing natural visual scenes from neural spike data. The study connects neuroscience and computer vision by using generative models to decode the visual information carried by neural activity. SpikeVAEDiff takes a two-stage approach: a Very Deep Variational Autoencoder (VDVAE) produces an initial low-resolution reconstruction, and the Versatile Diffusion model then refines it. The paper also examines what spike data can offer for this task, given its advantages in temporal and spatial resolution over fMRI signals.

Neural Signals and Reconstruction Framework

Neural Signal Sources

Neuropixels probes (extracellular electrodes) and fMRI record neural activity in different ways, each with its own trade-offs. fMRI offers broader spatial coverage, but its temporal precision is much lower than that of spike recordings, which capture neuronal activity directly and at high temporal resolution. That precision matters for visual decoding and motivates the paper's focus on spike-based reconstruction [waldert2009review].

Generative Models

VDVAEs organize their latent space as a deeper hierarchy than traditional VAEs, which improves the clarity of reconstructed inputs. Generative adversarial networks (GANs) and diffusion models (DMs), such as Latent Diffusion Models (LDMs), are recognized for strong results in high-resolution, context-aware image generation. DMs generate images through iterative denoising, which helps them preserve semantic content. SpikeVAEDiff builds on this foundation through an extension of the LDM [rombach2022high], which suits spike-driven scene reconstruction.
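To make the hierarchical latent structure concrete, the block below writes out the standard factorization used by hierarchical VAEs such as VDVAE, together with the evidence lower bound they optimize. The notation (N latent groups, parameters \theta and \phi) is chosen here for illustration and is not taken from the paper.

```latex
\begin{align}
p_\theta(z) &= p_\theta(z_0) \prod_{i=1}^{N} p_\theta(z_i \mid z_{<i}), \\
q_\phi(z \mid x) &= q_\phi(z_0 \mid x) \prod_{i=1}^{N} q_\phi(z_i \mid z_{<i}, x), \\
\log p_\theta(x) &\ge \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z)\big).
\end{align}
```

Each latent group z_i is conditioned on the groups before it, so coarse image structure can be captured early in the hierarchy and finer detail later.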

Methodology

Stage One: Initial Reconstruction with VDVAE

In the first stage, a regression model is trained to map spike signals to the latent variables that a pre-trained VDVAE extracts from the natural scene stimuli. Decoding the predicted latents with the VDVAE yields a low-resolution initial reconstruction, which serves as the starting point for refinement.
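The spike-to-latent mapping can be sketched as follows. The summary does not say which regression model the paper uses, so ridge regression is an assumption here, and the data is synthetic: `n_units`, `latent_dim`, and the regularization strength `alpha` are hypothetical placeholders rather than values from the paper.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for the weight matrix W.
    gram = Xc.T @ Xc + alpha * np.eye(X.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b

def predict(X, W, b):
    """Map spike features to predicted latent vectors."""
    return X @ W + b

# Synthetic stand-ins (assumption): rows are stimulus presentations,
# columns of X are per-unit spike counts, columns of Z are latent variables.
rng = np.random.default_rng(0)
n_train, n_test, n_units, latent_dim = 200, 20, 50, 16
true_W = rng.normal(size=(n_units, latent_dim))
X_train = rng.poisson(lam=3.0, size=(n_train, n_units)).astype(float)
X_test = rng.poisson(lam=3.0, size=(n_test, n_units)).astype(float)
Z_train = X_train @ true_W + 0.1 * rng.normal(size=(n_train, latent_dim))
Z_test = X_test @ true_W

W, b = fit_ridge(X_train, Z_train, alpha=1.0)
Z_pred = predict(X_test, W, b)
```

In a real pipeline, `Z_pred` would be passed to the pre-trained VDVAE decoder to produce the low-resolution image. That step is omitted because it depends on the specific VDVAE implementation.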

Figure 1: The overall structure of the SpikeVAEDiff pipeline.

Stage Two: Refinement with Versatile Diffusion

The second stage uses Versatile Diffusion, conditioned on multimodal CLIP features predicted from spike data. Regression models map the spikes to CLIP-Vision and CLIP-Text features, and the diffusion model uses them in image-to-image generation to refine the low-resolution initial images into reconstructions with higher-fidelity structure and content [goodfellow2014generative].
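Image-to-image diffusion typically starts from a partially noised version of the initial image rather than from pure noise, and then denoises under the conditioning features. The sketch below shows only that noising step, using the standard DDPM forward process. The linear beta schedule, the `strength` values, and the latent shape are assumptions for illustration; the summary does not state Versatile Diffusion's scheduler or the settings used in the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noise_for_img2img(x0, strength, alpha_bar, rng):
    """Noise x0 to an intermediate timestep chosen by `strength`.

    Returns the noised array, the timestep index, and the factor that
    scales the original signal at that timestep.
    """
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")
    num_steps = len(alpha_bar)
    t_start = max(int(strength * num_steps) - 1, 0)
    eps = rng.standard_normal(x0.shape)
    signal = np.sqrt(alpha_bar[t_start])
    sigma = np.sqrt(1.0 - alpha_bar[t_start])
    # Forward process: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    return signal * x0 + sigma * eps, t_start, signal

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 64, 64))  # hypothetical latent of the stage-one image
x_weak, t_weak, s_weak = noise_for_img2img(x0, 0.25, alpha_bar, rng)
x_strong, t_strong, s_strong = noise_for_img2img(x0, 0.75, alpha_bar, rng)
```

A lower `strength` keeps more of the stage-one layout, while a higher one gives the CLIP-conditioned denoising more freedom to change the image.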

Figure 2: Examples of reconstructions from spike data produced by our model.

Figure 3: Failure cases of reconstructions from spike data produced by our model.

Experimental Findings

Dataset and Brain Region Insights

Using the Allen Visual Coding-Neuropixels dataset, the paper shows that spike activity in response to stimuli differs across brain regions. Spike contributions from key regions such as VISI, which shows the most prominent activation, prove critical for capturing fine visual detail that the coarser fMRI signal does not resolve.

Figure 4: Peristimulus Time Histograms for different brain regions on stimulus.
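For readers unfamiliar with the plot type, a peristimulus time histogram (PSTH) aligns spikes to stimulus onsets, counts them in fixed time bins, and averages across presentations. The sketch below is a minimal NumPy version. The window, the bin size, and the toy spike train are assumptions and do not reflect the paper's settings.

```python
import numpy as np

def psth(spike_times, stim_onsets, window=(-0.1, 0.4), bin_size=0.01):
    """Trial-averaged firing rate (Hz) around stimulus onsets.

    spike_times and stim_onsets are in seconds; window is relative to onset.
    """
    n_bins = int(round((window[1] - window[0]) / bin_size))
    edges = window[0] + bin_size * np.arange(n_bins + 1)
    counts = np.zeros(n_bins)
    for onset in stim_onsets:
        # Align spikes to this onset and count them per bin.
        hist, _ = np.histogram(spike_times - onset, bins=edges)
        counts += hist
    rate = counts / (len(stim_onsets) * bin_size)
    centers = edges[:-1] + bin_size / 2
    return centers, rate

# Toy unit (assumption): exactly one spike 55 ms after each stimulus onset.
stim_onsets = np.array([1.0, 2.0, 3.0])
spike_times = stim_onsets + 0.055
centers, rate = psth(spike_times, stim_onsets)
```

Computing one such curve per region, pooled over that region's units, gives the kind of regional comparison the figure describes.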

Reconstruction Fidelity

SpikeVAEDiff improves the structural integrity and semantic accuracy of reconstructed images, outperforming previous methods [ozcelik2022reconstruction]. However, complex image elements remain difficult to reconstruct, particularly where the result depends on the interplay between background and foreground.
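The summary does not list the metrics behind these claims. As an illustration only, the sketch below implements two measures commonly used in reconstruction work: pixel-wise correlation for low-level structure, and two-way identification over feature vectors for higher-level similarity. The function names and the synthetic images are hypothetical.

```python
import numpy as np

def pixel_correlation(recon, target):
    """Pearson correlation between flattened pixel values of two images."""
    r = recon.reshape(-1).astype(float)
    t = target.reshape(-1).astype(float)
    return float(np.corrcoef(r, t)[0, 1])

def two_way_identification(recon_feats, target_feats):
    """Fraction of pairwise comparisons in which a reconstruction correlates
    more strongly with its own target than with another target."""
    n = recon_feats.shape[0]
    # corr[i, j] is the correlation between reconstruction i and target j.
    corr = np.corrcoef(recon_feats, target_feats)[:n, n:]
    own = np.diag(corr)[:, None]
    wins = (own > corr).sum()  # diagonal entries never count as wins
    return float(wins / (n * (n - 1)))

# Synthetic example (assumption): reconstructions are noisy copies of targets.
rng = np.random.default_rng(0)
targets = rng.random((5, 8, 8, 3))
recons = targets + 0.05 * rng.standard_normal(targets.shape)

pix = pixel_correlation(recons[0], targets[0])
ident = two_way_identification(recons.reshape(5, -1), targets.reshape(5, -1))
```

In practice the feature vectors for two-way identification would come from a pre-trained vision network rather than raw pixels; raw pixels are used here only to keep the example self-contained.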

Discussion

SpikeVAEDiff advances neural decoding by combining spike recordings with state-of-the-art generative models. It shows that neural spikes can support high-fidelity visual reconstruction, which encourages work with neural data beyond the constraints of fMRI. A better understanding of the role of specific brain regions could further refine decoding methods. Future work could explore cross-modality extensions, such as incorporating EEG data, to improve the reconstruction of specific visual features like motion and orientation.

Conclusion

SpikeVAEDiff demonstrates that spike data can be used for high-resolution neural decoding by combining visual neuroscience with advanced generative models. This integration improves the ability to decode and represent neural activity, and it opens avenues for more capable brain-computer interfaces and for further neuroscience research into how visual stimuli can be reconstructed from neural signals.
