
TVSD: Ventral Stream Spiking Dataset

Updated 10 October 2025
  • TVSD is a comprehensive dataset capturing high-density spiking activity from macaque V1, V4, and IT, enabling detailed study of visual encoding.
  • It utilizes ridge regression to decode neural responses into latent visual features, facilitating both semantic and low-level image reconstructions.
  • The framework reveals hierarchical processing and functional clustering in the primate ventral stream, offering insights into spatiotemporal visual perception.

The THINGS Ventral Stream Spiking Dataset (TVSD) is a large-scale neurophysiological resource designed to elucidate the mechanisms by which distributed neural populations in the primate ventral visual stream encode, represent, and reconstruct visual information. TVSD consists of high-density multi-electrode array recordings from macaque monkeys exposed to tens of thousands of natural images drawn from the THINGS database. The dataset’s multiareal coverage (spanning V1, V4, and inferotemporal [IT] cortex), fine temporal granularity, and systematic behavioral protocols underpin a comprehensive computational framework for interpreting both the low-level and semantic aspects of visual perception in the primate brain.

1. Data Acquisition and Structure

TVSD comprises spiking data collected via Utah arrays in an 8×8 configuration, chronically implanted in three key stages of the ventral stream: primary visual cortex (V1), extrastriate area V4, and inferotemporal cortex (IT). In a typical experimental setup, recordings from two example subjects involved 15–16 arrays distributed across these subdivisions (e.g., Monkey N: 7 in V1, 4 in V4, 4 in IT; Monkey F: a similar distribution). Each monkey was exposed to over 22,000 unique images sourced from the THINGS image database, with each stimulus presented for 200 ms followed by a 200 ms inter-stimulus interval. Multiple repetitions of each image ensured data reliability. Binned spike counts were normalized and arranged by electrode and temporal window (typical bin sizes: 25 ms or 40 ms), yielding spatiotemporal population responses amenable to statistical modeling.
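
As a concrete illustration of this preprocessing step, the sketch below bins spike times into fixed windows and normalizes counts per electrode. The function name, input layout, and normalization choice are assumptions made for illustration and do not reflect the dataset's actual file format.

```python
import numpy as np

# Hypothetical input: spike_times_per_electrode[e] is an array of spike times
# (in seconds, relative to stimulus onset) for electrode e on one trial.
def bin_and_normalize(spike_times_per_electrode, window_s=0.2, bin_s=0.025):
    """Bin spikes into fixed windows and z-score the counts per electrode."""
    n_bins = int(round(window_s / bin_s))
    edges = np.linspace(0.0, window_s, n_bins + 1)
    counts = np.stack([
        np.histogram(t, bins=edges)[0] for t in spike_times_per_electrode
    ]).astype(float)                       # shape: (electrodes, time bins)
    # Z-score each electrode's counts (here across time bins; in practice the
    # normalization statistics would be computed across trials).
    mu = counts.mean(axis=1, keepdims=True)
    sd = counts.std(axis=1, keepdims=True) + 1e-8
    return (counts - mu) / sd
```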

2. Linear Decoding to Visual Latent Spaces

The central computational step leverages linear regression models, specifically ridge regression, to decode the high-dimensional binned spiking activity into latent representations that reflect semantic or low-level visual properties. The decoder transforms a vector $s$ (spike counts by electrode and time bin) into a predicted latent vector $\hat{z}$ via:

$$\hat{z} = W\, s + b$$

where $W$ is the learned weight matrix and $b$ is a bias term. The model is trained to minimize the mean squared error with $L_2$ regularization:

$$L(W, b) = \| z - (W\, s + b) \|_2^2 + \lambda \|W\|_2^2$$

The target latent $z$ is obtained by passing the stimulus image through a visual encoder such as the CLIP-Vision transformer (for semantic embeddings) or VDVAE (for low-level features such as texture or contrast). This process allows the extraction of latent descriptions directly from neural population activity for subsequent image reconstruction tasks.
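
The decoding step can be prototyped with an off-the-shelf ridge regression, as in the minimal sketch below. The array shapes, latent dimensionality, and regularization strength are placeholder assumptions; in practice the targets $z$ would come from CLIP-Vision or VDVAE and $\lambda$ would be chosen by cross-validation.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy data standing in for TVSD: S holds flattened spike counts
# (n_images, n_electrodes * n_time_bins); Z holds target latents
# (n_images, latent_dim) from a pretrained image encoder.
rng = np.random.default_rng(0)
S = rng.poisson(3.0, size=(1000, 960)).astype(float)
Z = rng.normal(size=(1000, 768))

decoder = Ridge(alpha=1.0)   # alpha plays the role of lambda in the loss above
decoder.fit(S, Z)            # learns W (decoder.coef_) and b (decoder.intercept_)

z_hat = decoder.predict(S[:5])   # decoded latents for five example trials
```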

3. Generative Reconstruction and Decoding Pipeline

Predicted latent embeddings are used as conditioning variables for state-of-the-art generative models. Semantic reconstructions employ the unCLIP model, which accepts two conditioning latent spaces: a VAE-derived latent (encoding basic image structure) and a CLIP latent (providing semantic guidance). The unCLIP diffusion process reconstructs images that preserve both high-level categorical and low-level textural features present in the neural activity. Analogous pipelines decode neural activity into VDVAE or PCA spaces, yielding reconstructions that emphasize alternative aspects: VDVAE reconstructions prioritize texture, whereas PCA-based reconstructions tend to emphasize luminance and global image statistics.
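
A minimal sketch of this pipeline is given below, assuming two fitted linear decoders (one targeting CLIP latents, one targeting VAE/VDVAE latents). The `unclip_generate` function is a hypothetical placeholder for an unCLIP-style diffusion decoder, not an actual library call.

```python
import numpy as np

def unclip_generate(clip_latent, vae_latent):
    """Hypothetical stand-in for an unCLIP-style diffusion decoder that
    conditions on a CLIP embedding (semantics) and a VAE latent (structure)."""
    raise NotImplementedError("plug in a pretrained unCLIP / diffusion decoder here")

def reconstruct(spikes, clip_decoder, vae_decoder):
    """Map a single trial's flattened spike vector to a reconstructed image."""
    clip_latent = clip_decoder.predict(spikes[None])[0]   # semantic guidance
    vae_latent = vae_decoder.predict(spikes[None])[0]     # coarse image structure
    return unclip_generate(clip_latent, vae_latent)
```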

4. Encoding Models and Neural Preference Visualization

TVSD also facilitates encoding: mapping visual latents back to the spiking domain to characterize the neural correlates of specific image features. An encoding model projects latent vectors $z$ to predicted spike patterns $\hat{s}$ as

$$\hat{s} = V\, z + c$$

where $V$ is the encoding weight matrix and $c$ is a bias. The encoding model reveals how spatiotemporal neural firing patterns covary with particular visual features. Crucially, the encoding weights for individual electrodes/time bins (which share the dimensionality of the CLIP latent vectors) are interpreted as signatures of the preferred visual features for those neural ensembles. Decoding these weights using unCLIP yields “preferred stimuli”: idealized images that maximally activate specific electrodes at given time points. Spatiotemporal mapping of these preferred stimuli delineates how neural preference evolves, with posterior IT showing early selectivity for simple features (geometric patterns, 75–100 ms) and anterior IT manifesting selectivity for complex categorical entities (faces, 125–150 ms).
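
A sketch of the encoding direction, under the same toy assumptions as the decoding example above: the rows of the fitted weight matrix live in the latent space and can be read as per-electrode/time-bin feature signatures, which the pipeline then visualizes via unCLIP.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Toy stand-ins: Z holds image latents (n_images, latent_dim), S holds
# spike counts (n_images, n_electrodes * n_time_bins).
rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 768))
S = rng.poisson(3.0, size=(1000, 960)).astype(float)

encoder = Ridge(alpha=1.0).fit(Z, S)   # fits s_hat = V z + c
V = encoder.coef_                      # shape: (n_electrode_bins, latent_dim)

# Each row of V shares the latent dimensionality, so a unit-normalized row
# can be treated as that electrode/time bin's preferred-feature signature.
preferred_signature = V[0] / (np.linalg.norm(V[0]) + 1e-8)
```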

5. Functional Clustering and Hierarchical Organization

Weight-similarity analysis via hierarchical clustering (based on cosine similarity of encoding or decoding vectors) uncovers functionally coherent clusters of electrodes with shared tuning characteristics. These clusters align with categorical divisions in the stimulus domain (e.g., ensembles preferentially responsive to “animal” versus “food” images, or to “textured” versus “smooth” stimuli). The hierarchical anatomical organization of ventral stream responses is evident: V1/V4 and posterior IT encode simple, low-level features in earlier temporal bins, while anterior IT progressively encodes higher-level, semantic content over later time windows. Temporal dynamics mapped at fine granularity (25–40 ms bins) reflect the evolution of selectivity and neural clustering across both time and cortical territory, summarized in the table below; a minimal clustering sketch follows it.

Region | Temporal Selectivity | Feature Preference
--- | --- | ---
V1/V4 | Early (25–75 ms) | Color, texture, pattern
Posterior IT | Early–mid (75–100 ms) | Geometric forms
Anterior IT | Late (125–150 ms) | Faces, semantic objects
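
A minimal version of this clustering analysis, assuming per-electrode/time-bin weight vectors such as the rows of the fitted encoding matrix (toy random data here), could look like:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# V: per-electrode/time-bin weight vectors in the latent space (toy data).
rng = np.random.default_rng(0)
V = rng.normal(size=(200, 768))

# Pairwise cosine distances, agglomerative clustering, and a flat cut into
# a chosen number of functional clusters (8 is an arbitrary example).
D = pdist(V, metric="cosine")
tree = linkage(D, method="average")
labels = fcluster(tree, t=8, criterion="maxclust")
```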

6. Evaluation Metrics and Comparative Analysis

Reconstruction performance is quantitatively evaluated using image-wise similarity metrics. Pixel Correlation (PixCorr) is calculated as the Pearson correlation coefficient between vectorized reconstructed and ground-truth images:

$$\text{PixCorr} = \frac{\operatorname{cov}\big(\operatorname{vec}(\text{recon}),\, \operatorname{vec}(\text{GT})\big)}{\sigma_{\text{recon}}\, \sigma_{\text{GT}}}$$

Structural Similarity (SSIM) is employed following the luminance, contrast, and structure-based approach of Wang et al. (2004). Furthermore, comparison scores using pretrained image classification and embedding networks (AlexNet, InceptionNet, CLIP, EfficientNet, SwAV) quantitatively validate that decoded image representations preserve key perceptual attributes across both low- and high-level feature domains. The decoding and encoding models yield consistent performance across individuals and experimental runs, supporting the generalizability of the linear model-based approach.
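
Both metrics are straightforward to compute with standard tooling; the sketch below uses toy images, with scikit-image's SSIM implementation standing in for the Wang et al. (2004) formulation.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Toy grayscale images in [0, 1]; real evaluation would compare decoded
# reconstructions against the corresponding THINGS stimuli.
rng = np.random.default_rng(0)
gt = rng.random((128, 128))
recon = np.clip(gt + 0.1 * rng.normal(size=gt.shape), 0.0, 1.0)

# PixCorr: Pearson correlation between the vectorized images.
pixcorr = np.corrcoef(recon.ravel(), gt.ravel())[0, 1]

# SSIM (Wang et al., 2004); data_range matches the image value range.
ssim_score = ssim(recon, gt, data_range=1.0)

print(f"PixCorr = {pixcorr:.3f}, SSIM = {ssim_score:.3f}")
```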

7. Implications and Research Trajectory

TVSD advances the capacity to reconstruct perceptual experience from large-scale neural recordings, bridging the gap between neural population activity and subjective visual representation. It provides direct evidence of the hierarchical and feature-specific organization of the primate ventral stream, reveals spatial and temporal maps of evolving neural preference, and demonstrates the efficacy of linear encoding/decoding models integrated with generative image models. A plausible implication is the utility of TVSD for benchmarking future machine learning approaches aimed at linking neuronal patterns with perception, and for probing the biophysical basis of semantic and categorical representation in visual cortex. The use of thousands of distinct, naturalistic images positions this dataset as foundational for future cross-modal neural decoding and for the development of interpretable brain-computer interface paradigms in vision neuroscience.
