Zero-Shot Brain-to-Image Retrieval
- Zero-shot brain-to-image retrieval is a method that maps unseen neural signals to images using shared embedding spaces, enabling retrieval from large candidate galleries.
- It leverages techniques such as cognitive priors, uncertainty-aware blur, and sparse translators to mitigate data scarcity and bridge cross-modal semantic gaps.
- The approach utilizes cross-modal alignment and adversarial domain adaptation to achieve generalization across subjects, recording modalities, and unseen semantic categories.
Zero-shot brain-to-image retrieval refers to the task of recovering or matching natural images corresponding to a previously unseen brain signal (EEG, fMRI, or MEG), drawn from an unseen semantic category, without using class-specific supervision for those images during training. This paradigm operationalizes a stringent form of “neural decoding” that emphasizes generalization—models must map high-dimensional, noisy brain measurements into a common representational space (typically CLIP or other vision-language embeddings), enabling direct nearest-neighbor search among large external image galleries. Recent advances address key challenges of data scarcity, inherent cross-modal gaps, and individual variability through a combination of self-supervised learning, bridge embeddings, uncertainty modeling, adversarial domain adaptation, and sparse-translator theory.
1. Problem Formulation and Motivation
In zero-shot brain-to-image retrieval, the goal is to infer the identity of a visual stimulus from held-out neural data, such that during training, the mapping from brain signals to images (or their features) is never exposed to the target class (Liu et al., 2023, Zhang et al., 10 Nov 2025). Formally, given a test-time brain signal $x$ (e.g., an EEG trial or an fMRI activation pattern), the system must retrieve the associated image from a novel, large candidate pool $\mathcal{G}$, none of whose labels occurred during model learning.
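With a brain encoder $f_\theta$ and a frozen image encoder $g$ (notation introduced here for concreteness; the individual papers' symbols vary), the retrieval rule amounts to a nearest-neighbor search over the gallery:

$$\hat{I} = \arg\max_{I \in \mathcal{G}} \cos\big(f_\theta(x),\, g(I)\big)$$

where $\cos(\cdot,\cdot)$ denotes cosine similarity in the shared embedding space.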
This framework directly addresses two scientific and engineering obstacles:
- Cross-modal semantic gap: Biological neural signals (EEG, fMRI) encode rich but ambiguous, noisy information; supervised data for their alignment with visual features is limited, especially per-class.
- Generalization requirement: Zero-shot means models must avoid trivial memorization or label overfitting, instead learning representations and mappings that transfer across broad visual and neurocognitive domains (Otsuka et al., 19 Sep 2025).
The relevance of this paradigm extends to cognitive neuroscience, BCI, and semantic-level visual experience quantification.
2. Core Approaches and Architectures
Zero-shot retrieval pipelines are typically composed of three core components: (1) neural encoders that map brain signals into a shared feature space; (2) visual feature extractors (often CLIP encoders) used for index/gallery construction; and (3) cross-modal alignment strategies. The following table summarizes key frameworks and their distinctive innovations.
| Framework | Neural Modality | Core Alignment Method | Notable Innovations |
|---|---|---|---|
| NeuroBridge (Zhang et al., 10 Nov 2025) | EEG | Bidirectional InfoNCE | Cognitive Prior Augmentation (CPA); Shared Semantic Projector (SSP) |
| UBP (Wu et al., 6 Mar 2025) | EEG/MEG | Symmetric contrastive | Uncertainty-aware, foveated blur prior |
| BrainCLIP (Liu et al., 2023) | fMRI | (Image, Text)-contrastive | CLIP as “pivot” embedding; hybrid visual/textual targets |
| ZEBRA (Wang et al., 31 Oct 2025) | fMRI | Adversarial disentanglement | Subject-invariant semantic factors; cross-subject generalization |
| Sparse Translator (Otsuka et al., 19 Sep 2025) | fMRI, EEG | Ridge w/ variable selection | Sparse regression to avoid output-dimension collapse |
Processing and Alignment Architecture
- Feature Extraction: Most methods employ neural encoders for brain data (CNN, MLP, ViT depending on EEG/fMRI) and freeze a pretrained vision model (usually CLIP-based).
- Projection: Learned mappings project the encoded brain and image features into a common space of moderate dimension (typically the CLIP embedding width).
- Similarity Computation: Cosine or Euclidean distance is used for gallery retrieval.
- Alignment Objective: Contrastive InfoNCE (Zhang et al., 10 Nov 2025, Liu et al., 2023), symmetric cross-entropy (Wu et al., 6 Mar 2025), and domain-adversarial losses (Wang et al., 31 Oct 2025) drive semantic alignment.
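As a concrete reference, below is a minimal PyTorch sketch of a bidirectional InfoNCE objective over a batch of paired brain/image embeddings; the function name and temperature value are illustrative, not taken from any specific paper's code.

```python
import torch
import torch.nn.functional as F

def bidirectional_infonce(brain_emb: torch.Tensor,
                          image_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of paired embeddings.

    brain_emb, image_emb: (B, D) projections into the shared space.
    Matched pairs share the same batch index; all other pairs serve as negatives.
    """
    brain_emb = F.normalize(brain_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = brain_emb @ image_emb.t() / temperature   # (B, B) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # brain-to-image and image-to-brain cross-entropy, averaged
    loss_b2i = F.cross_entropy(logits, targets)
    loss_i2b = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_b2i + loss_i2b)
```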
3. Techniques for Bridging Cross-Modal Gaps
3.1 Cognitive Priors and Data Augmentation (CPA)
NeuroBridge introduces CPA, simulating human perceptual variability by augmenting both EEG (temporal smoothing, channel dropout) and images (blur, low resolution, mosaic) per training pair. This increases effective sample diversity and brings feature invariances closer to the neurobiological regime (Zhang et al., 10 Nov 2025). Ablation shows that removing CPA from either modality sharply reduces retrieval accuracy (e.g., Top-1 drops from 63.2% to 40.8% in the EEG-only ablation).
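A hedged sketch of CPA-style augmentation is shown below; the kernel size, dropout probability, and resize resolutions are illustrative assumptions, not the paper's exact parameters.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms as T

def augment_eeg(eeg: torch.Tensor, p_drop: float = 0.1, k: int = 5) -> torch.Tensor:
    """eeg: (channels, time). Moving-average temporal smoothing + random channel dropout."""
    kernel = torch.ones(1, 1, k) / k
    smoothed = F.conv1d(eeg.unsqueeze(1), kernel, padding=k // 2).squeeze(1)
    keep = (torch.rand(eeg.size(0), 1) > p_drop).float()  # zero out dropped channels
    return smoothed * keep

# Image-side priors: perceptual blur plus a low-resolution round trip
image_aug = T.Compose([
    T.GaussianBlur(kernel_size=9, sigma=(0.5, 2.0)),
    T.Resize(112),   # downsample to simulate low-resolution perception
    T.Resize(224),   # upsample back to the encoder's input size
])
```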
3.2 Uncertainty-Aware Blur Priors (UBP)
UBP dynamically estimates the “semantic uncertainty” of each brain-image pair within a minibatch (measured via cosine similarity between their embeddings), then applies an adaptive foveated blur to the image input as a function of the measured uncertainty (Wu et al., 6 Mar 2025). This specifically targets two sources of misalignment:
- System GAP: Irreversible perceptual loss (achromatic, foveated, or spatially filtered details).
- Random GAP: Stochastic cognitive/noise perturbations.
UBP reduces the impact of noise and overfitting, yielding strong gains (Top-1 improves from 37.2% to 50.9% on THINGS-EEG).
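The adaptive mechanism can be illustrated with the simplified sketch below, which applies a uniform Gaussian blur scaled by per-pair uncertainty; the actual UBP uses a foveated, spatially varying blur, and the similarity-to-sigma schedule here is an assumption.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF

def uncertainty_blur(images: torch.Tensor,
                     brain_emb: torch.Tensor,
                     image_emb: torch.Tensor,
                     max_sigma: float = 3.0) -> torch.Tensor:
    """images: (B, 3, H, W); embeddings: (B, D). Low-confidence pairs get stronger blur."""
    sim = F.cosine_similarity(brain_emb, image_emb, dim=-1)  # (B,), in [-1, 1]
    uncertainty = (1.0 - sim).clamp(0.0, 2.0) / 2.0          # map to [0, 1]
    out = []
    for img, u in zip(images, uncertainty):
        sigma = max(float(u) * max_sigma, 1e-3)  # gaussian_blur requires sigma > 0
        out.append(TF.gaussian_blur(img, kernel_size=9, sigma=sigma))
    return torch.stack(out)
```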
3.3 Sparse Translators for Small-Data Regimes
Naïve regression from brain activity to a latent space fails catastrophically when the sample count $n$ is smaller than the output dimension $d$, leading to “output dimension collapse” (Otsuka et al., 19 Sep 2025). Theoretical work clarifies that only variable-selection-based sparse translators, where each output feature is regressed from a modest subset of brain predictors, avoid this failure mode, with prediction error governed by the data scale $n$ and the sparsity level $s$. This provides rigorous guidance for data-efficient, generalizable retrieval.
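Below is a minimal sketch of the variable-selection principle, using correlation screening followed by per-dimension ridge regression; the screening rule, function names, and hyperparameters are illustrative, not the paper's exact estimator.

```python
import numpy as np
from sklearn.linear_model import Ridge

def fit_sparse_translator(X: np.ndarray, Y: np.ndarray, s: int = 50, alpha: float = 1.0):
    """X: (n, p) brain features; Y: (n, d) target embeddings. Returns per-dimension models."""
    # Correlation screening: rank predictors separately for each output dimension
    Xz = (X - X.mean(0)) / (X.std(0) + 1e-8)
    Yz = (Y - Y.mean(0)) / (Y.std(0) + 1e-8)
    corr = np.abs(Xz.T @ Yz) / X.shape[0]           # (p, d) absolute correlations
    models = []
    for j in range(Y.shape[1]):
        idx = np.argsort(corr[:, j])[-s:]           # top-s predictors for output dim j
        reg = Ridge(alpha=alpha).fit(X[:, idx], Y[:, j])
        models.append((idx, reg))
    return models

def predict_sparse(models, X_new: np.ndarray) -> np.ndarray:
    return np.column_stack([reg.predict(X_new[:, idx]) for idx, reg in models])
```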
3.4 Domain-Invariant and Semantic Disentanglement
ZEBRA isolates subject-invariant and semantic components within the learned fMRI representations via adversarial training, ensuring that the semantic subspace aligned to CLIP is invariant across subjects (Wang et al., 31 Oct 2025). The fMRI latent is decomposed into a subject-invariant semantic component and a subject-specific component: the invariant component is aligned to CLIP with a diffusion prior, and its residual subject information is suppressed via a gradient reversal layer driven by subject identifiers. This domain adaptation critically enables cross-subject transfer in zero-shot scenarios.
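A compact sketch of the gradient-reversal mechanism follows; the latent width and number of subjects are hypothetical, and this shows only the adversarial branch, not ZEBRA's full architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lam)

# Usage inside a training step: the subject head tries to identify the subject
# from the semantic latent; reversed gradients push the encoder to remove
# subject-identifying information from that latent.
subject_head = nn.Linear(512, 8)  # latent width 512, 8 subjects: hypothetical sizes
def adversarial_loss(z_sem: torch.Tensor, subject_ids: torch.Tensor) -> torch.Tensor:
    logits = subject_head(grad_reverse(z_sem))
    return nn.functional.cross_entropy(logits, subject_ids)
```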
4. Training, Evaluation Protocols, and Quantitative Results
4.1 Training Regimes
- Data Splits: Training typically uses 1,654 concept classes (THINGS-EEG; analogous held-out splits are used for NSD), each with several stimulus repeats; 200 classes are reserved for zero-shot, completely held-out retrieval (Zhang et al., 10 Nov 2025, Wu et al., 6 Mar 2025).
- Losses: Contrastive InfoNCE (bidirectional or symmetric cross-entropy) computed across brain-image pairs within a batch.
4.2 Zero-Shot Retrieval Protocol
- For each test-time brain signal, its embedding is computed and compared (by cosine similarity) to the embeddings of all candidates in the gallery (e.g., 200 images); see the sketch after this list.
- Performance metric: Top-1 and Top-5 accuracy, i.e., the fraction of queries for which the true image appears first or within the top five ranked.
- Ablation: Evaluation also monitors the effect of omitting augmentation/prior components, and of varying encoder architecture or normalization asymmetries.
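A minimal sketch of this retrieval and scoring protocol, assuming query i is paired with gallery item i (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def topk_accuracy(query_emb: torch.Tensor,
                  gallery_emb: torch.Tensor,
                  ks=(1, 5)) -> dict:
    """query_emb: (Q, D) brain embeddings; gallery_emb: (G, D) image embeddings."""
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = q @ g.t()                                        # (Q, G) cosine similarities
    ranks = sims.argsort(dim=-1, descending=True)           # gallery indices, best first
    truth = torch.arange(q.size(0), device=sims.device).unsqueeze(1)
    return {f"top{k}": (ranks[:, :k] == truth).any(-1).float().mean().item()
            for k in ks}
```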
4.3 Notable Results
| Method | Modality | Test Regime | Top-1 (%) | Top-5 (%) | Dataset |
|---|---|---|---|---|---|
| UBP (Wu et al., 6 Mar 2025) | EEG | Intra-subject | 50.9 | 79.7 | THINGS-EEG |
| NeuroBridge (Zhang et al., 10 Nov 2025) | EEG | Intra-subject | 63.2 | 89.9 | THINGS-EEG |
| NeuroBridge (Zhang et al., 10 Nov 2025) | EEG | Inter-subject | 19.0 | 45.9 | THINGS-EEG |
| ZEBRA (Wang et al., 31 Oct 2025) | fMRI | Zero-shot (cross-subject) | -- | 81.2 | NSD |
| BrainCLIP (Liu et al., 2023) | fMRI | Zero-shot (NSD, 982 candidates) | 27.5 | 57.1 | NSD |
These results indicate that advanced augmentation (CPA, UBP) and disentanglement (ZEBRA) yield substantially improved zero-shot performance compared to previous contrastive or ridge baselines.
5. Methodological Advances and Empirical Insights
5.1 Ablations and Sensitivity
- CPA: Removing image or EEG priors in NeuroBridge sharply reduces Top-1 retrieval, indicating both augmentations are essential (Zhang et al., 10 Nov 2025).
- UBP: Dynamic blur outperforms static or no blur; using uncertainty to “gate” input fidelity prevents noisy trials from corrupting alignment (Wu et al., 6 Mar 2025).
- Normalization: Asymmetric normalization (applied to image embeddings only) yields higher accuracy than either symmetric or EEG-only normalization (Zhang et al., 10 Nov 2025).
- Sparse Regression: At low data scales (sample count below the output dimension), only sparse translators avoid output dimension collapse and retain zero-shot generalization (Otsuka et al., 19 Sep 2025).
5.2 Modality-Specific Factors
- CLIP-based models remain robust across a range of encoders, with ViT and ResNet backbones showing similar patterns (Wu et al., 6 Mar 2025, Liu et al., 2023).
- EEG smoothing emerges as the most beneficial augmentation among possible signal corruptions, likely reflecting denoising at low SNR (Zhang et al., 10 Nov 2025).
- Cross-modal contrastive alignment, combining both image and caption supervision, improves semantic transfer at some expense to low-level image matching (Liu et al., 2023).
6. Generalization Across Subjects, Modalities, and Tasks
ZEBRA’s adversarial factorization allows, for the first time, zero-shot cross-subject fMRI-to-image retrieval without fine-tuning, by ensuring subject-invariant semantic representations (Wang et al., 31 Oct 2025). In EEG, both intra-subject and inter-subject (leave-one-out) settings are evaluated; accuracy still degrades substantially under cross-subject transfer, though approaches such as UBP and NeuroBridge narrow the gap. Extensions to MEG (with UBP) demonstrate that methods developed for EEG generalize with only minor modifications to channel selection and trial averaging (Wu et al., 6 Mar 2025). A plausible implication is that advances in representation disentanglement and signal-invariant features will further reduce these modality and individual differences.
7. Open Challenges and Prospects
Zero-shot brain-to-image retrieval research faces ongoing challenges in aligning the granularity, robustness, and universality of neural-to-visual mappings:
- Data scarcity: Empirical and theoretical work maintains that sparse regression and feature selection are necessary at low sample sizes. Improving brain-to-feature mapping with minimal supervision remains a critical area (Otsuka et al., 19 Sep 2025).
- Semantic/subject disentanglement: ZEBRA demonstrates cross-subject generalization in fMRI, but adaptation-free transfer for EEG and other less spatially resolved modalities remains difficult.
- Modality-agnostic transfer: While CLIP-based “pivot” spaces are now standard, bridging transcriptomics, text, and other task signals with brain imaging remains a largely open challenge (Liu et al., 2023).
- Evaluation standards: Most work uses Top-1 and Top-5 accuracy; additional metrics (mAP, mean reciprocal rank, feature-wise correlations) may be needed for capturing semantic and perceptual fidelity (Otsuka et al., 19 Sep 2025, Wang et al., 31 Oct 2025).
Promising research avenues include advanced self-supervised alignment, biologically-inspired augmentation tailored to neural noise properties, universal multi-subject training schemes, and direct integration of generative priors for image synthesis and open-ended concept retrieval.
References:
- NeuroBridge (Zhang et al., 10 Nov 2025)
- Bridging the Vision-Brain Gap with an Uncertainty-Aware Blur Prior (Wu et al., 6 Mar 2025)
- Overcoming Output Dimension Collapse: How Sparsity Enables Zero-shot Brain-to-Image Reconstruction at Small Data Scales (Otsuka et al., 19 Sep 2025)
- BrainCLIP: Bridging Brain and Visual-Linguistic Representation Via CLIP (Liu et al., 2023)
- ZEBRA: Towards Zero-Shot Cross-Subject Generalization for Universal Brain Visual Decoding (Wang et al., 31 Oct 2025)