CATVis EEG2IMAGE Framework

Updated 15 September 2025
  • The paper introduces CATVis, a five-stage framework that decodes visual representations from noisy EEG signals using a supervised Conformer encoder and cross-modal contrastive learning.
  • The method enhances concept classification by 13.43% and leverages retrieval-based caption re-ranking to refine context-aware outputs for improved image synthesis.
  • Through weighted semantic interpolation and a pre-trained Stable Diffusion model, the approach achieves high-fidelity, semantically faithful image generation with reduced FID and improved Inception Score.

The EEG2IMAGE framework referred to as CATVis (“Context-Aware Thought Visualization”) is a five-stage methodology developed to decode visual representations from human electroencephalogram (EEG) signals. Its design prioritizes accurate concept prediction, cross-modal semantic alignment, context-aware captioning, and high-fidelity image synthesis, leveraging both neural and linguistic information via large pre-trained models. CATVis advances EEG-to-image generation by explicitly integrating classifier-driven concept prediction, cross-modal contrastive learning, flexible semantic conditioning, and re-ranking strategies within a modern generative pipeline.

1. Spatio-Temporal Neural Encoding for Concept Classification

The CATVis framework begins with a supervised Conformer-based EEG encoder, specifically designed to address the complex, noisy, and distributed nature of EEG signals elicited by visual stimuli. The architecture consists of:

  • Spatio-temporal convolutional blocks: An initial temporal convolution layer (e.g., 40 kernels of size [1, 25]) is applied across each of the 128 EEG channels to extract localized temporal features. Spatial convolution kernels ([channels, 1]) then operate to capture inter-channel (spatial) dependencies, highlighting topographical patterns relevant to stimulus encoding.
  • Stacked Transformer encoder: The feature maps from convolutional processing are passed into a multi-layer Transformer module. This component models long-range dependencies, incorporating self-attention across all time positions, which is critical for aggregating discriminative evidence over distributed neural representations.
  • Classification head: A fully connected layer at the output stage predicts stimulus class (e.g., among 40 visual concepts).

Compared to earlier LSTM- or shallow CNN-based encoders, this combination enables robust extraction of both localized and distributed neural codes. The experimental results report a 13.43% improvement in classification accuracy over previous methods, establishing strong discriminability at the concept level.
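
A minimal PyTorch sketch of such an encoder is given below. The 128-channel input and the 40 temporal kernels of size [1, 25] follow the description above, while the Transformer depth, embedding width, pooling, and other hyperparameters are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class EEGConformerEncoder(nn.Module):
    """Sketch of a Conformer-style EEG encoder: temporal conv -> spatial conv
    -> Transformer encoder -> classification head. Depth/width are assumptions."""
    def __init__(self, n_channels=128, n_classes=40, n_kernels=40,
                 d_model=40, n_layers=4, n_heads=4):
        super().__init__()
        # Temporal convolution: 40 kernels of size [1, 25] applied along the time axis
        self.temporal_conv = nn.Conv2d(1, n_kernels, kernel_size=(1, 25), padding=(0, 12))
        # Spatial convolution: kernels spanning all 128 channels to capture inter-channel structure
        self.spatial_conv = nn.Conv2d(n_kernels, d_model, kernel_size=(n_channels, 1))
        self.bn = nn.BatchNorm2d(d_model)
        self.act = nn.ELU()
        self.pool = nn.AvgPool2d(kernel_size=(1, 4))   # temporal downsampling (assumption)
        # Transformer encoder with self-attention over the remaining time positions
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        # Classification head over the 40 visual concepts
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, x):                              # x: [B, 128 channels, T samples]
        x = x.unsqueeze(1)                             # [B, 1, 128, T]
        x = self.temporal_conv(x)                      # [B, 40, 128, T]
        x = self.act(self.bn(self.spatial_conv(x)))    # [B, d_model, 1, T]
        x = self.pool(x).squeeze(2).transpose(1, 2)    # [B, T', d_model]
        x = self.transformer(x)                        # long-range temporal dependencies
        feats = x.mean(dim=1)                          # aggregate over time positions
        return self.classifier(feats), feats           # class logits and pooled features
```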

2. Cross-Modal Alignment in the CLIP Feature Space

CATVis employs a cross-modal contrastive learning regime to map EEG-derived embeddings and linguistic embeddings into a unified 768-dimensional CLIP space:

  • The Conformer encoder’s classification layer is replaced with a linear projection head followed by L2 normalization to generate EEG embeddings directly comparable with CLIP text features.
  • Text embeddings are obtained using CLIP’s pre-trained ViT-L/14 text encoder.
  • The contrastive InfoNCE loss is used, symmetrically applied in both EEG→Text and Text→EEG directions:

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{\text{EEG}\rightarrow\text{Text}} + \mathcal{L}_{\text{Text}\rightarrow\text{EEG}}\right)$$

Where:

$$\mathcal{L}_{\text{EEG}\rightarrow\text{Text}} = -\sum_{i=1}^{B} \log\left(\frac{\exp\left(e^{\text{EEG}}_i \cdot e^{\text{Text}}_i / \tau\right)}{\sum_{j=1}^{B} \exp\left(e^{\text{EEG}}_i \cdot e^{\text{Text}}_j / \tau\right)}\right)$$

with $e^{\text{EEG}}_i$ and $e^{\text{Text}}_i$ denoting the respective normalized embeddings, $\tau$ the temperature, and $B$ the batch size.
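
A compact PyTorch sketch of this symmetric objective is shown below, assuming L2-normalized 768-dimensional EEG and text embeddings and a fixed temperature; the function name and default temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(eeg_emb, text_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of paired EEG/text embeddings.
    eeg_emb, text_emb: [B, 768], assumed L2-normalized; tau: temperature."""
    logits = eeg_emb @ text_emb.t() / tau                 # [B, B] scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # reduction="sum" matches the summed form of the loss above; a batch mean is common in practice
    loss_eeg_to_text = F.cross_entropy(logits, targets, reduction="sum")      # EEG -> Text
    loss_text_to_eeg = F.cross_entropy(logits.t(), targets, reduction="sum")  # Text -> EEG
    return 0.5 * (loss_eeg_to_text + loss_text_to_eeg)
```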

This contrastive alignment ensures the learned EEG representations are semantically consistent with the textual descriptions of visual stimuli, providing a critical bridge for subsequent text-guided generation and retrieval operations.

3. Caption Refinement via Retrieval and Re-Ranking

To address the limited granularity of pure neural class prediction, CATVis introduces a two-stage contextual captioning process:

  • Initial caption retrieval: The EEG embedding is used as a query to retrieve the top-$k$ most similar textual caption embeddings (from a pre-encoded pool) using cosine similarity. Formally,

$$\text{Top-}k\left\{\arg\max_{x \in \mathcal{X}} \cos\left(e^{\text{EEG}}, e^{\text{Text}}_x\right)\right\}$$

  • Class-guided re-ranking: Candidate captions are re-ranked according to their CLIP-space similarity with the predicted class embedding from the EEG classifier, ensuring the final caption is both contextually informative and semantically anchored to the primary concept.

Empirical results show this refinement pipeline improves generation accuracy by nearly 7% and reduces Fréchet Inception Distance (FID), reflecting higher-quality and more contextually faithful image outputs.
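
The two-stage retrieval and re-ranking procedure can be sketched in PyTorch as follows; the tensor shapes, the value of $k$, and the helper name are illustrative assumptions, not the paper's implementation details.

```python
import torch

def retrieve_and_rerank(eeg_emb, caption_embs, class_emb, k=5):
    """Top-k caption retrieval by cosine similarity to the EEG embedding,
    followed by class-guided re-ranking in CLIP space.
    eeg_emb: [768], caption_embs: [N, 768], class_emb: [768]; all assumed L2-normalized."""
    # Stage 1: retrieve the k captions closest to the EEG embedding
    sims = caption_embs @ eeg_emb                  # [N] cosine similarities
    topk_sims, topk_idx = sims.topk(k)
    candidates = caption_embs[topk_idx]            # [k, 768]
    # Stage 2: re-rank candidates by similarity to the predicted class embedding
    class_sims = candidates @ class_emb            # [k]
    best = topk_idx[class_sims.argmax()]
    return best                                    # index of the selected caption
```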

4. Weighted Semantic Interpolation for Conditioning

To balance object identity and contextual detail, CATVis computes a weighted interpolation of the (predicted) class embedding and the contextually-refined caption embedding, both in CLIP space:

  • Let $e^{(\text{class})} \in \mathbb{R}^{768}$ be the predicted class embedding (from the EEG classifier), and $e^{(\text{text})} \in \mathbb{R}^{768}$ be the top-ranked caption embedding.
  • The final conditioning vector is:

$$z = \lambda \cdot e^{(\text{class})} + (1 - \lambda) \cdot e^{(\text{text})}$$

  • $\lambda$ is sampled from a $\text{Beta}(10,10)$ prior (yielding a distribution peaked at 0.5), ensuring neither source dominates and maximizing diversity in semantic content.

This approach enables dynamic control over the object-context blend supplied to the generator and empirically produces higher semantic fidelity and expressive diversity in visual outputs.
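
A short sketch of this interpolation step, assuming pre-computed CLIP-space embeddings, is shown below; the function name is hypothetical.

```python
import torch

def interpolate_condition(class_emb, text_emb, alpha=10.0, beta=10.0):
    """Weighted semantic interpolation of class and caption embeddings in CLIP space.
    lambda is drawn from a Beta(10, 10) prior, which concentrates around 0.5."""
    lam = torch.distributions.Beta(alpha, beta).sample()
    # z = lambda * e_class + (1 - lambda) * e_text
    z = lam * class_emb + (1.0 - lam) * text_emb
    return z
```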

5. Image Synthesis via Pre-Trained Diffusion Model

Image generation is carried out using a pre-trained Stable Diffusion (v1-5) latent diffusion model:

  • The interpolated semantic vector $z$ conditions the UNet’s cross-attention blocks, directing the denoising process of the latent representation in accordance with both the primary concept and refined contextual information.
  • Stable Diffusion operates in a compressed latent space, making the process computationally efficient while maintaining detailed visual output.
  • The final resolved latent is decoded by a variational autoencoder (VAE) to produce the pixel-level image.

This methodology benefits from the expressiveness and generative power of large-scale diffusion models, amplified by CLIP-space conditioning.
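
As one illustration of how such a pooled conditioning vector might be supplied to a pre-trained Stable Diffusion v1-5 pipeline, the sketch below uses the Hugging Face `diffusers` `prompt_embeds` interface and simply tiles the 768-dimensional vector across the token dimension expected by the UNet's cross-attention; the tiling, the checkpoint identifier, and the sampler settings are simplifying assumptions, not the paper's exact conditioning mechanism.

```python
import torch
from diffusers import StableDiffusionPipeline

# Checkpoint identifier may differ depending on the hosting of SD v1-5 weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate_from_condition(z, n_tokens=77, steps=50, guidance=7.5):
    """z: [768] interpolated CLIP-space vector. Tiled to [1, 77, 768] so it can be
    passed as prompt_embeds to the UNet cross-attention (a simplifying assumption)."""
    prompt_embeds = z.to(pipe.device, dtype=pipe.dtype)[None, None, :].repeat(1, n_tokens, 1)
    image = pipe(prompt_embeds=prompt_embeds,
                 num_inference_steps=steps,
                 guidance_scale=guidance).images[0]   # VAE decoding happens inside the pipeline
    return image
```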

6. Experimental Results and Metrics

CATVis demonstrates strong empirical performance with substantial improvements on multiple frontiers:

| Metric | CATVis Result | Gain over Reported SOTA / Baselines |
|---|---|---|
| EEG Classification Top-1 | 61.09% | +13.43% over prior methods (e.g., BrainVis) |
| Generation Accuracy (GA) | 0.5678 | +15.21% on average GA |
| Inception Score (IS) | 37.5 | -- |
| Fréchet Inception Distance (FID) | ↓36.61% | Compared to leading alternatives |

Classification, generation accuracy, IS, and FID jointly validate the method’s capacity for faithful concept prediction, plausible image synthesis, and improved semantic alignment.

7. Comparative Innovations and Positioning

CATVis introduces several distinctions relative to prior works:

  • Efficient supervised Conformer encoder for EEG, in contrast to earlier LSTM/CNN-based or heavily self-supervised architectures.
  • Contrastive cross-modal alignment, enabling the EEG modality to interact with rich text and image semantics, outperforming approaches that rely solely on neural or visual features.
  • Caption re-ranking and semantic interpolation provide explicit mechanisms for enriching context and compensating for the ambiguity and incompleteness of EEG signals.
  • End-to-end interpretability is enhanced via explicit compositionality: every contribution—from concept decoding to caption refinement—maps to a distinct architectural step.

Additionally, CATVis achieves state-of-the-art results while avoiding prohibitive computational costs associated with cascaded diffusion stacks or multi-step pretraining. Its multi-stage, semantically-aware design demonstrably improves both classification and generation, supporting applications ranging from cognitive state monitoring to next-generation BCI-based visual reconstruction.

In summary, CATVis exemplifies a modern EEG2IMAGE framework, integrating contemporary neural architectures, cross-modal semantic embedding strategies, and conditioned diffusion-based synthesis to achieve context-aware thought visualization, with a strong emphasis on semantic faithfulness, flexibility, and extensibility (Mehmood et al., 15 Jul 2025).
