Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval

Published 20 Apr 2026 in cs.CV | (2604.17782v1)

Abstract: Zero-shot EEG-to-image retrieval aims to decode perceived visual content from electroencephalography (EEG) by aligning neural responses with pretrained visual representations, providing a promising route toward scalable visual neural decoding and practical brain-computer interfaces. However, robust EEG-to-image retrieval remains challenging, because prior methods usually rely on either a single fixed visual target or a subject-invariant target construction scheme. Such designs overlook two important properties of visually evoked EEG signals: they preserve information across multiple representational scales, and the visual granularity best matched to EEG may vary across subjects. To address these issues, a subject-aware multi-granularity alignment (SAMGA) framework is proposed for zero-shot EEG-to-image retrieval. SAMGA first constructs a subject-aware visual supervision target by adaptively aggregating multiple intermediate representations from a pretrained vision encoder, allowing the model to absorb subject-dependent granularity deviations during training while preserving subject-agnostic inference. Building on this adaptive target construction, a coarse-to-fine cross-modal alignment strategy is further designed with a shared encoder, wherein the coarse stage stabilizes the shared semantic geometry and reduces subject-induced distribution shift, and the fine stage further improves instance-level retrieval discrimination. Extensive experiments on the THINGS-EEG benchmark demonstrate that the proposed method achieves 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting, and 34.4% Top-1 and 64.8% Top-5 accuracy in the inter-subject setting, outperforming recent state-of-the-art methods.


Authors (6)

Summary

The paper introduces the SAMGA framework, which improves zero-shot EEG-to-image retrieval accuracy by accounting for subject-specific deviations in visual granularity.
SAMGA combines subject-aware target construction with a coarse-to-fine alignment strategy, achieving 91.3% Top-1 accuracy in the intra-subject setting and 34.4% Top-1 accuracy in the inter-subject setting.
The results point to SAMGA's potential for BCIs and scalable visual neural decoding, and the authors suggest refining the granularity adaptation mechanism as future work.

Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval

The work presented in "Subject-Aware Multi-Granularity Alignment for Zero-Shot EEG-to-Image Retrieval" (2604.17782) introduces the Subject-Aware Multi-Granularity Alignment (SAMGA) framework, designed to address challenges in zero-shot EEG-to-image retrieval. This framework focuses on enhancing the alignment of visually evoked electroencephalography (EEG) signals with pretrained visual representations by explicitly accounting for the multi-scale and subject-specific nature of EEG data. The primary limitations of prior methods, which often rely on fixed visual targets or subject-invariant target construction, are overcome by SAMGA's adaptive approach.

Subject-Aware Multi-Granularity Target Construction

A central tenet of the SAMGA framework is the construction of a subject-aware multi-granularity visual supervision target. Visually evoked EEG signals inherently contain information across various representational scales, and the optimal visual granularity for alignment can vary significantly across individuals. To accommodate this, SAMGA adaptively aggregates multiple intermediate representations from a frozen, pretrained vision encoder. This design enables the model to absorb subject-dependent granularity deviations during training while maintaining subject-agnostic inference capabilities.

The framework employs a subject-aware router that modulates the contributions of different intermediate visual layers. This router is not sample-specific but rather models the routing distribution as a combination of a global granularity prior and a subject-specific deviation term. During training, subject dropout is implemented to prevent an over-reliance on individual subject biases, while layer-level dropout ensures that the routing distribution does not collapse onto a limited subset of layers. This mechanism allows for the learning of a flexible visual target that is better matched to the multi-scale information present in EEG signals. During inference, only the global routing prior is utilized, ensuring generalizability to unseen subjects without requiring explicit subject identity. The effectiveness of this multi-layer fusion mechanism is quantitatively demonstrated through an ablation study, where learned fusion consistently outperforms both single best-layer alignment and uniform multi-layer fusion, achieving 91.3% Top-1 and 98.8% Top-5 accuracy in the intra-subject setting.
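The routing behavior described above can be illustrated with a minimal sketch. It assumes the routing distribution is a softmax over the sum of global prior logits and a per-subject deviation vector, and that layer dropout masks logits before the softmax. Those choices, along with the names `routing_weights`, `fuse_layers`, `p_subject_drop`, and `p_layer_drop`, are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax; -inf logits receive zero weight."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_weights(global_logits, subject_delta=None, training=False,
                    p_subject_drop=0.0, p_layer_drop=0.0, rng=random):
    """Return one mixing weight per intermediate vision-encoder layer.

    global_logits : shared granularity prior, one logit per layer.
    subject_delta : subject-specific deviation (hypothetical form); it is
                    only applied in training mode, so inference stays
                    subject-agnostic.
    """
    logits = list(global_logits)
    # Subject dropout: sometimes train with the global prior alone.
    if training and subject_delta is not None:
        if rng.random() >= p_subject_drop:
            logits = [g + d for g, d in zip(logits, subject_delta)]
    # Layer dropout: randomly mask layers so routing cannot collapse
    # onto a small subset; always keep at least one layer.
    if training and p_layer_drop > 0.0:
        keep = [rng.random() >= p_layer_drop for _ in logits]
        if not any(keep):
            keep[rng.randrange(len(keep))] = True
        logits = [x if k else float("-inf") for x, k in zip(logits, keep)]
    return softmax(logits)


def fuse_layers(layer_feats, weights):
    """Weighted sum of per-layer feature vectors (all of equal length)."""
    dim = len(layer_feats[0])
    return [sum(w * f[i] for w, f in zip(weights, layer_feats))
            for i in range(dim)]


# Toy usage: three layers with 2-D features (values are made up).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prior = [0.0, 0.0, 0.0]
train_w = routing_weights(prior, subject_delta=[2.0, 0.0, -1.0],
                          training=True, p_subject_drop=0.1,
                          p_layer_drop=0.1, rng=random.Random(0))
infer_w = routing_weights(prior)  # global prior only
target = fuse_layers(feats, infer_w)
```

Because the subject term is only ever added in training mode, the same function serves unseen subjects at inference without requiring a subject identity.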

Coarse-to-Fine Cross-Modal Alignment

Building upon the subject-aware visual target, SAMGA incorporates a coarse-to-fine cross-modal alignment strategy utilizing a shared encoder. This strategy is critical for learning a robust shared embedding space that is both stable across subjects and discriminative for retrieval, particularly given the inherent noise and inter-subject variability in EEG signals. The shared encoder imposes a common transformation on both EEG and image representations, promoting a unified geometric rule across modalities and reducing modality-specific distortions.

The alignment process proceeds in two stages:

Coarse Stage: This initial phase prioritizes the stabilization of the shared semantic geometry and the reduction of subject-induced distribution shift. A combined objective function, incorporating both a symmetric contrastive retrieval loss and a multi-kernel maximum mean discrepancy (MK-MMD) loss, is utilized. The weighting coefficient for the MK-MMD loss gradually decays, shifting the optimization focus from structural alignment to discriminative learning.
Fine Stage: Once the global semantic geometry is adequately stabilized, the model proceeds to the fine stage. Here, the shared encoder is frozen, and the learning rate is reduced. Optimization is then exclusively focused on the retrieval objective, refining instance-level discriminability within the established shared space. This explicit decoupling of global shared-space formation from fine-grained retrieval refinement contributes to more robust optimization and enhanced cross-subject generalization. Ablation studies confirm the efficacy of this two-stage strategy, with the full model consistently outperforming one-stage variants, approaches without stage-specific learning rate adjustments, and versions lacking shared encoder freezing in the second stage.
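The coarse-stage objective described above can be sketched as a symmetric contrastive loss plus a decaying MK-MMD term. This is a sketch under stated assumptions and not the authors' implementation: the temperature, the Gaussian bandwidths, the exponential decay schedule, and the choice to measure MMD between EEG and image embedding batches are all illustrative. The paper may instead compare other distributions, such as embeddings from different subjects.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def cross_entropy_diag(sim_rows):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(sim_rows):
        m = max(row)
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim_rows)


def symmetric_contrastive_loss(eeg, img, temperature=0.07):
    """Average of EEG-to-image and image-to-EEG InfoNCE losses."""
    eeg = [l2_normalize(v) for v in eeg]
    img = [l2_normalize(v) for v in img]
    sim = [[dot(e, v) / temperature for v in img] for e in eeg]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


def multi_kernel(a, b, bandwidths):
    """Average of Gaussian kernels with several bandwidths."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return sum(math.exp(-d2 / (2.0 * s * s)) for s in bandwidths) / len(bandwidths)


def mk_mmd2(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD under the averaged kernel."""
    def mean_k(p, q):
        return sum(multi_kernel(a, b, bandwidths)
                   for a in p for b in q) / (len(p) * len(q))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


def mmd_weight(epoch, initial=1.0, decay=0.8):
    """Hypothetical schedule: the MMD weight shrinks every epoch."""
    return initial * decay ** epoch


def coarse_stage_loss(eeg, img, epoch):
    """Retrieval loss plus a distribution-alignment term that fades out."""
    return (symmetric_contrastive_loss(eeg, img)
            + mmd_weight(epoch) * mk_mmd2(eeg, img))


# Toy batch of two paired embeddings (values are made up).
eeg_batch = [[1.0, 0.1], [0.0, 1.0]]
img_batch = [[0.9, 0.0], [0.1, 1.1]]
early = coarse_stage_loss(eeg_batch, img_batch, epoch=0)
late = coarse_stage_loss(eeg_batch, img_batch, epoch=20)
```

In the fine stage, only the contrastive term would remain, with the shared encoder's parameters excluded from updates and a smaller learning rate. Those steps depend on the training framework and are not shown here.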

Experimental Results and Implications

The SAMGA framework was evaluated on the THINGS-EEG benchmark, where it outperformed existing state-of-the-art methods. In the intra-subject setting, SAMGA attained 91.3% Top-1 and 98.8% Top-5 accuracy. In the more challenging inter-subject setting, it achieved 34.4% Top-1 and 64.8% Top-5 accuracy. Relative to prior baselines, these results correspond to reported Top-1 improvements of 8.7% and 12.0%, respectively. Similar trends were observed on the THINGS-MEG dataset, with 49.4% Top-1 and 74.8% Top-5 accuracy intra-subject, and 6.1% Top-1 and 16.6% Top-5 accuracy inter-subject.
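For readers unfamiliar with the metric, Top-k retrieval accuracy can be computed as in the following minimal sketch. It assumes a square similarity matrix in which the correct image for EEG query i sits at candidate index i. The function name `top_k_accuracy` and the toy scores are illustrative and are not results from the paper.

```python
def top_k_accuracy(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between EEG query i and candidate
    image j; the correct candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three queries and three candidates (made-up values).
toy_scores = [
    [0.9, 0.1, 0.0],  # query 0: correct image ranked first
    [0.8, 0.5, 0.1],  # query 1: correct image ranked second
    [0.2, 0.3, 0.1],  # query 2: correct image ranked third
]
top1 = top_k_accuracy(toy_scores, 1)
top2 = top_k_accuracy(toy_scores, 2)
```

In zero-shot evaluation, the candidate pool consists of images from concepts not seen during training, so the similarity matrix is built from test-set EEG embeddings against test-set image embeddings.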

Semantic structure analysis revealed that the aligned EEG representations preserve meaningful semantic organization across subjects. Visualizations of inter-subject concept similarity matrices demonstrated clear local clustering within semantic categories, such as "animal" and "food," indicating that the EEG embeddings retain stable semantic structure. Further analysis indicated that optimal visual-depth preference varies across semantic categories, both at coarse and fine-grained levels. For example, "animal," "tool," and "others" favored layer 32, "food" preferred layer 36, and "vehicle" performed best at shallower intermediate layers (24 and 28). This category-dependent visual-depth preference underscores the necessity of adaptive granularity matching.
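One way to quantify the local clustering described above is to compare the average cross-subject similarity of concept pairs within the same category against pairs from different categories. The sketch below assumes each subject contributes one embedding per concept and uses cosine similarity. The function names, category labels, and numbers are toy assumptions, not the paper's analysis pipeline or data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def concept_similarity(emb_a, emb_b):
    """sim[i][j]: similarity of concept i (subject A) to concept j (subject B)."""
    return [[cosine(a, b) for b in emb_b] for a in emb_a]


def within_vs_between(sim, categories):
    """Mean similarity for same-category and different-category pairs.

    Pairs of a concept with itself (i == j) are excluded so the score
    reflects category structure and not identity matches.
    """
    within, between = [], []
    for i, ci in enumerate(categories):
        for j, cj in enumerate(categories):
            if i == j:
                continue
            (within if ci == cj else between).append(sim[i][j])

    def mean(v):
        return sum(v) / len(v) if v else float("nan")

    return mean(within), mean(between)


# Toy embeddings for four concepts from two subjects (made-up values).
subject_a = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
subject_b = [[1.0, 0.0], [0.8, 0.1], [0.0, 1.0], [0.1, 0.8]]
labels = ["animal", "animal", "food", "food"]
sim_matrix = concept_similarity(subject_a, subject_b)
within_mean, between_mean = within_vs_between(sim_matrix, labels)
```

A within-category mean that is clearly higher than the between-category mean corresponds to the block-like clusters one would expect to see in a concept similarity matrix.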

These findings have implications for visual neural decoding and brain-computer interfaces. Conceptually, the results suggest that robust EEG visual decoding depends not only on EEG encoder design but also on constructing adaptive visual supervision targets tailored to the multi-scale and subject-variable nature of EEG signals. Practically, SAMGA offers a more reliable framework for zero-shot EEG-to-image retrieval, which could enhance the scalability of visual neural decoding systems and improve the practicality of BCIs in assistive communication and intelligent interaction.

Future research directions suggested by this work include refining the granularity adaptation mechanism to capture finer trial-wise or state-dependent variability. Additionally, incorporating explicit category-aware or sample-adaptive routing into target construction and conducting more extensive cross-modality validation across EEG and MEG settings would further strengthen the generalizability of the proposed framework.

Conclusion

The SAMGA framework advances zero-shot EEG-to-image retrieval by introducing a subject-aware multi-granularity visual target construction module and a coarse-to-fine cross-modal alignment strategy. This approach explicitly addresses the subject-dependent granularity mismatch inherent in EEG signals, leading to notable improvements in retrieval accuracy across both intra-subject and inter-subject settings. The empirical results and detailed analyses underscore the importance of dynamic target construction and progressive alignment in bridging the gap between neural responses and visual representations, paving the way for more robust and adaptable visual neural decoding systems.
