Brain Interaction Transformer (BIT)

Updated 2 July 2026

Brain Interaction Transformer (BIT) is a transformer-based framework that maps neural data from fMRI to deep neural network representations using dynamic, multi-head attention.
It dynamically routes information between retinotopic neural sources and higher-order semantic targets, providing interpretable insights into brain function.
BIT has demonstrated improved performance in brain encoding, image reconstruction, and visual question answering, setting new benchmarks on fMRI datasets.

The Brain Interaction Transformer (BIT) is a transformer-based modeling framework for mapping between neural data—principally fMRI responses measured in visual cortex during perception of naturalistic images—and representations in deep neural networks and/or downstream semantic tasks. BIT is distinguished by its use of content-based, multi-head attention to dynamically route information between retinotopic or functionally clustered neural sources and higher-order representational or semantic targets, enabling both improved encoding and decoding performance as well as interpretable attributions of information flow in the brain. BIT has been instantiated in several lines of research—for encoding brain activity from images, reconstructing images from fMRI, and decoding semantic information such as captions and question answers—each leveraging the shared principle of attention-based neural routing (Adeli et al., 22 May 2025, Beliy et al., 29 Oct 2025, Beliy et al., 28 May 2026).

1. Architectural Principles and Attention-Based Routing

BIT architectures are unified by their use of transformer attention mechanisms for dynamic, content-sensitive routing of information from input representations (image features or brain responses) to output targets (brain activities, image features, or language tokens). The model typically consists of a sequence backbone and a multi-head cross-attention decoder:

In encoding mode (Adeli et al., 22 May 2025), the backbone is a frozen vision transformer (ViT; 12 layers, patch size 14×14, token dimension $d=768$), producing $N$ spatial tokens $X \in \mathbb{R}^{N \times d}$, one per image patch. Positional embeddings $P \in \mathbb{R}^{N \times d}$ encode patch retinotopy; the cross-attention decoder employs one learnable query embedding per anatomical region of interest (ROI) or vertex, forming $Q \in \mathbb{R}^{N_q \times d}$, where $N_q$ is the number of queries.
The canonical cross-attention operation is:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

with per-query attention weights and outputs

$a_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d_k})}{\sum_{j'}\exp(q_i\cdot k_{j'}/\sqrt{d_k})}, \quad z_i = \sum_{j=1}^N a_{ij} v_j$

Here, $q_i$, $k_j$, and $v_j$ are rows of $Q$, $K$, and $V$, and $d_k$ is the key dimension. Setting $K = X + P$ fuses spatial and content-based information, so that ROI or vertex queries can access both local (retinotopic) and non-local (content-driven) features.
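The cross-attention step above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the published implementation: it omits the learned query/key/value projections, assumes the values are the raw patch tokens ($V = X$, which the text does not state), and uses toy sizes and the hypothetical name `roi_cross_attention`.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def roi_cross_attention(queries, X, P):
    """Single-head cross-attention from ROI queries to image patch tokens.

    queries: (R, d) learnable ROI/vertex query embeddings
    X:       (N, d) patch tokens from the backbone
    P:       (N, d) positional embeddings
    Returns (Z, A): outputs of shape (R, d) and attention weights of shape (R, N).
    """
    d_k = queries.shape[-1]
    K = X + P   # keys fuse content and position, as in the text
    V = X       # assumption: values are the raw patch tokens
    A = softmax(queries @ K.T / np.sqrt(d_k), axis=-1)
    Z = A @ V
    return Z, A

rng = np.random.default_rng(0)
N, d, R = 16, 8, 3   # toy sizes, not the 768-dimensional ViT described above
X = rng.normal(size=(N, d))
P = rng.normal(size=(N, d))
queries = rng.normal(size=(R, d))
Z, A = roi_cross_attention(queries, X, P)
print(Z.shape, A.shape)
```

Each row of `A` is a probability distribution over patches, which is what makes the routing directly inspectable.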

In the decoding direction (Beliy et al., 29 Oct 2025, Beliy et al., 28 May 2026), the input space consists of functionally clustered fMRI voxels, forming “brain tokens” by attention-based aggregation within clusters, followed by a deep stack of self-attention and cross-attention layers to produce target representations such as CLIP or VGG image features, caption tokens, or VQA prompt embeddings.

BIT’s attention scores naturally provide a dynamic, interpretable mapping from input to target: in encoding, attention quantifies how much a neural area “reads out” from each visual patch; in decoding, it identifies which clusters or regions contribute to which aspect of the output representation.
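One simple way to read such attention scores in the encoding direction is to reshape a query's attention row onto the patch grid and summarize where it concentrates. The sketch below is illustrative only: it assumes row-major patch ordering, and `attention_center` is a hypothetical helper, not part of any published BIT code.

```python
import numpy as np

def attention_center(attn_row, grid_h, grid_w):
    """Centre of mass (row, col) of one query's attention over a patch grid."""
    amap = np.asarray(attn_row, dtype=float).reshape(grid_h, grid_w)
    rows, cols = np.mgrid[0:grid_h, 0:grid_w]
    total = amap.sum()
    return float((rows * amap).sum() / total), float((cols * amap).sum() / total)

# A query that attends to a single patch at grid position (1, 2) of a 4x4 grid.
focused = np.zeros(16)
focused[1 * 4 + 2] = 1.0
focused_center = attention_center(focused, 4, 4)

# A query that attends uniformly to all patches.
uniform_center = attention_center(np.full(16, 1 / 16), 4, 4)
print(focused_center, uniform_center)
```

A spatially stable centre across images would be consistent with retinotopic readout, while a centre that moves with image content would be consistent with content-driven routing.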

2. Functional Tokenization and Cluster-Based Representations

Decoding-oriented BIT architectures utilize a multi-stage aggregation of fMRI voxel activations:

The voxels $V$ are partitioned into $M$ functional clusters $C_1, \dots, C_M$; each voxel $v$ is assigned a learned embedding $e_v$, and each cluster $C_m$ has an embedding $c_m$. Clustering uses Gaussian mixture modeling (GMM) on fMRI encoding embeddings for intersubject generalization (Beliy et al., 29 Oct 2025, Beliy et al., 28 May 2026).
Cluster tokens are computed by within-cluster attention: for each cluster $C_m$, attention over its member voxels aggregates their contributions into a single token $t_m$, where $x_v$ is the activation of voxel $v$. The resulting $M$-token sequence is further processed by $L$ layers of multi-head self-attention.

Final prediction flows through cross-attention from learned query tokens to the cluster sequence, generating outputs such as localized image features (CLIP, VGG) or language-conditioning prompts.
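As a rough illustration of the tokenization stage, the sketch below aggregates voxel activations into one token per cluster. The exact attention form is an assumption made for this example: the cluster embedding acts as the query, voxel embeddings act as keys, and each value is the voxel embedding scaled by that voxel's activation. The function name `cluster_tokens` and all sizes are hypothetical.

```python
import numpy as np

def cluster_tokens(x, voxel_emb, cluster_emb, assignment):
    """Aggregate voxel activations into one token per cluster.

    x:           (n_voxels,)    activation of each voxel
    voxel_emb:   (n_voxels, d)  per-voxel embeddings
    cluster_emb: (n_clusters, d) per-cluster embeddings (used as queries)
    assignment:  (n_voxels,)    integer cluster index of each voxel
    Returns tokens of shape (n_clusters, d).
    """
    n_clusters, d = cluster_emb.shape
    tokens = np.zeros((n_clusters, d))
    for c in range(n_clusters):
        members = np.flatnonzero(assignment == c)
        if members.size == 0:
            continue  # an empty cluster keeps a zero token
        keys = voxel_emb[members]                      # (m, d)
        scores = keys @ cluster_emb[c] / np.sqrt(d)    # (m,)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                       # softmax within the cluster
        values = x[members, None] * keys               # activation-scaled embeddings
        tokens[c] = weights @ values
    return tokens

rng = np.random.default_rng(0)
n_voxels, d, n_clusters = 12, 4, 3
x = rng.normal(size=n_voxels)
voxel_emb = rng.normal(size=(n_voxels, d))
cluster_emb = rng.normal(size=(n_clusters, d))
assignment = np.arange(n_voxels) % n_clusters   # toy assignment, not a GMM fit
tokens = cluster_tokens(x, voxel_emb, cluster_emb, assignment)
print(tokens.shape)
```

Because the attention weights in this sketch depend only on the embeddings, the same weights can be reused across subjects that share cluster definitions, which is one way to picture the transfer property described next.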

This structure supports efficient transfer and generalization: shared clusters and weights allow training and adaptation with limited new-subject data (e.g., high-fidelity reconstructions with only 1 hour of fMRI) (Beliy et al., 29 Oct 2025).

3. Training Paradigms and Objectives

BIT is trained under supervised objectives specific to the encoding or decoding task:

For visual encoding (Adeli et al., 22 May 2025), BIT minimizes mean squared error (MSE) between predicted and true fMRI responses per image and per-vertex/ROI:

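$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{BR} \sum_{b=1}^{B} \sum_{r=1}^{R} \left(\hat{y}_{b,r} - y_{b,r}\right)^2$

where $\hat{y}_{b,r}$ and $y_{b,r}$ are the predicted and measured responses of vertex or ROI $r$ to image $b$.

No additional $\ell_1$/$\ell_2$ regularization is used beyond dropout.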
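For concreteness, the encoding objective can be computed as below. This is a minimal sketch with made-up numbers; `encoding_mse` is a hypothetical helper, and the averaging over images and vertices is an assumed normalization.

```python
import numpy as np

def encoding_mse(pred, target):
    """Mean squared error over images (rows) and vertices/ROIs (columns)."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_vertex = np.mean((pred - target) ** 2, axis=0)  # error of each vertex/ROI
    return per_vertex.mean(), per_vertex

# Toy data: 2 images x 3 vertices (values are invented for illustration).
y_true = np.array([[0.0, 1.0, 2.0],
                   [1.0, 1.0, 0.0]])
y_pred = np.array([[0.0, 2.0, 2.0],
                   [1.0, 0.0, 2.0]])
loss, per_vertex = encoding_mse(y_pred, y_true)
print(loss, per_vertex)
```

Keeping the per-vertex errors alongside the scalar loss makes it easy to see which regions are predicted poorly.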

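For image reconstruction (Beliy et al., 29 Oct 2025), two-branch training is employed:
- Alignment loss: L2 loss matching predicted CLIP/VGG tokens against ground truth.
- Diffusion loss: standard denoising (U-Net) loss conditioned on semantic and structural features derived from BIT outputs:
$\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0, \epsilon, t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\right]$
where $z_t$ is the noised sample at step $t$, $\epsilon$ is the added noise, $\epsilon_\theta$ is the U-Net, and $c$ denotes the BIT-derived conditioning.
- InfoNCE loss for VGG-feature matching.

For VQA (Beliy et al., 28 May 2026), training proceeds in two stages: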
- Stage 1: MSE loss for BIT’s prediction of CLIP-aligned and LLM-conditioning tokens from fMRI.
- Stage 2: Negative log-likelihood for answer generation, finetuning BIT and Q-Former adapters via LoRA, with the LLM frozen.
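The InfoNCE term listed for image reconstruction can be illustrated with a generic batch-contrastive formulation. The sketch below is not taken from the cited work: the cosine-similarity logits, the temperature of 0.1, and the use of other batch items as negatives are all assumptions, and `info_nce` is a hypothetical name.

```python
import numpy as np

def info_nce(pred, target, temperature=0.1):
    """InfoNCE loss over a batch: pred[i] should match target[i].

    pred, target: (B, d) feature vectors; other rows in the batch act as negatives.
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature               # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the log-softmax
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))
loss_matched = info_nce(features, features)          # every prediction equals its target
loss_shuffled = info_nce(features, features[::-1])   # every prediction paired with a wrong target
print(loss_matched, loss_shuffled)
```

The loss is low when each predicted feature is closest to its own target and rises when the pairing is broken, which is the behavior a feature-matching objective needs.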

Stochastic voxel subsampling, weight decay, and dropout are used for regularization. Data splits, cross-validation, and evaluation follow standard fMRI benchmarks (e.g., Algonauts/NSD).
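Stochastic voxel subsampling can be pictured as dropping a random subset of input voxels at each training step. The source does not specify the sampling scheme, so the uniform sampling without replacement, the `keep_fraction` parameter, and the helper name `subsample_voxels` below are assumptions for illustration.

```python
import numpy as np

def subsample_voxels(x, keep_fraction, rng):
    """Randomly keep a fraction of voxels for one training step.

    x: (batch, n_voxels) fMRI responses.
    Returns the reduced responses and the sorted indices of the kept voxels,
    so that matching voxel embeddings can be selected with the same indices.
    """
    n_voxels = x.shape[1]
    n_keep = max(1, int(round(keep_fraction * n_voxels)))
    kept_idx = np.sort(rng.choice(n_voxels, size=n_keep, replace=False))
    return x[:, kept_idx], kept_idx

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 10))
x_kept, kept_idx = subsample_voxels(x, keep_fraction=0.5, rng=rng)
print(x_kept.shape, kept_idx)
```

Returning the indices matters in a tokenized model, because each surviving voxel must still be paired with its own embedding and cluster.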

4. Applications: Encoding, Reconstruction, and Visual Question Answering

BIT approaches have been applied to several high-level tasks:

Brain Encoding: Predicting fMRI responses from visual input. BIT achieves higher explainable variance than ridge regression or spatial-feature-factorized baselines on the Natural Scenes Dataset (NSD), showing improvements especially in category-selective ROIs (Adeli et al., 22 May 2025).
Image Reconstruction from fMRI: BIT predicts localized high-level (CLIP) and low-level (VGG) image features from fMRI, used as conditioning for diffusion models. This yields improved faithfulness (PixCorr 0.386, SSIM 0.486) and efficiency (1 hour transfer learning approaches full-data baselines) on NSD (Beliy et al., 29 Oct 2025).
Visual Question Answering (VQA) from fMRI: In the Brain-IT-VQA (BIT-L) framework, BIT enables end-to-end decoding of answers to image-related questions directly from fMRI, integrating with a pre-trained vision-LLM via cross-attention-generated prompt embeddings. Results on NSD-VQA show short-answer accuracy of 73.78% and high full-sentence metric scores (BLEU-4=88.09, METEOR=60.54, CIDEr=0.833), setting new benchmarks for neural decoding tasks (Beliy et al., 28 May 2026).

Table: Summary of BIT Applications and Performance Metrics

Task	Key Output	Example Metric (BIT vs. SOTA)
Brain Encoding	fMRI response	S1: 0.60 (BIT), 0.56 (ridge) (Adeli et al., 22 May 2025)
Image Reconstruction	Image (pixels/features)	PixCorr: 0.386 vs 0.322 (Beliy et al., 29 Oct 2025)
Visual Question Answering	Answer text	Short-answer: 73.78% vs 72.60% (Beliy et al., 28 May 2026)
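The PixCorr values in the table are pixel-level correlation scores. As a sketch of how such a metric is commonly computed, the code below takes the Pearson correlation between flattened pixel values of each reconstruction and its ground-truth image and averages over images. Any resizing or preprocessing used in the cited evaluation is not reproduced here, and `pixcorr` is a hypothetical helper.

```python
import numpy as np

def pixcorr(recon, target):
    """Mean Pearson correlation between flattened pixel values, per image."""
    recon = np.asarray(recon, dtype=float).reshape(len(recon), -1)
    target = np.asarray(target, dtype=float).reshape(len(target), -1)
    recon = recon - recon.mean(axis=1, keepdims=True)
    target = target - target.mean(axis=1, keepdims=True)
    num = (recon * target).sum(axis=1)
    den = np.linalg.norm(recon, axis=1) * np.linalg.norm(target, axis=1)
    return float(np.mean(num / den))

rng = np.random.default_rng(0)
images = rng.uniform(size=(4, 8, 8, 3))       # toy batch of 8x8 RGB images
same = pixcorr(images, images)                # identical images
rescaled = pixcorr(0.5 * images + 0.1, images)  # brightness/contrast change only
inverted = pixcorr(1.0 - images, images)      # inverted images
print(same, rescaled, inverted)
```

Because Pearson correlation is invariant to brightness and contrast changes, the metric rewards correct spatial structure rather than exact intensity values.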

5. Interpretability, Mechanistic Insights, and Neuroscientific Relevance

BIT’s attention structure provides direct interpretability: cross-attention weights indicate the routing of information from spatial/image or neural clusters to output units, which can be visualized or analyzed quantitatively.

In encoding, early visual ROIs exhibit spatially-specific attention matching known retinotopy, while higher-order areas (e.g., face- or body-selective) show content-driven, dynamic receptive fields (Adeli et al., 22 May 2025). This supports the hypothesis that high-level cortex gates information flow based on input content relevance—a plausible computational motif for biological vision.
In VQA decoding, masking and ridge analysis of cluster contributions link functional specializations (e.g., ventral clusters for food queries, dorsal for action/object queries) to specific decoding capability, enabling in-silico lesion studies and hypothesis-driven investigation (Beliy et al., 28 May 2026).
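An in-silico lesion of this kind amounts to zeroing selected cluster tokens and measuring how much the decoder output changes. The sketch below uses a toy linear readout in place of a trained decoder; `lesion_effects`, `toy_decode`, and the readout weights are hypothetical and chosen only to make the effect visible.

```python
import numpy as np

def lesion_effects(decode, tokens):
    """Output change caused by zeroing each cluster token in turn."""
    baseline = decode(tokens)
    effects = np.zeros(len(tokens))
    for c in range(len(tokens)):
        lesioned = tokens.copy()
        lesioned[c] = 0.0   # "lesion" cluster c
        effects[c] = np.linalg.norm(decode(lesioned) - baseline)
    return effects

# Toy stand-in for a trained decoder: a fixed weighted sum over cluster tokens.
readout = np.array([0.5, 2.0, 0.0, 1.0])

def toy_decode(tokens):
    return readout @ tokens

tokens = np.ones((4, 3))    # 4 clusters, 3-dimensional tokens
effects = lesion_effects(toy_decode, tokens)
print(effects)
```

In this toy setting the cluster with the largest readout weight produces the largest effect and the unused cluster produces none; with a real decoder, the same procedure could be repeated per question category to relate clusters to query types.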

This suggests that transformer-based models such as BIT can serve as both predictive and explanatory tools for visual information processing in cortex, offering experimentally grounded analogues for flexible attention and routing in biological neural systems.

6. Limitations and Prospects

Current BIT models are limited to static-image fMRI; extensions to EEG, intracranial recordings, videos, and multisensory inputs remain to be explored (Adeli et al., 22 May 2025). The decoder’s single-layer architecture abstracts over known neural feedback and normalization mechanisms; plausible future modifications include deeper, recurrent, or anatomically constrained modules, as well as direct anatomical wiring constraints in cluster assignment (Adeli et al., 22 May 2025, Beliy et al., 29 Oct 2025).

Notably, fine-grained semantic and non-visual features, as well as output space complexity, currently constrain decoding accuracy, with diminishing performance on fine-grained color, category, or attribution queries (Beliy et al., 28 May 2026). A plausible implication is that hierarchical and integrated representations—as implemented by deeper or multimodal transformers—may be required to capture the full complexity of brain-derived semantic information.

BIT’s shared-parameter, cross-subject framework and its dynamic attention mechanisms provide a platform for investigating neural representation, individual variability, and the functional role of cortical networks in cognition and perception.