TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction

Published 29 Jul 2025 in cs.LG | (2507.22229v1)

Abstract: Historically, neuroscience has progressed by fragmenting into specialized domains, each focusing on isolated modalities, tasks, or brain regions. While fruitful, this approach hinders the development of a unified model of cognition. Here, we introduce TRIBE, the first deep neural network trained to predict brain responses to stimuli across multiple modalities, cortical areas and individuals. By combining the pretrained representations of text, audio and video foundational models and handling their time-evolving nature with a transformer, our model can precisely model the spatial and temporal fMRI responses to videos, achieving the first place in the Algonauts 2025 brain encoding competition with a significant margin over competitors. Ablations show that while unimodal models can reliably predict their corresponding cortical networks (e.g. visual or auditory networks), they are systematically outperformed by our multimodal model in high-level associative cortices. Currently applied to perception and comprehension, our approach paves the way towards building an integrative model of representations in the human brain. Our code is available at https://github.com/facebookresearch/algonauts-2025.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces TRIBE, a transformer-based model that integrates text, audio, and video features to predict whole-brain fMRI responses.
The paper details a methodology using pretrained embeddings and specialized linear layers to standardize dimensions before transformer encoding, capturing 54% of explainable variance.
The paper demonstrates significant performance improvements in multisensory integration and out-of-distribution scenarios, validated through competition results and modality ablation studies.

TRIBE: TRImodal Brain Encoder for Whole-Brain fMRI Response Prediction

Introduction

The paper "TRIBE: TRImodal Brain Encoder for whole-brain fMRI response prediction" presents a neural network architecture that aims to predict fMRI brain responses to video stimuli across various modalities, cortical regions, and individuals. Historically, neuroscience has tended to focus on specialized domains; this has led to fragmented modeling efforts that overlook integrative and unified representations of cognition. TRIBE addresses this gap by integrating text, audio, and video modalities using a transformer-based architecture to achieve high-precision predictions of spatial and temporal fMRI responses.

Figure 1: TRIBE predicts brain responses to videos across diverse regions. For eight brain parcels color-coded in the brain images, we report the BOLD response of the first participant to the first 5 minutes of a held-out movie in solid lines and our model's predictions in dashed lines, with the Pearson correlation of the two curves reported on the left.

Methods

Model Architecture

TRIBE leverages pretrained embedding layers from state-of-the-art foundational models for text (Llama-3.2-3B), audio (Wav2Vec-Bert-2.0), and video (Video-JEPA 2). Each modality generates high-dimensional features, which are aggregated through specialized linear layers to standardize their dimensions. Subsequently, these embeddings feed into a transformer encoder that models interactions between these modalities across temporal sequences.

Figure 2: Visual summary of our method.

Dataset and Preprocessing

The model is trained on the Courtois NeuroMod dataset, which features over 80 hours of fMRI recordings from four subjects viewing episodes from TV series and movies. Preprocessing involves projecting BOLD signals to the MNI standard space and parcellating these signals into 1000 parcels using the Schaefer atlas. This greatly reduces computational complexity while preserving functional mapping.

Training Protocol

Optimization is performed using AdamW, incorporating learning rate schedules and modality dropout to ensure robust multimodal integration. The authors employ stochastic weight averaging and ensembling across hyperparameter variations to enhance generalization. Feature extraction is further optimized by using high-throughput GPU resources.

Figure 3: Distribution across parcels.

Results

Competition Performance

TRIBE rose to prominence by securing first place in the Algonauts 2025 competition in multimodal brain encoding, demonstrating substantial performance enhancements over competing models. It excelled in predicting responses in associative cortices, crucial for multisensory integration.

Notably, TRIBE maintained robust performance even in out-of-distribution testing scenarios, as depicted through various held-out datasets. Such adaptability illustrates its potential utility across diverse futuristic applications.

Figure 4: Modality ablations.

Ablation Studies

Ablation analyses reveal that TRIBE's multimodality confers significant predictive advantages compared to unimodal models. Specifically, cross-modal information processing in associative brain regions results in notable performance gains.

Additionally, the noise ceiling analyses elucidate that the model captures 54% of explainable variance within the dataset averages, attesting to its effective modeling capability across brain areas.

Figure 5: Highest scoring modalities.

Scaling Laws

The study demonstrates promising scaling laws correlated with dataset size and model depth, hinting at further improvements in encoding accuracy with increased data availability and model complexity. These trends stress the importance of scaling models in predicting complex biological systems.

Figure 6: Model ablations.

Conclusion

The TRIBE architecture represents a pioneering effort to encapsulate multimodal and multisubject brain encoding within a singular framework. Its ability to successfully predict whole-brain responses to complex stimuli marks a substantial advancement in achieving integrative models for understanding human cognition, extending broader implications in neuroscience research.

Limitations remain prevalent, including spatial resolution limitations inherent in the use of parcel-level mappings and dependency on fMRI data. Future advances should aim to encompass voxel-level predictions and additional cognitive dimensions like memory and decision-making processes.

The broader impact of TRIBE underscores its applicability for probing deeply into cognitive complexities via naturalistic stimuli, promoting interdisciplinary alignment across neuroscience domains by leveraging AI advancements. This work sets the stage for future endeavors in computational cognition and in silico neuroexperimental frameworks.

Markdown Report Issue