Brain Treebank: Neural & Linguistic Integration
- The Brain Treebank dataset is a comprehensive resource combining intracranial recordings with precise linguistic annotations derived from naturalistic movie stimuli.
- It includes extensive multimodal metadata—spanning visual, auditory, and contextual features—to support cross-domain analysis of language processing.
- Accompanying analyses apply methods such as generalized linear models (GLMs) and decoding analyses to reveal the spatio-temporal neural dynamics underlying language comprehension.
The Brain Treebank is a large-scale resource integrating high-resolution intracranial neural recordings with richly annotated linguistic, audiovisual, and contextual information derived from naturalistic movie stimuli. It provides a comprehensive English Universal Dependencies (UD) treebank precisely time-aligned to brain responses, designed as a bridge between linguistic concepts, multimodal perception, and their neural representations in humans.
1. Dataset Composition
The Brain Treebank comprises electrophysiological brain recordings from 10 human subjects (ages 4–19; balanced gender distribution), each undergoing clinical monitoring for epilepsy with depth electrodes at Boston Children’s Hospital. Subjects viewed a total of 26 unique Hollywood movies across 21 sessions, yielding an aggregate of 43.5 hours of neural data, with each participant watching an average of 2.6 films (mean viewing time: 4.3 hours per subject). Recordings covered 1,688 electrodes in total, with an average of 168 electrodes per participant.
The dataset includes:
- Over 38,000 fully annotated sentences and approximately 223,000 words (12,412 unique word types), all manually transcribed and precisely aligned to the audio signal (word onsets/offsets marked on spectrograms).
- Each token is provided with a linguistically validated Universal Dependencies parse: automatic parses generated using the Stanza toolkit were exhaustively reviewed for part-of-speech (POS) and dependency accuracy.
- A set of 16 automatically derived features spans the visual (pixel brightness, optical flow, number of faces), auditory (volume, pitch), and linguistic domains (word index within the sentence, word length, GPT-2–based surprisal, syntactic head relationships).
- Multimodal annotations include scene labels (adopting the Places365 taxonomy) and detailed speaker identification, with manual resolution of ambiguous or impersonated speech turns.
- The data forms one of the largest English UD treebanks and is among the few aligning this level of linguistic annotation with both neural and multimodal environmental context.
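As a concrete illustration, the sketch below shows one way the released files might be read in Python. The file names and HDF5 keys used here (sub_01_trial000.h5, a "data" group keyed by electrode, a per-word CSV) are hypothetical placeholders; the actual layout is described in the dataset documentation.

```python
# Minimal loading sketch (hypothetical file names and keys; consult the official docs for the real layout).
import h5py
import pandas as pd

# Neural recordings are distributed as HDF5, assumed here to be one file per subject/movie session.
with h5py.File("sub_01_trial000.h5", "r") as f:
    electrode_names = list(f["data"].keys())    # hypothetical: one dataset per electrode
    ifp = f["data"][electrode_names[0]][:]      # intracranial field potential, sampled at 2048 Hz

# Word-level annotations: onset/offset times plus linguistic and audiovisual features,
# assumed here to be exported as a flat per-word table.
words = pd.read_csv("sub_01_trial000_words.csv")
print(words.head())
```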
2. Data Acquisition and Annotation Methodology
Electrophysiology
Intracranial stereo-EEG (sEEG) was sampled at 2048 Hz from depth electrode arrays of 6–16 contacts each, capturing intracranial field potentials (IFPs) at fine spatial and temporal granularity. Only interictal periods were retained, with exclusion of epochs marked by or immediately following epileptic seizures.
Stimulus Presentation
Movies were presented with a custom MATLAB-based player, ensuring synchronization between the video, audio, and neural recording streams. Electronic triggers were sent every 100 ms (with unique codes for play, pause, and resume events) to maintain alignment accuracy.
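Conceptually, the periodic triggers allow any movie timestamp (and hence any word onset) to be mapped onto a neural sample index by interpolating between trigger events. The sketch below illustrates the idea on synthetic trigger data; the array names, clock offset, and helper function are illustrative assumptions, not the released alignment code.

```python
import numpy as np

# Synthetic stand-ins: a trigger every 100 ms of movie time, recorded at 2048 Hz,
# with a constant offset between the movie clock and the recording clock.
trigger_times = np.arange(0.0, 60.0, 0.1)                      # movie time (s) of each trigger
trigger_samples = (trigger_times * 2048.0 + 512).astype(int)   # neural sample index of each trigger

def movie_time_to_sample(t):
    """Map a movie timestamp (s) to a neural sample index by interpolating between triggers."""
    return int(round(np.interp(t, trigger_times, trigger_samples)))

# Example: the neural sample corresponding to a word onset at 12.345 s of movie time.
onset_sample = movie_time_to_sample(12.345)
print(onset_sample)
```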
Annotation Workflow
- Transcription: Commercial ASR provided initial transcripts, which were manually corrected. Trained annotators marked word onsets and offsets with frame-level accuracy on spectrograms.
- Linguistic Parsing: Each corrected transcript underwent automatic parsing into UD structures using Stanza, with subsequent manual revision to ensure gold-standard POS tags and dependency edges.
- Feature Extraction: The OpenCV and Librosa libraries extracted visual (brightness, optical flow, faces) and auditory (volume, pitch, sliding-window changes) features, respectively. GPT-2 provided per-word surprisal scores, i.e., the negative log-probability of each word given its preceding context, $-\log P(w_t \mid w_{<t})$ (see the sketch below).
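A minimal sketch of computing per-word surprisal with GPT-2 via the Hugging Face transformers library is shown below. This is not the authors' exact pipeline: it reports per-token surprisal in bits, and words that GPT-2 splits into several tokens would need their token surprisals summed.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_surprisals(sentence):
    """Return (token, surprisal) pairs, where surprisal = -log2 P(token | preceding tokens)."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    # Each position t predicts token t+1; the first token has no left context here.
    nll = -log_probs[0, :-1, :].gather(1, ids[0, 1:, None]).squeeze(1)
    surprisal_bits = nll / torch.log(torch.tensor(2.0))
    tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())[1:]
    return list(zip(tokens, surprisal_bits.tolist()))

print(token_surprisals("The movie starts with a long silent scene."))
```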
Neural Signal Processing
Neural signals underwent minimal preprocessing: notch filtering at 60 Hz and harmonics, co-registration of electrode spatial locations using the Desikan-Killiany brain atlas, with white-matter electrode coordinates projected onto the gray–white boundary.
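The sketch below illustrates the kind of line-noise removal described above, using SciPy notch filters at 60 Hz and its harmonics. The quality factor and number of harmonics are illustrative assumptions rather than the authors' exact parameters.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

FS = 2048.0  # sEEG sampling rate (Hz)

def notch_line_noise(signal, fs=FS, line_freq=60.0, n_harmonics=4, q=30.0):
    """Remove power-line noise at 60 Hz and its harmonics with cascaded IIR notch filters."""
    out = np.asarray(signal, dtype=float)
    for k in range(1, n_harmonics + 1):
        b, a = iirnotch(w0=line_freq * k, Q=q, fs=fs)
        out = filtfilt(b, a, out)  # zero-phase filtering, so word-onset timing is preserved
    return out

# Example on synthetic data: 10 s of noise plus a strong 60 Hz line component.
t = np.arange(0, 10, 1 / FS)
x = np.random.randn(t.size) + 2.0 * np.sin(2 * np.pi * 60 * t)
clean = notch_line_noise(x)
```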
3. Analytical Approaches and Key Findings
Generalized Linear Models (GLMs) were employed to predict mean neural activity in a 500 ms post-word-onset window from all 16 visual, auditory, and language-derived regressors. The magnitude of each regressor’s beta coefficient (|β|) quantified its explanatory contribution.
- Analysis of neural time-courses demonstrated robust, time-locked word-onset responses. For example, a representative electrode in the left superior temporal sulcus exhibited consistent deflections beginning shortly before word onset and time-locked to it, observable both at the single-trial (raster) and trial-average level.
- GLM results indicated strong contributions from linguistic and auditory predictors—particularly word index in sentence, part-of-speech, and volume change—to explain neural activation variance post word-onset.
- At the sentence scale, distributed electrodes (especially in the temporal and frontal lobes) exhibited modulation by word position (sentence onset, middle, offset). Approximately 235 of the 1,688 electrodes demonstrated statistically significant sensitivity to position, independent of audiovisual features.
- Linear decoding analyses (logistic regression) of neural data in time-resolved sliding 250 ms windows (advanced in 100 ms steps) revealed differential timing and localization of linguistic computations. Decoding of sentence onsets produced earlier peak ROC–AUC values in temporal regions (∼100–200 ms post-onset) and later in frontal cortices (up to 300 ms). Decoding of noun/verb part-of-speech distinctions reached maximum discriminability in frontal and cingulate electrodes with distinct latency profiles.
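The sketch below illustrates the time-resolved decoding scheme described above: a logistic-regression classifier scored by ROC–AUC in 250 ms windows stepped by 100 ms, with normalization fit inside each cross-validation fold. The synthetic data, window summary (mean activity per window), and fold count are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

FS = 2048                  # sampling rate (Hz)
WIN, STEP = 0.250, 0.100   # 250 ms windows stepped by 100 ms

def sliding_window_auc(epochs, labels, fs=FS, win=WIN, step=STEP):
    """epochs: (n_words, n_samples) single-electrode snippets aligned to word onset.
    Returns one cross-validated ROC-AUC per window; scaling is fit within each fold."""
    win_len, step_len = int(win * fs), int(step * fs)
    aucs = []
    for start in range(0, epochs.shape[1] - win_len + 1, step_len):
        X = epochs[:, start:start + win_len].mean(axis=1, keepdims=True)  # mean activity per window
        clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
        aucs.append(cross_val_score(clf, X, labels, cv=5, scoring="roc_auc").mean())
    return np.array(aucs)

# Example with synthetic data: 200 "words", 700 ms of signal each, binary labels (e.g. noun vs. verb).
rng = np.random.default_rng(0)
epochs = rng.standard_normal((200, int(0.7 * FS)))
labels = rng.integers(0, 2, size=200)
print(sliding_window_auc(epochs, labels))
```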
4. Multimodal Features and Dataset Scope
Each word in the Brain Treebank is annotated with intricate multimodal metadata. The table summarizes the main annotation domains:
| Feature Type | Examples (selected) | Tool/Source |
|---|---|---|
| Visual | Brightness, optical flow, number of faces | OpenCV |
| Auditory | Volume, pitch, 500 ms changes in both | Librosa |
| Language | POS tag, word index, length, dependency head, surprisal | Stanza, GPT-2 |
| Contextual | Scene label (Places365), speaker identity | Manual annotation |
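As an illustration of how two of the visual features in the table might be computed, the sketch below extracts per-frame brightness and mean optical-flow magnitude with OpenCV. The flow parameters and the video path are illustrative assumptions; face counting, which needs a separate detector, is omitted.

```python
import cv2
import numpy as np

def visual_features(video_path):
    """Yield (frame_index, mean brightness, mean optical-flow magnitude) for each frame."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        raise IOError(f"could not read {video_path}")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        brightness = float(gray.mean())                        # mean pixel brightness
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flow_mag = float(np.linalg.norm(flow, axis=2).mean())  # mean optical-flow magnitude
        yield idx, brightness, flow_mag
        prev_gray, idx = gray, idx + 1
    cap.release()

# Example (hypothetical path):
# for i, brightness, flow in visual_features("movie.mp4"):
#     print(i, brightness, flow)
```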
The scale—over 38,000 sentences, 223,000 words, 12,412 unique words, and thousands of speaker/scenario transitions—renders the Brain Treebank one of the largest multimodal, linguistically detailed brain datasets, and a substantial resource in English UD treebanking with direct neural alignment.
5. Computational and Modeling Frameworks
Two principal classes of models were utilized:
- Generalized Linear Models (GLMs): The target variable was the mean neural response in the 500 ms window after word onset. Sixteen regressors (five visual, four auditory, six language-derived, one contextual) were entered; feature weights (|β|) quantified explanatory power (a minimal GLM sketch follows this list).
- Feature importances were interpretable at per-electrode and whole-brain levels.
- GPT-2–generated word-level surprisals, $-\log P(w_t \mid w_{<t})$, quantified linguistic predictability and entered the GLM as a language regressor.
- Logistic Regression Decoders: Binary classifiers were trained in sliding 250 ms windows (shifted by 100 ms) around word or sentence onsets, using neural features per electrode as input. Decoder performance was evaluated by ROC–AUC, enabling spatio-temporal mapping of linguistic information processing.
- Decoding was applied to both sentence onset detection and POS class discrimination (noun vs. verb).
- Data preprocessing included downsampling and normalization, with all transformations fit out-of-fold to prevent information leakage during cross-validation.
- Mixed-effects models aggregated decoder outputs across subjects and movies.
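A minimal sketch of the GLM analysis outlined above is shown below: an ordinary least-squares fit of one electrode's mean 500 ms post-onset response on 16 standardized regressors, with |β| magnitudes read off as feature importances. The synthetic data and generic feature names are illustrative placeholders; the authors' exact GLM specification may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: 5,000 words x 16 regressors (visual, auditory, language, contextual)
# and one electrode's mean neural response in the 500 ms after each word onset.
rng = np.random.default_rng(0)
n_words, n_features = 5000, 16
X = rng.standard_normal((n_words, n_features))
y = X @ rng.standard_normal(n_features) + rng.standard_normal(n_words)

# Standardize regressors so that |beta| magnitudes are comparable across features.
Xz = StandardScaler().fit_transform(X)
glm = LinearRegression().fit(Xz, y)

feature_names = [f"feature_{i}" for i in range(n_features)]  # placeholders, e.g. word_index, volume_change
importance = sorted(zip(feature_names, np.abs(glm.coef_)), key=lambda kv: -kv[1])
for name, beta in importance[:5]:
    print(f"{name}: |beta| = {beta:.3f}")
```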
6. Applications and Implications
The Brain Treebank enables the quantitative fusion of neural, linguistic, and multimodal analyses. Immediate applications include:
- Training and benchmarking computational models (statistical or neural network–based) for predicting or decoding language-relevant features (e.g., surprisal, POS, word onsets) from brain activity in ecologically valid contexts.
- Empirical exploration of linguistic theory in realistic settings, including the neural correlates of syntactic structure, lexical prediction, and the influence of discourse context.
- Analyses of cross-modal integration and context effects in language processing, made possible by tightly aligned annotations spanning scenes, speakers, and visual/auditory features.
This resource allows researchers to investigate the neural time-course and topography of language comprehension, with potential implications for both theoretical models and biomedical language neurotechnology.
7. Access, Distribution, and Documentation
The complete Brain Treebank corpus is available at https://BrainTreebank.dev/ under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Distributed assets include:
- Raw HDF5-formatted neural recordings with detailed electrode metadata and spatial registration.
- Full suite of aligned linguistic, audiovisual, contextual, speaker, and scene annotations.
- Gold-standard Universal Dependencies trees, manual word-spectrogram alignments, and computed feature files.
- Comprehensive documentation, sample code notebooks (including quickstart guides), and scripts for streamlined data access and preprocessing.
All resources are made openly accessible to facilitate further research in neuroscience, computational linguistics, artificial intelligence, and allied disciplines.