- The paper introduces BP-GPT, a method that employs fMRI signals as prompts to guide GPT-2 in decoding continuous language from auditory stimuli.
- It bridges the fMRI-text modality gap by aligning fMRI-derived prompts with optimal text prompts through contrastive learning and fine-tuning.
- Experimental results demonstrate significant improvements in BLEU, METEOR, and BERTScore, validating the method's effectiveness on open-vocabulary auditory decoding.
This paper introduces Brain Prompt GPT (BP-GPT), a novel method for open-vocabulary auditory neural decoding, specifically aiming to decode continuous language from fMRI signals. The research addresses two primary challenges in this domain: the low temporal resolution of fMRI and the significant modality gap between fMRI signals and text data.
The core idea behind BP-GPT is to utilize fMRI signals as a prompt to guide a pre-trained LLM, specifically GPT-2, for text generation. The method consists of two main components:
- fMRI-prompted text decoding: An fMRI encoder, implemented using a transformer architecture, processes the fMRI signals to extract a "brain prompt." This prompt is then fed into GPT-2, which generates the corresponding text autoregressively, token by token, treating the fMRI prompt as preceding context. The fMRI encoder is trained using a cross-entropy loss comparing the GPT-2 output logits to the target text tokens. The paper notes that fine-tuning the GPT-2 parameters during this stage can be beneficial.
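The prompt-as-prefix decoding with a cross-entropy objective can be sketched with a toy stand-in for GPT-2. This is a minimal illustration, not the paper's implementation: `toy_lm_logits` and `VOCAB` are hypothetical, and real training would backpropagate through an fMRI encoder and the LM.

```python
import math

VOCAB = 5  # toy vocabulary size (hypothetical; GPT-2 uses ~50k tokens)

def toy_lm_logits(context):
    """Stand-in for GPT-2: one logit vector given the prefix context."""
    # Trivial rule: prefer token (len(context) % VOCAB) at each position.
    logits = [0.0] * VOCAB
    logits[len(context) % VOCAB] = 2.0
    return logits

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def decoding_loss(fmri_prompt, text_tokens):
    """Average CE over text tokens, with the brain prompt as prefix context."""
    context = list(fmri_prompt)          # fMRI prompt acts as preceding context
    total = 0.0
    for tok in text_tokens:
        logits = toy_lm_logits(context)  # predict next token from the prefix
        total += cross_entropy(logits, tok)
        context.append(tok)              # teacher forcing with ground truth
    return total / len(text_tokens)
```

The key structural point is that the loss is computed only on the text positions; the prompt tokens merely condition the predictions.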
- Alignment with the optimal prompt: To mitigate the fMRI-text modality gap, the method introduces a text-to-text baseline: BERT encodes the text, a mapping network (also a transformer) maps the BERT representation to a "text prompt," and GPT-2 decodes the original text from this prompt. Because input and output share the same modality, the resulting text prompt is treated as an "optimal prompt" for the given text. Contrastive learning then aligns the fMRI prompt with this optimal prompt, using a contrastive loss that maximizes the similarity between positive pairs (the fMRI and text prompts of the same text) and minimizes it for negative pairs (prompts derived from different texts).
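A hedged sketch of this alignment idea, using an InfoNCE-style loss: each fMRI prompt is pulled toward the text prompt from the same text (the diagonal positive pair) and pushed away from prompts of other texts. Plain lists stand in for the pooled prompt embeddings; the temperature value is an assumption, not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two (non-zero) vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(fmri_prompts, text_prompts, temperature=0.1):
    """Batch average of -log softmax(sim(f_i, t_i) / tau) over all t_j."""
    n = len(fmri_prompts)
    loss = 0.0
    for i in range(n):
        sims = [cosine(fmri_prompts[i], t) / temperature for t in text_prompts]
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        loss += log_z - sims[i]   # positive pair sits on the diagonal (i, i)
    return loss / n
```

When fMRI and text prompts are well aligned, the diagonal similarities dominate and the loss approaches zero.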
The training process is divided into two stages: first, the text-to-text baseline is trained to learn how to generate the optimal text prompt. Second, the fMRI-to-text model is trained using a combined loss function that includes the cross-entropy loss for text generation and the contrastive loss for aligning fMRI prompts to the learned text prompts.
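The two-stage schedule can be summarized schematically. These are stubs, not the paper's code: `lambda_align` and the step callables are hypothetical names for the loss weight and per-stage optimization steps.

```python
def total_stage2_loss(ce_loss, align_loss, lambda_align=1.0):
    """Stage-2 objective: generation CE plus weighted prompt-alignment term."""
    return ce_loss + lambda_align * align_loss

def train(stage1_step, stage2_step, n1, n2):
    # Stage 1: fit the text-to-text baseline (BERT -> mapping net -> GPT-2)
    # so that the "optimal" text prompts can be produced.
    for _ in range(n1):
        stage1_step()   # minimizes CE of text reconstruction
    # Stage 2: train the fMRI encoder (optionally fine-tuning GPT-2) against
    # the combined objective, with the learned text prompts as targets.
    for _ in range(n2):
        stage2_step()   # minimizes total_stage2_loss(...)
```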
For the inference stage, after extracting the fMRI prompt, GPT-2 generates text token by token. A key challenge in the auditory decoding scenario is determining the end of text generation, as the auditory stimuli (and thus the ground-truth text) typically lack punctuation. The authors propose two strategies for this:
- Using a separate word rate model to predict the length of the text to be decoded.
- Adding special tokens (e.g., '$') to the ground-truth text during training and stopping generation once enough special tokens have been produced. Fine-tuning GPT-2 to recognize these tokens is shown to improve performance.
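The special-token stopping rule can be sketched as follows. `generate_next` is a hypothetical stand-in for one GPT-2 sampling step, and the marker character and cap on generated tokens are illustrative assumptions.

```python
def decode_until_done(generate_next, expected_markers, marker="$", max_tokens=200):
    """Generate tokens until `expected_markers` end-of-segment markers appear."""
    tokens, seen = [], 0
    while len(tokens) < max_tokens:
        tok = generate_next(tokens)     # one autoregressive sampling step
        tokens.append(tok)
        if tok == marker:
            seen += 1
            if seen >= expected_markers:  # enough segment ends: stop decoding
                break
    return tokens
```

The alternative word-rate strategy would instead set a target token count from a separate model and stop when that length is reached.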
The BP-GPT method was evaluated on an open-source auditory semantic decoding dataset (2605.07840, UniCoRN: Unified Cognitive Signal ReconstructioN bridging cognitive signals and human language, 2023), specifically using data from subjects UTS01, UTS02, and UTS03, who listened to 84 stories. Signals were taken from the auditory cortex ROI, processed in 20-second, non-overlapping windows, with a prompt length of 30 tokens. Performance was measured with BLEU-1, METEOR, and BERTScore and compared against a state-of-the-art neural encoding-based method (2605.07840).
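The windowed preprocessing can be illustrated with a short sketch. The TR value of 2.0 s is an assumption for illustration (it is not stated in this summary); with it, a 20-second window corresponds to 10 fMRI frames.

```python
def split_windows(frames, window_sec=20.0, tr_sec=2.0):
    """Split an fMRI frame sequence into non-overlapping fixed-length windows.

    `tr_sec` (repetition time) is an assumed value; any trailing partial
    window shorter than `window_sec` is dropped.
    """
    step = int(window_sec / tr_sec)  # frames per window
    return [frames[i:i + step]
            for i in range(0, len(frames) - step + 1, step)]
```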
Experimental results show that BP-GPT achieves significant improvements over the baseline, with gains of up to 4.61% in METEOR and 2.43% in BERTScore across subjects. Ablation studies confirm the effectiveness of both the contrastive alignment and the special-token stopping strategy (especially when combined with GPT-2 fine-tuning). Performance also improved with longer prompts, although the range of prompt lengths explored was limited by computational resources.
In conclusion, BP-GPT demonstrates a feasible and effective approach for open-vocabulary auditory neural decoding by framing the problem as fMRI-prompted text generation using LLMs and employing contrastive learning to bridge the modality gap. The prompt-based approach is highlighted as being easy to implement and compatible with future advancements in LLMs, allowing for straightforward performance upgrades. Future work aims to apply this paradigm to other neural decoding tasks and integrate different LLMs.