NeuGPT: Unified Multi-Modal Neural Decoding
- NeuGPT is a unified multi-modal language generation model that harmonizes diverse neural recording methods, enabling joint processing of EEG, MEG, ECoG, fMRI, and fNIRS.
- It employs a two-stage framework with NeuTokenizer for discrete neural tokenization and a transformer-based LLM to integrate cross-modal data streams from neural, speech, and text signals.
- NeuGPT achieves significant performance gains in brain-to-text decoding and neural signal simulation, paving the way for real-time BCI applications and synthetic neural data generation.
NeuGPT is a unified multi-modal language generation model designed to overcome the fragmentation of neural recording research. Unlike earlier approaches that treated EEG, MEG, ECoG, SEEG, fMRI, and fNIRS in isolation with bespoke analytic pipelines, NeuGPT harmonizes these modalities by enabling joint processing and generation of neural, speech, and text data streams. Building on the paradigm of large pre-trained models from NLP, computer vision, and speech processing, NeuGPT introduces novel discrete neural tokenization and multi-modal transformer-based inference, advancing both brain-to-text decoding and neural signal simulation capabilities. The codebase and models are publicly accessible at https://github.com/NeuSpeech/NeuGPT (Yang et al., 2024).
1. Motivation and Conceptual Framework
Neural recording research has historically suffered from compartmentalization, with each modality (EEG, MEG, etc.) handled via task-specific models tailored to distinct signal properties: spatial layout, temporal resolution, signal-to-noise ratio, and biological context. This fragmentation has impeded the transfer of progress and inductive biases across communities, resulting in restrictive task vocabularies and slow advances. In contrast, large-scale, instruction-tuned foundation models in NLP (e.g., LLMs), computer vision, and speech have demonstrated the power of unified architectures capable of handling diverse modalities through flexible tokenization.
NeuGPT responds to this gap with a two-stage framework:
- Neural Signal Tokenizer (NeuTokenizer): An auto-encoder enhanced by a residual vector quantizer (RVQ) and adversarial discriminator, which converts raw time-series neural data into discrete codebook indices and reconstructs signals from these codes.
- Multi-Modal LLM (NeuGPT): A transformer-based generative model (QWEN2-1.5B) extended to process tokens from neural recordings, speech, and text, using modality-specific embeddings and special markers for segmentation, enabling cross-modal translation and integration.
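A minimal PyTorch sketch of this two-stage flow is shown below. The class and variable names, the channel count, and all dimensions are illustrative assumptions for exposition, not the released implementation.

```python
# Illustrative two-stage flow (hypothetical names/dimensions, not the
# released code): Stage 1 discretizes the signal, Stage 2 consumes the codes.
import torch

class ToyNeuTokenizer(torch.nn.Module):
    """Stand-in for NeuTokenizer: encode, then nearest-codeword lookup."""
    def __init__(self, n_channels=208, dim=128, codebook_size=8192):
        super().__init__()
        self.encoder = torch.nn.Conv1d(n_channels, dim, kernel_size=8, stride=4)
        self.codebook = torch.nn.Embedding(codebook_size, dim)

    def encode(self, x):                      # x: (batch, channels, time)
        z = self.encoder(x).transpose(1, 2)   # (batch, frames, dim)
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0))
        return dists.argmin(-1)               # (batch, frames) discrete indices

tokenizer = ToyNeuTokenizer()
meg = torch.randn(1, 208, 4000)               # fake MEG segment (208 ch assumed)
neural_codes = tokenizer.encode(meg)          # Stage 1 output
# Stage 2: neural_codes are wrapped in span markers and concatenated with
# text/speech tokens before being fed to the transformer LLM.
```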
2. Model Architecture
NeuGPT employs a transformer decoder as its backbone, operating over a mixed token stream comprising text, neural codes, and speech tokens. Each input token is mapped through a modality-specific learned embedding, where $E_{\text{text}}$, $E_{\text{neural}}$, and $E_{\text{speech}}$ denote the respective embedding matrices. The transformer layer update equations are:

$$h^{(l)} = \mathrm{LN}\left(x^{(l)} + \mathrm{MHA}\left(x^{(l)}\right)\right), \qquad x^{(l+1)} = \mathrm{LN}\left(h^{(l)} + \mathrm{FFN}\left(h^{(l)}\right)\right)$$
Tokens are combined with positional encodings and passed through layers consisting of multi-head attention, feed-forward networks, residual connections, and layer normalization. No dedicated fusion modules are required; modality integration emerges through transformer cross-attention over the joint token stream.
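The sketch below illustrates this mixed-stream forward pass with per-modality embedding tables and a causally masked self-attention layer. The hidden width (taken as 1536 for QWEN2-1.5B) and the vocabulary sizes are assumptions for illustration.

```python
# Mixed-stream forward pass (assumed sizes; not the authors' exact code).
import torch
import torch.nn as nn

d_model = 1536                                     # assumed QWEN2-1.5B width
embed = {"text":   nn.Embedding(32_000, d_model),  # vocab sizes illustrative
         "neural": nn.Embedding(8_192,  d_model),
         "speech": nn.Embedding(1_000,  d_model)}

layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)

# Embed each modality with its own table, then concatenate into one stream.
x = torch.cat([embed["neural"](torch.randint(0, 8_192, (1, 16))),
               embed["speech"](torch.randint(0, 1_000, (1, 8)))], dim=1)
mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
h = layer(x, src_mask=mask)   # causal self-attention: a decoder-only block
```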
3. Training Procedures and Loss Functions
NeuGPT’s training is structured into two stages:
Stage 1 (NeuTokenizer):
- Reconstruction loss (time-domain): $\mathcal{L}_{\text{rec}} = \lVert x - \hat{x} \rVert_1$
- Multi-scale STFT (spectral) loss: $\mathcal{L}_{\text{stft}} = \sum_{i} \left\lVert \lvert \mathrm{STFT}_i(x) \rvert - \lvert \mathrm{STFT}_i(\hat{x}) \rvert \right\rVert_1$, summed over window scales $i$
- RVQ commitment loss: $\mathcal{L}_{\text{commit}} = \lVert z - \mathrm{sg}[z_q] \rVert_2^2$, where $\mathrm{sg}[\cdot]$ is the stop-gradient operator
- Adversarial discriminator hinge loss ($\mathcal{L}_{\text{adv}}$) and gradient penalty ($\mathcal{L}_{\text{gp}}$)
- Feature-matching loss ($\mathcal{L}_{\text{feat}}$)
- Overall generator loss: $\mathcal{L}_G = \lambda_{\text{rec}} \mathcal{L}_{\text{rec}} + \lambda_{\text{stft}} \mathcal{L}_{\text{stft}} + \lambda_{\text{commit}} \mathcal{L}_{\text{commit}} + \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{feat}} \mathcal{L}_{\text{feat}}$
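A hedged sketch of how these Stage-1 terms might be combined is given below; the loss weights, STFT scales, and hinge formulation follow common neural-codec practice (EnCodec-style recipes) rather than the paper's exact values.

```python
# Combining the Stage-1 generator terms (illustrative weights and scales).
import torch
import torch.nn.functional as F

def multiscale_stft_loss(x, x_hat, scales=(512, 1024, 2048)):
    """L1 on magnitudes plus L2 on log-magnitudes at several resolutions."""
    loss = x.new_zeros(())
    for n_fft in scales:
        win = torch.hann_window(n_fft)
        S  = torch.stft(x,     n_fft, window=win, return_complex=True).abs()
        Sh = torch.stft(x_hat, n_fft, window=win, return_complex=True).abs()
        loss = loss + F.l1_loss(Sh, S) + F.mse_loss(
            Sh.clamp(min=1e-5).log(), S.clamp(min=1e-5).log())
    return loss

def generator_loss(x, x_hat, z, z_q, d_fake, feats_real, feats_fake,
                   lam=(1.0, 1.0, 0.25, 1.0, 1.0)):
    l_rec    = F.l1_loss(x_hat, x)                 # time-domain L1
    l_stft   = multiscale_stft_loss(x, x_hat)      # multi-scale spectral
    l_commit = F.mse_loss(z, z_q.detach())         # RVQ commitment
    l_adv    = torch.relu(1.0 - d_fake).mean()     # hinge (generator side)
    l_feat   = sum(F.l1_loss(a, b.detach())        # match discriminator features
                   for a, b in zip(feats_fake, feats_real))
    return (lam[0] * l_rec + lam[1] * l_stft + lam[2] * l_commit
            + lam[3] * l_adv + lam[4] * l_feat)
```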
Stage 2 (Multi-Modal LLM):
- Cross-entropy loss over the mixed token stream: $\mathcal{L}_{\text{CE}} = -\sum_{t} \log p_\theta\left(u_t \mid u_{<t}\right)$, where $u_t$ ranges over text, speech, and neural tokens
No explicit contrastive loss is used; alignment between modalities emerges naturally from instruction-tuning on all conversion pairs among text, speech, and neural tokens.
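In code, this objective is the standard next-token cross-entropy, applied uniformly to the shifted mixed stream (a minimal sketch with illustrative tensor shapes):

```python
# Next-token cross-entropy over the interleaved token stream.
import torch.nn.functional as F

def lm_loss(logits, tokens):
    # logits: (batch, seq, vocab); tokens: (batch, seq) mixed-modality ids
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```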
4. Multi-Modal Tokenization and Integration
The encoding and merging procedures are as follows:
- Neural Encoding (MEG): Raw MEG signals of shape $C \times T$ (channels × time samples) are processed through a SEANet-style encoder to yield latent embeddings. These are quantized via RVQ (codebook size 8192) to discrete indices $q_{c,t} \in \{1, \dots, 8192\}$. Channel indices are grouped per quantized time step and prefixed with <nts> tokens.
- Speech Encoding: Hidden-unit BERT (HuBERT) generates discrete speech codes from the audio. HiFi-GAN is used for vocoder-based reconstruction.
- Token Merging: Mixed-modality codes are interleaved in a single token sequence. Start/end markers (<soeg>, <eoeg>, <sosp>, <eosp>) denote neural and speech spans respectively. The transformer attends jointly to text, speech, and neural codes.
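A minimal sketch of this assembly step is shown below; the marker inventory and per-step grouping follow the description above, but the numeric ids are illustrative assumptions.

```python
# Interleave text, neural, and speech ids with span markers (ids illustrative).
SPECIAL = {"<soeg>": 0, "<eoeg>": 1, "<sosp>": 2, "<eosp>": 3, "<nts>": 4}

def merge(text_ids, neural_steps, speech_ids):
    seq = list(text_ids) + [SPECIAL["<soeg>"]]
    for step in neural_steps:                 # channel codes for one time step
        seq += [SPECIAL["<nts>"]] + list(step)
    seq += [SPECIAL["<eoeg>"], SPECIAL["<sosp>"]] + list(speech_ids)
    return seq + [SPECIAL["<eosp>"]]

merged = merge([10, 11], [[100, 101], [102, 103]], [200, 201])
# -> [10, 11, <soeg>, <nts>, 100, 101, <nts>, 102, 103, <eoeg>,
#     <sosp>, 200, 201, <eosp>]
```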
5. Performance Analysis
Evaluation on the MEG-MASC dataset (27 subjects, 4 stories) using MAD’s split protocol demonstrated substantial performance gains in the brain-to-text conversion task:
| Method | BLEU-1 (%) | ROUGE-1F (%) | BERTScore (%) | CER (%) |
|---|---|---|---|---|
| Random-select | 5.86 | 7.20 | 83.73 | 87.30 |
| NeuSpeech (2024) | 5.49 | 8.43 | 83.98 | 77.02 |
| MAD (2024) | 6.94 | 6.93 | 83.39 | 89.82 |
| NeuGPT | 12.92 | 13.06 | 83.62 | 99.80 |
NeuGPT nearly doubles MAD's BLEU-1 (6.94 → 12.92) and ROUGE-1F (6.93 → 13.06), the prior state of the art (Yang et al., 2024). BERTScore remains competitive; CER is higher, plausibly because open-vocabulary prediction admits wording outside the reference transcripts. On the tokenizer side, temporal L1 reconstruction yields near-perfect waveform fidelity, and multi-scale STFT reconstructions demonstrate accurate magnitude and phase recovery.
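For reference, the text metrics above can be computed with standard libraries; the snippet below is a hedged sketch using nltk and rouge-score on a toy sentence pair, not the authors' evaluation script.

```python
# Computing BLEU-1 and ROUGE-1F with common libraries (toy example).
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

ref, hyp = "he walked to the store", "he ran to the store"
bleu1 = sentence_bleu([ref.split()], hyp.split(), weights=(1, 0, 0, 0))
rouge1_f = rouge_scorer.RougeScorer(["rouge1"]).score(ref, hyp)["rouge1"].fmeasure
print(f"BLEU-1 = {bleu1:.3f}, ROUGE-1F = {rouge1_f:.3f}")
```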
6. Ablation Studies and System Component Analysis
A series of ablations highlights the necessity of NeuGPT's design choices:
- Removing adversarial losses from NeuTokenizer degrades both reconstruction fidelity and transformer training convergence.
- Using a single quantizer instead of RVQ reduces tokenizer reconstruction accuracy by more than 20% and costs roughly 2 BLEU-1 points.
- Omitting channel/time tags (<nts>) hampers spatial localization across MEG channels, reducing BLEU by ~1.5 points.
This suggests that discrete neural coding and structured tokenization (including explicit channel/time tagging) are essential for optimal cross-modal generalization.
7. Applications and Prospective Extensions
NeuGPT opens several new research and application directions:
- Real-time brain-to-text decoding: Enables pipelined inference from wearable MEG/EEG, potentially streaming covert neural activity as text.
- Neural signal simulation: Allows generation of synthetic MEG waveforms conditional on naturalistic prompts (“Generate the MEG codes for hearing a thunderstorm”), with full time-series reconstruction for BCI and modeling use-cases.
- Multi-modality expansion: Future releases are planned to add ECoG and fMRI tokenizers, vision tokenizers, and support for broader neuroscience tasks (sleep-stage detection, motor-intent decoding, closed-loop neurofeedback).
NeuGPT establishes a foundation for integrated multi-modal neuroscience and advanced BCI, unifying the process of “reading” and “writing” neural activity within a single scalable architecture (Yang et al., 2024).