Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals (2405.11459v3)

Published 19 May 2024 in eess.SP, cs.CL, and q-bio.NC

Abstract: Invasive brain-computer interfaces with electrocorticography (ECoG) have shown promise for high-performance speech decoding in medical applications, but less-damaging methods like intracranial stereo-electroencephalography (sEEG) remain underexplored. With rapid advances in representation learning, leveraging abundant recordings to enhance speech decoding is increasingly attractive. However, popular methods often pre-train temporal models on brain-level tokens, overlooking that brain activities in different regions are highly desynchronized during tasks. Alternatively, they pre-train spatial-temporal models on channel-level tokens but fail to evaluate them on challenging tasks like speech decoding, which requires intricate processing in specific language-related areas. To address this issue, we collected a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks from 12 subjects. Using this benchmark, we developed the Du-IN model, which extracts contextual embeddings from region-level tokens through discrete codex-guided mask modeling. Our model achieves state-of-the-art performance on the 61-word classification task, surpassing all baselines. Model comparisons and ablation studies reveal that our design choices, including (i) temporal modeling on region-level tokens, using 1D depthwise convolution to fuse channels in the ventral sensorimotor cortex (vSMC) and superior temporal gyrus (STG), and (ii) self-supervision through discrete codex-guided mask modeling, significantly contribute to this performance. Overall, our approach, inspired by neuroscience findings and capitalizing on region-level representations from specific brain regions, is suitable for invasive brain modeling and represents a promising neuro-inspired AI approach in brain-computer interfaces.
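
The two design choices highlighted in the abstract lend themselves to a compact illustration. Below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of (i) fusing the channels of one language-related region into region-level tokens with a 1D depthwise convolution followed by a pointwise projection, and (ii) a codex-guided masked-modeling objective in which a small Transformer predicts discrete code indices for masked tokens. All module names, layer sizes, patch lengths, and the stand-in random code targets are illustrative assumptions.

```python
# Hypothetical sketch of region-level tokenization + codex-guided mask modeling.
# Shapes, layer sizes, and the random code targets are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionTokenizer(nn.Module):
    """Fuse the C channels of one brain region into D-dim tokens per time patch."""

    def __init__(self, n_channels: int, d_model: int, patch_len: int):
        super().__init__()
        # Depthwise conv: each sEEG channel is filtered independently over time,
        # one token per non-overlapping patch of length `patch_len`.
        self.depthwise = nn.Conv1d(
            n_channels, n_channels, kernel_size=patch_len,
            stride=patch_len, groups=n_channels,
        )
        # Pointwise projection fuses channels into a region-level embedding.
        self.pointwise = nn.Conv1d(n_channels, d_model, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_channels, time) -> tokens: (batch, n_tokens, d_model)
        h = self.pointwise(self.depthwise(x))
        return h.transpose(1, 2)


class MaskedCodePredictor(nn.Module):
    """Predict discrete code indices for masked region-level tokens."""

    def __init__(self, d_model: int, codebook_size: int, n_layers: int = 2):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, tokens: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, D); mask: (B, T) bool, True = masked position.
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return self.head(self.encoder(x))  # (B, T, codebook_size)


if __name__ == "__main__":
    B, C, T = 4, 16, 1000              # batch, region channels, samples (illustrative)
    d_model, patch_len, K = 64, 100, 512
    seeg = torch.randn(B, C, T)

    tokens = RegionTokenizer(C, d_model, patch_len)(seeg)   # (B, 10, 64)
    mask = torch.rand(B, tokens.size(1)) < 0.5               # random mask
    logits = MaskedCodePredictor(d_model, K)(tokens, mask)

    # In a real pipeline the targets would come from a pre-trained vector-quantized
    # codex; random indices stand in here so the loss runs end to end.
    targets = torch.randint(0, K, (B, tokens.size(1)))
    loss = F.cross_entropy(logits[mask], targets[mask])
    print(tokens.shape, logits.shape, loss.item())
```

The depthwise/pointwise split keeps per-channel temporal filtering separate from cross-channel fusion, which mirrors the region-level tokenization the abstract describes; how the codex itself is learned is not shown here.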
