Brain-Conditional Multimodal Synthesis: A Survey and Taxonomy (2401.00430v2)

Published 31 Dec 2023 in cs.AI

Abstract: In the era of Artificial Intelligence Generated Content (AIGC), conditional multimodal synthesis technologies (e.g., text-to-image, text-to-video, text-to-audio) are gradually reshaping natural content in the real world. The key to multimodal synthesis technology is to establish the mapping relationship between different modalities. Brain signals, serving as potential reflections of how the brain interprets external information, exhibit a distinctive one-to-many correspondence with various external modalities. This correspondence makes brain signals emerge as a promising guiding condition for multimodal content synthesis. Brain-conditional multimodal synthesis refers to decoding brain signals back into perceptual experience, which is crucial for developing practical brain-computer interface systems and unraveling the complex mechanisms underlying how the brain perceives and comprehends external stimuli. This survey comprehensively examines the emerging field of AIGC-based Brain-conditional Multimodal Synthesis, termed AIGC-Brain, to delineate the current landscape and future directions. To begin, related brain neuroimaging datasets, functional brain regions, and mainstream generative models are introduced as the foundation of AIGC-Brain decoding and analysis. Next, we provide a comprehensive taxonomy for AIGC-Brain decoding models and present task-specific representative work and detailed implementation strategies to facilitate comparison and in-depth analysis. Quality assessments are then introduced for both qualitative and quantitative evaluation. Finally, this survey explores the insights gained, presenting current challenges and outlining the prospects of AIGC-Brain. As the inaugural survey in this domain, this paper paves the way for progress in AIGC-Brain research, offering a foundational overview to guide future work.

Authors (4)

  1. Weijian Mai
  2. Jian Zhang
  3. Pengfei Fang
  4. Zhijun Zhang

Citations (6)

Summary

  • The paper introduces a comprehensive survey and taxonomy of methods mapping brain signals to generative models for multimodal synthesis.
  • It details the use of neuroimaging techniques and AI models such as VAEs, GANs, and latent diffusion for decoding perceptual stimuli.
  • The study highlights challenges and future directions, emphasizing improved fidelity, interpretability, and real-time brain-computer interfacing.

Neuroimaging and AI: Deciphering the Brain's Perception for Multimodal Content Synthesis

Overview of Multimodal Content Synthesis

Recent developments in neuroscientific research and AI have opened unprecedented opportunities for exploring the relationship between brain activity and the perception of diverse stimuli, such as images, videos, and audio. Multimodal synthesis is an evolving field that aims to decode the complex mapping between brain activity and various forms of external stimuli. Brain-conditional multimodal synthesis offers potential breakthroughs both in developing practical brain-computer interface (BCI) systems and in unraveling the cognitive mechanisms that underlie perception.

Neuroimaging Data and Brain Regions

Neuroimaging technologies such as fMRI, EEG, and MEG provide a window into the brain's intricate neural activity by capturing blood-oxygenation, electrical, and magnetic signals, respectively. Each technology offers distinct trade-offs between spatial and temporal resolution: fMRI localizes activity precisely but samples slowly, whereas EEG and MEG capture millisecond dynamics with coarser spatial detail. Understanding these datasets is crucial for deciphering the functions and interactions of different brain regions, which in turn illuminates the complex processes of perception.
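To make these trade-offs concrete, the following minimal NumPy sketch shows how such recordings are typically arrayed for decoding; all shapes, sampling rates, and trial-aggregation choices are hypothetical, chosen only for illustration.

```python
# Illustrative array layouts for the neuroimaging signals discussed above.
# Shapes are hypothetical examples, not taken from any dataset in the survey.
import numpy as np

# fMRI: high spatial, low temporal resolution -> voxels sampled every 1-2 s (TR).
fmri = np.random.randn(120, 15000)        # (time points / TRs, flattened cortical voxels)

# EEG: low spatial, high temporal resolution -> channels sampled at hundreds of Hz.
eeg = np.random.randn(64, 2500)           # (electrode channels, time samples)

# Decoding pipelines usually reduce each trial to one feature vector per stimulus.
fmri_features = fmri.mean(axis=0)         # e.g., average response over the trial window
eeg_features = eeg.reshape(-1)            # or concatenate channels x time
print(fmri_features.shape, eeg_features.shape)  # (15000,) (160000,)
```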

Moreover, identifying key brain regions involved in the processing of auditory, visual, and language information enables researchers to pinpoint the neural basis of perception. Regions such as the visual cortex, auditory cortex, and language-related areas in the frontal lobe play prominent roles in these perceptive tasks.

Generative Models in AI

Generative models have made significant strides, spanning deterministic autoencoders (AEs), probabilistic models such as variational autoencoders (VAEs), autoregressive models (AMs), and generative adversarial networks (GANs). Their applications stretch across image, audio, and text synthesis. Conditional generative models add a further dimension by injecting conditioning information into the generative process.
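As a concrete reference for one of these families, here is a minimal VAE sketch in PyTorch; the layer sizes and input dimensionality are illustrative assumptions, not drawn from the survey.

```python
# Minimal VAE sketch (PyTorch). Dimensions are illustrative only.
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # KL(q(z|x) || N(0, I))
    return recon + kl
```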

Latent Diffusion Models (LDMs) are particularly notable for their ability to generate high-quality images by integrating conditions into the denoising process. ControlNet and Versatile Diffusion stand out for their multimodal generation capabilities, leveraging guidance from paired text and images.
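The schematic sampling loop below illustrates, under simplifying assumptions, how a condition (here standing in for a brain-derived embedding) enters every denoising step. `denoiser` is a hypothetical noise-prediction network, and the DDPM-style schedule is deliberately bare-bones; real LDMs operate in a VAE latent space with more sophisticated samplers.

```python
# Schematic DDPM-style conditional sampling loop. `denoiser` is a hypothetical
# network taking (x_t, t, cond); `cond` could be a brain-derived embedding.
import torch

def sample(denoiser, cond, shape=(1, 4, 64, 64), T=1000):
    betas = torch.linspace(1e-4, 0.02, T)          # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                         # start from pure Gaussian noise
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)                 # noise prediction, conditioned on cond
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt() # posterior mean of x_{t-1}
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise
    return x                                       # decoded latent (for an LDM) or image
```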

Methodology Taxonomy

The methodologies in brain-conditional multimodal content synthesis can be categorized into six distinct types based on their implementation architecture:

  1. Mapping Brain to Prior Information: Maps brain signals to semantic or detail priors consumed by pre-trained generative models (see the sketch after this list).
  2. Brain-Pretrain and Mapping: A two-step process of pre-training on brain signals and then mapping to priors.
  3. Brain-Pretrain, Finetune, and Align: Another two-step approach, emphasizing alignment of priors with pre-trained models followed by fine-tuning.
  4. Map, Train, and Finetune: Connects brain signals, priors, and stimuli, followed by training or fine-tuning the generative architecture.
  5. End-to-End: Directly maps brain signals to stimuli through a conventional training process.
  6. Autoencoder-Based Aligning: Aligns brain signals with stimuli via deterministic autoencoders.

These methods involve different trade-offs in training complexity, flexibility, data requirements, and interpretability.
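As an illustration of the first type, the sketch below maps fMRI features to a CLIP-like semantic embedding with ridge regression; the predicted embedding would then condition a frozen pre-trained generator. All data, shapes, and hyperparameters here are hypothetical.

```python
# Sketch of taxonomy type 1 ("Mapping Brain to Prior Information"): a linear
# ridge model maps fMRI features to a CLIP-like embedding that conditions a
# frozen pre-trained generator. Shapes and data are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import Ridge

n_train, n_voxels, emb_dim = 800, 5000, 768
X = np.random.randn(n_train, n_voxels)   # fMRI responses, one row per training stimulus
Y = np.random.randn(n_train, emb_dim)    # target embeddings of the same stimuli

mapper = Ridge(alpha=1e4)                # heavy regularization: voxels >> samples
mapper.fit(X, Y)

x_test = np.random.randn(1, n_voxels)
semantic_prior = mapper.predict(x_test)  # fed as conditioning to the generator
print(semantic_prior.shape)              # (1, 768)
```

Linear mappers with strong regularization are popular in this role because voxel counts typically far exceed the number of training stimuli.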

Tasks and Implementation Strategies

AIGC-Brain tasks leverage different methods and technologies. For example, Image-Brain-Image (IBI) tasks make extensive use of image-to-image latent diffusion models (I2I-LDMs) that integrate detail priors and semantic conditions for image synthesis. In the Video-Brain-Video (VBV) domain, augmented diffusion models improve video reconstruction from brain activity. Similarly, Sound-Brain-Sound (SBS) tasks see models such as BSR employ autoregressive transformers to generate sound from brain signals.

Text-based tasks, namely Image-, Video-, and Speech-Brain-Text (IBT, VBT, SBT), use autoregressive models to decode brain signals into linguistic descriptions. Multimodal tasks, in turn, are advancing towards consolidated models capable of understanding and generating content across different modalities.
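One plausible realization of such text decoding, sketched below, projects brain features into an autoregressive language model's embedding space as prefix pseudo-tokens; `BrainPrefix`, the dimensions, and the HuggingFace-style `generate` call are all assumptions for illustration, not a specific method from the survey.

```python
# Schematic brain-to-text decoding: project brain features into a language
# model's embedding space as "prefix" tokens, then let the LM generate text.
# The module and all dimensions are hypothetical.
import torch
import torch.nn as nn

class BrainPrefix(nn.Module):
    def __init__(self, brain_dim=5000, lm_dim=768, prefix_len=8):
        super().__init__()
        self.proj = nn.Linear(brain_dim, lm_dim * prefix_len)
        self.prefix_len, self.lm_dim = prefix_len, lm_dim

    def forward(self, brain_feats):                      # (batch, brain_dim)
        p = self.proj(brain_feats)
        return p.view(-1, self.prefix_len, self.lm_dim)  # pseudo-token embeddings

# Generation would then proceed autoregressively from the prefix, e.g. (assumed API):
#   prefix = BrainPrefix()(fmri_features)
#   text_ids = lm.generate(inputs_embeds=prefix)
```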

Quality Assessment and Insights

Quality assessments are indispensable for evaluating synthesis results both qualitatively and quantitatively. Qualitative assessments show what is achievable in reconstructing perception from brain signals, while quantitative metrics offer a more objective measure of model performance. Metrics are tailored to different feature levels, from low-level details such as pixel-wise correlation to high-level semantic fidelity measured with CLIP embeddings. These assessments drive progress by highlighting areas for improvement and guiding new model development.
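The sketch below illustrates both metric levels under simple assumptions: pixel-wise correlation and SSIM for low-level fidelity, and cosine similarity between precomputed CLIP embeddings for semantic fidelity. The random arrays stand in for real stimuli, reconstructions, and embeddings.

```python
# Low-level metrics (pixel correlation, SSIM) vs. a high-level semantic metric
# (cosine similarity of precomputed CLIP embeddings). Data are random stand-ins.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def pixcorr(a, b):
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

def clip_similarity(emb_a, emb_b):        # embeddings assumed precomputed elsewhere
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

gt = np.random.rand(256, 256)             # ground-truth stimulus (grayscale)
rec = np.random.rand(256, 256)            # reconstruction decoded from brain activity
print(pixcorr(gt, rec), ssim(gt, rec, data_range=1.0))
print(clip_similarity(np.random.randn(512), np.random.randn(512)))
```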

Future Directions

The field is approaching several significant challenges:

  • Data Variability: The acquisition of higher quality, large-scale neuroimaging datasets is essential.
  • Fidelity: Improving semantic and detail accuracy in content synthesis is crucial.
  • Flexibility: Enhancing model adaptability to various datasets and tasks will promote generalization.
  • Interpretability: Understanding neural processing during decoding enriches our comprehension of cognition.
  • Real-time: Advancements in real-time decoding are vital for BCI systems.
  • Multimodality: Developing unified models for brain-to-any multimodal generation is an upcoming frontier.

Together, these technical foundations and open challenges chart a course toward a deeper understanding of brain function and the potential of AI-assisted brain-signal decoding.
