
Reverse the auditory processing pathway: Coarse-to-fine audio reconstruction from fMRI (2405.18726v1)

Published 29 May 2024 in cs.SD, cs.CV, cs.MM, and eess.AS

Abstract: Drawing inspiration from the hierarchical processing of the human auditory system, which transforms sound from low-level acoustic features to high-level semantic understanding, we introduce a novel coarse-to-fine audio reconstruction method. Leveraging non-invasive functional Magnetic Resonance Imaging (fMRI) data, our approach mimics the inverse pathway of auditory processing. Initially, we utilize CLAP to decode fMRI data coarsely into a low-dimensional semantic space, followed by a fine-grained decoding into the high-dimensional AudioMAE latent space guided by semantic features. These fine-grained neural features serve as conditions for audio reconstruction through a Latent Diffusion Model (LDM). Validation on three public fMRI datasets (Brain2Sound, Brain2Music, and Brain2Speech) underscores the superiority of our coarse-to-fine decoding method over stand-alone fine-grained approaches, showcasing state-of-the-art performance in metrics like FD, FAD, and KL. Moreover, by employing semantic prompts during decoding, we enhance the quality of reconstructed audio when semantic features are suboptimal. The demonstrated versatility of our model across diverse stimuli highlights its potential as a universal brain-to-audio framework. This research contributes to the comprehension of the human auditory system, pushing boundaries in neural decoding and audio reconstruction methodologies.
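The two-stage decoding described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the decoders here are placeholder linear maps with random weights, and the dimensions for the fMRI voxel pattern, the CLAP semantic space, and the AudioMAE latent space are assumptions chosen for readability. The key structure it shows is that the fine-grained decode is conditioned on the coarse semantic decode, rather than mapping fMRI to the fine space in one shot.

```python
import numpy as np

# Hypothetical dimensions (not from the paper): fMRI voxel count,
# CLAP semantic-space size, AudioMAE latent-space size.
N_VOXELS, D_SEM, D_FINE = 2048, 512, 768

rng = np.random.default_rng(0)

# Stage 1 (coarse): a linear map from fMRI voxels into the
# low-dimensional CLAP semantic space.
W_coarse = rng.normal(scale=0.01, size=(N_VOXELS, D_SEM))

# Stage 2 (fine): decode into the high-dimensional AudioMAE latent
# space, with an extra term that injects the stage-1 semantics.
W_fine_fmri = rng.normal(scale=0.01, size=(N_VOXELS, D_FINE))
W_fine_sem = rng.normal(scale=0.01, size=(D_SEM, D_FINE))

def decode_coarse_to_fine(fmri: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map one fMRI pattern to (semantic, fine-grained) neural features."""
    semantic = fmri @ W_coarse                          # coarse CLAP-space decode
    fine = fmri @ W_fine_fmri + semantic @ W_fine_sem   # semantics-guided decode
    return semantic, fine

fmri = rng.normal(size=(N_VOXELS,))
semantic, fine = decode_coarse_to_fine(fmri)
print(semantic.shape, fine.shape)  # (512,) (768,)
```

In the paper's full pipeline, the fine-grained features would then condition a latent diffusion model that generates the audio; that generative stage is omitted from this sketch.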
