IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation (2308.08143v3)
Abstract: Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, these modules predominantly fuse auditory and visual features at a single temporal scale and without selective attention mechanisms, in sharp contrast with how the brain operates. To address this issue, we propose a novel model called the Intra- and Inter-Attention Network (IIANet), which leverages attention mechanisms for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, with the InterA blocks distributed at the top, middle, and bottom of IIANet. Inspired by the way the human brain selectively focuses on relevant content at various temporal scales, these blocks retain the ability to learn modality-specific features while extracting different semantics from the audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet: it outperforms previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version, IIANet-fast, requires only 7% of CTCNet's MACs and runs 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanisms for efficient and effective multimodal fusion.
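As a rough illustration of the gating pattern the abstract describes, the sketch below pairs an intra-attention block (a modality attends to itself) with an inter-attention block (visual features modulate audio features after being resampled to the audio frame rate). This is a minimal PyTorch sketch under our own assumptions; the class names, the sigmoid-gating choice, and the tensor shapes are illustrative and are not taken from the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): sigmoid-gated
# intra-modality attention plus cross-modal gating of audio by video.
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntraA(nn.Module):
    """Intra-attention (assumed form): gate a modality's features with an
    attention map computed from that same modality."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        return x * torch.sigmoid(self.gate(x))


class InterA(nn.Module):
    """Inter-attention (assumed form): modulate audio features with an
    attention map derived from the visual features."""
    def __init__(self, audio_ch: int, video_ch: int):
        super().__init__()
        self.proj = nn.Conv1d(video_ch, audio_ch, kernel_size=1)

    def forward(self, a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # a: (B, Ca, Ta) audio features; v: (B, Cv, Tv) video features.
        # Video runs at a lower frame rate, so resample it to the audio rate.
        v = F.interpolate(self.proj(v), size=a.shape[-1], mode="nearest")
        return a * torch.sigmoid(v)


# Toy usage: 100 audio frames vs. 25 video frames (e.g., 25 fps lip features).
a = torch.randn(2, 512, 100)          # hypothetical audio feature map
v = torch.randn(2, 64, 25)            # hypothetical lip-embedding sequence
intra, inter = IntraA(512), InterA(512, 64)
fused = inter(intra(a), v)
print(fused.shape)                    # torch.Size([2, 512, 100])
```

In the full model, blocks of this kind would presumably be replicated across the multiple temporal scales of the audio and video feature pyramids, which is where the top/middle/bottom placement of the InterA blocks comes in.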
- Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(12):8717–8727, 2022.
- LRS3-TED: a large-scale dataset for visual speech recognition, 2018.
- Self-supervised learning of audio-visual objects from video. In Proceedings of the European Conference on Computer Vision, pp. 208–224. Springer, 2020.
- The conversation: deep audio-visual speech enhancement. In Interspeech, 2018.
- Anatomical origins of the classical receptive field and modulatory surround field of single neurons in macaque visual cortical area V1. Progress in Brain Research, 136:373–388, 2002.
- Arons, B. A review of the cocktail party effect. Journal of the American Voice I/O Society, 12(7):35–50, 1992.
- SoundNet: Learning sound representations from unlabeled video. In Advances in Neural Information Processing Systems, volume 29, 2016.
- Bar, M. The proactive brain: using analogies and associations to generate predictions. Trends in Cognitive Sciences, 11(7):280–289, 2007.
- Non-sensory cortical and subcortical connections of the primary auditory cortex in Mongolian gerbils: bottom-up and top-down processing of neuronal information via field AI. Brain Research, 1220:2–32, 2008.
- Distinct anatomical connectivity patterns differentiate subdivisions of the nonlemniscal auditory thalamus in mice. Cerebral Cortex, 29(6):2437–2454, 2019.
- Calvert, G. A. Crossmodal processing in the human brain: insights from functional neuroimaging studies. Cerebral Cortex, 11(12):1110–1123, 2001.
- A neural state-space model approach to efficient speech separation. In Interspeech, 2023.
- Cherry, E. C. Some experiments on the recognition of speech, with one and with two ears. The Journal of the Acoustical Society of America, 25(5):975–979, 1953.
- VoxCeleb2: Deep speaker recognition, 2018.
- The auditory organization of speech and other sources in listeners and computational models. Speech Communication, 35(3-4):141–177, 2001.
- Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3):201–215, 2002.
- A cross-modal system linking primary auditory and visual cortices: Evidence from intrinsic fMRI connectivity analysis. Human Brain Mapping, 29(7):848–857, 2008.
- Anatomical evidence of multimodal integration in primate striate cortex. Journal of Neuroscience, 22(13):5749–5759, 2002.
- Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.
- Fukushima, K. A neural network model for selective attention in visual pattern recognition. Biological Cybernetics, 55(1):5–15, 1986.
- Visualvoice: Audio-visual speech separation with cross-modal consistency. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15490–15500. IEEE, 2021.
- Is neocortex essentially multisensory? Trends in Cognitive Sciences, 10(6):278–285, 2006.
- Mechanisms underlying selective neuronal tracking of attended speech at a “cocktail party”. Neuron, 77(5):980–991, 2013.
- Guinan Jr, J. J. Olivocochlear efferents: anatomy, physiology, function, and the measurement of efferent effects in humans. Ear and Hearing, 27(6):589–607, 2006.
- Medial auditory thalamic nuclei are necessary for eyeblink conditioning. Behavioral Neuroscience, 120(4):880, 2006.
- Deep clustering: Discriminative embeddings for segmentation and separation. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 31–35. IEEE, 2016.
- Speech separation using an asynchronous fully recurrent convolutional neural network. In Advances in Neural Information Processing Systems, volume 34, pp. 22509–22522, 2021.
- Modelling auditory attention. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1714):20160101, 2017.
- On the variability of the McGurk effect: audiovisual integration depends on prestimulus brain states. Cerebral Cortex, 22(1):221–231, 2012.
- Adam: A method for stochastic optimization, 2014.
- Attention and consciousness: two distinct brain processes. Trends in Cognitive Sciences, 11(1):16–22, 2007.
- Inferring mechanisms of auditory attentional modulation with deep neural networks. Neural Computation, 34(11):2273–2293, 2022.
- SDR–half-baked or well done? In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 626–630. IEEE, 2019.
- Looking into your speech: Learning cross-modal affinity for audio-visual speech separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1336–1345. IEEE, 2021.
- An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits, 2022.
- An efficient encoder-decoder architecture with top-down attention for speech separation. In International Conference on Learning Representations, 2023.
- Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27(8):1256–1266, 2019.
- Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 46–50. IEEE, 2020.
- Lip-reading with densely connected temporal convolutional networks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2857–2866, 2021.
- Audio-visual speech separation in noisy environments with a lightweight iterative model. In Interspeech. ISCA, 2023.
- Lipreading using temporal convolutional networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6319–6323. IEEE, 2020.
- Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397):233–236, 2012.
- Functional response properties of VIP-expressing inhibitory neurons in mouse visual and auditory cortex. Frontiers in Neural Circuits, 9:22, 2015.
- VoViT: low latency graph-based audio-visual voice separation transformer. In Proceedings of the European Conference on Computer Vision, pp. 310–326. Springer, 2022.
- Attentional selection in a cocktail party environment can be decoded from single-trial EEG. Cerebral Cortex, 25(7):1697–1706, 2015.
- PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32, 2019.
- The attention system of the human brain. Annual Review of Neuroscience, 13(1):25–42, 1990.
- Reading to listen at the cocktail party: multi-modal speech separation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10493–10502. IEEE, 2022.
- Audiovisual integration of letters in the human brain. Neuron, 28(2):617–625, 2000.
- Rensink, R. A. The dynamic representation of scenes. Visual Cognition, 7(1-3):17–42, 2000.
- U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Springer, 2015.
- Audio-visual speech and gesture recognition by sensors of mobile devices. Sensors, 23(4):2284, 2023.
- Schneider, W. X. Selective visual processing across competition episodes: A theory of task-driven visual attention and working memory. Philosophical Transactions of the Royal Society B: Biological Sciences, 368(1628):20130060, 2013.
- Listen, think and listen again: Capturing top-down auditory attention for speaker-independent speech separation. In IJCAI, pp. 4353–4360, 2018.
- Multisensory integration: current issues from the perspective of the single neuron. Nature Reviews Neuroscience, 9(4):255–266, 2008.
- Attention is all you need in speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 21–25. IEEE, 2021.
- Summerfield, Q. Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 335(1273):71–78, 1992.
- Tallon-Baudry, C. On the neural mechanisms subserving consciousness and attention. Frontiers in Psychology, 2:397, 2012.
- ITU-T. Wideband extension to Recommendation P.862 for the assessment of wideband telephone networks and speech codecs. International Telecommunication Union, Recommendation P.862.2, 2007.
- Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.
- Time domain audio visual speech separation. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 667–673, 2019.
- Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 241–245. IEEE, 2017.
- Wavesplit: End-to-end speech separation by speaker clustering. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:2840–2849, 2021.
- Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.