Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues (2311.14275v1)
Abstract: In this work, we leverage facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE). The full facial region, which encompasses the lips, reflects additional speech-related attributes such as gender, skin color, and nationality that contribute to the effectiveness of AVSE. However, the face also carries static and dynamic speech-unrelated attributes that cause appearance changes during speech. To address these challenges, we propose a Dual Attention Cooperative Framework, DualAVSE, which ignores speech-unrelated information, captures speech-related information from facial cues, and dynamically integrates it with the audio signal for AVSE. Specifically, we introduce a spatial attention-based visual encoder that captures and enhances visual speech information beyond the lip region, incorporating global facial context and automatically suppressing speech-unrelated information for robust visual feature extraction. In addition, we introduce a dynamic visual feature fusion strategy built on a temporal-dimensional self-attention module, enabling the model to handle facial variations robustly. Because acoustic noise varies over the course of an utterance and degrades audio quality, we further introduce a dynamic fusion strategy over both audio and visual features. By combining cooperative dual attention in the visual encoder with the audio-visual fusion strategy, our model effectively extracts beneficial speech information from both modalities for AVSE. Thorough analysis and comparisons on multiple datasets, including normal cases and challenging cases with unreliable or absent visual information, consistently show that our model outperforms existing methods across multiple metrics.
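The three mechanisms named in the abstract can be sketched in a minimal, hypothetical form: spatial attention pooling over a facial feature map (the visual encoder), temporal self-attention over the per-frame visual features, and a per-frame gate that trades off audio against visual features. This is not the authors' implementation; all function names, shapes, and the scoring vector `w` are illustrative assumptions, and real models would use learned projections.

```python
# Illustrative NumPy sketch (not the DualAVSE code) of the abstract's
# three components: spatial attention pooling, temporal self-attention,
# and dynamic audio-visual fusion. Shapes and parameters are assumptions.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention_pool(fmap, w):
    # fmap: (H, W, C) facial feature map; w: (C,) scoring vector (learned
    # in a real model). The attention map down-weights speech-unrelated
    # regions and pools the map into a single C-dim visual feature.
    scores = fmap @ w                                  # (H, W)
    attn = softmax(scores.reshape(-1))                 # flatten spatial dims
    return attn @ fmap.reshape(-1, fmap.shape[-1])     # (C,)

def temporal_self_attention(seq):
    # seq: (T, C) per-frame visual features. Scaled dot-product
    # self-attention lets reliable frames compensate for degraded ones.
    d = seq.shape[-1]
    attn = softmax(seq @ seq.T / np.sqrt(d), axis=-1)  # (T, T), rows sum to 1
    return attn @ seq                                  # (T, C)

def dynamic_fusion(audio, visual, g):
    # audio, visual: (T, C); g: (T,) gate in [0, 1]. A larger gate leans
    # on visual cues when the audio frame is noisy, and vice versa.
    return g[:, None] * visual + (1.0 - g[:, None]) * audio
```

In the actual framework the gate and attention weights would be predicted by learned sub-networks; the sketch only shows how the two attentions and the fusion compose.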
Authors: Feixiang Wang, Shuang Yang, Shiguang Shan, Xilin Chen