Robust Wake Word Spotting With Frame-Level Cross-Modal Attention Based Audio-Visual Conformer (2403.01700v1)

Published 4 Mar 2024 in cs.SD, cs.MM, and eess.AS

Abstract: In recent years, neural network-based Wake Word Spotting has achieved good performance on clean audio samples but struggles in noisy environments. Audio-Visual Wake Word Spotting (AVWWS) has received considerable attention because visual lip-movement information is not affected by complex acoustic scenes. Previous works usually use simple addition or concatenation for multi-modal fusion, leaving the inter-modal correlation relatively under-explored. In this paper, we propose a novel module called Frame-Level Cross-Modal Attention (FLCMA) to improve the performance of AVWWS systems. This module models multi-modal information at the frame level through synchronized lip movements and speech signals. We train an end-to-end FLCMA-based Audio-Visual Conformer and further improve performance by fine-tuning pre-trained uni-modal models for the AVWWS task. The proposed system achieves a new state-of-the-art result (4.57% WWS score) on the far-field MISP dataset.
