iPhonMatchNet: Zero-Shot User-Defined Keyword Spotting Using Implicit Acoustic Echo Cancellation (2309.06096v3)
Abstract: In response to the increasing interest in human-machine communication across various domains, this paper introduces a novel approach called iPhonMatchNet, which addresses the challenge of barge-in scenarios, wherein user speech overlaps with device playback audio, thereby creating a self-referencing problem. The proposed model leverages implicit acoustic echo cancellation (iAEC) techniques to increase the efficiency of user-defined keyword spotting models, achieving a remarkable 95% reduction in mean absolute error with a minimal increase in model size (0.13%) compared to the baseline model, PhonMatchNet. We also present an efficient model structure and demonstrate its capability to learn iAEC functionality without requiring a clean signal. The findings of our study indicate that the proposed model achieves competitive performance in real-world deployment conditions of smart devices.
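To make the iAEC idea concrete, the sketch below shows, in PyTorch, one way a keyword-spotting encoder can consume the device's playback signal as a second input so that echo suppression is learned implicitly rather than applied by a separate AEC front end. This is a minimal illustration under our own assumptions: the module names, feature shapes, and the gating mechanism are hypothetical and do not reproduce the actual iPhonMatchNet architecture.

```python
# Minimal sketch (not the authors' architecture) of implicit AEC for
# keyword spotting: the playback (far-end) reference conditions the mic
# stream so the network can learn to ignore self-generated echo.
import torch
import torch.nn as nn

class ImplicitAECEncoder(nn.Module):
    def __init__(self, n_mels: int = 40, hidden: int = 128):
        super().__init__()
        # Separate projections for mic capture and playback reference features.
        self.mic_proj = nn.Linear(n_mels, hidden)
        self.ref_proj = nn.Linear(n_mels, hidden)
        # The reference stream produces a gate (FiLM-style modulation) that
        # attenuates mic content predicted by the device's own playback.
        self.gate = nn.Linear(hidden, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, mic_feats, ref_feats):
        # mic_feats, ref_feats: (batch, time, n_mels), assumed time-aligned.
        m = self.mic_proj(mic_feats)
        r = torch.sigmoid(self.gate(self.ref_proj(ref_feats)))
        x = m * (1.0 - r)          # suppress echo-dominated components
        out, _ = self.encoder(x)
        return out[:, -1]          # utterance-level embedding

# Usage: score a user-defined keyword embedding against a barge-in utterance.
enc = ImplicitAECEncoder()
mic = torch.randn(1, 100, 40)      # mic capture with user speech + echo
ref = torch.randn(1, 100, 40)      # device playback reference
keyword_emb = torch.randn(1, 128)  # enrolled keyword embedding (illustrative)
score = torch.cosine_similarity(enc(mic, ref), keyword_emb)
```

In such a setup the model is trained only on the keyword-spotting objective over simulated barge-in mixtures, which is consistent with the paper's claim that iAEC functionality can be learned without requiring a clean target signal.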
- “ICA-based efficient blind dereverberation and echo cancellation method for barge-in-able robot audition,” in 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009, pp. 3677–3680.
- “Development of a robot quizmaster with auditory functions for speech-based multiparty interaction,” in 2014 IEEE/SICE International Symposium on System Integration. IEEE, 2014, pp. 328–333.
- “A study for improving device-directed speech detection toward frictionless human-machine interaction,” in INTERSPEECH, 2019, pp. 3342–3346.
- “Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 7310–7314.
- “Acoustic echo canceller with high speech quality,” in ICASSP’87. IEEE International Conference on Acoustics, Speech, and Signal Processing. IEEE, 1987, vol. 12, pp. 2125–2128.
- “Study of the general Kalman filter for echo cancellation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 8, pp. 1539–1549, 2013.
- “Deep learning for joint acoustic echo and noise cancellation with nonlinear distortions,” in Interspeech, 2019, pp. 4255–4259.
- “Acoustic Echo Cancellation with the Dual-Signal Transformation LSTM Network,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 7138–7142.
- “Low-complexity acoustic echo cancellation with neural Kalman filtering,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- “ICASSP 2021 acoustic echo cancellation challenge: Datasets, testing framework, and results,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 151–155.
- “ICASSP 2022 acoustic echo cancellation challenge,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9107–9111.
- “Implicit acoustic echo cancellation for keyword spotting and device-directed speech detection,” in 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2023, pp. 1052–1058.
- “Query-by-example keyword spotting using long short-term memory networks,” in Proc. ICASSP 2015, 2015, pp. 5236–5240.
- “DONUT: CTC-based query-by-example keyword spotting,” in NeurIPS 2018 Workshop on Interpretability and Robustness in Audio, Speech, and Language (IRASL), 2018.
- “Query-by-example keyword spotting system using multi-head attention and soft-triple loss,” in Proc. ICASSP 2021, 2021, pp. 6858–6862.
- “Learning Audio-Text Agreement for Open-vocabulary Keyword Spotting,” in Proc. Interspeech 2022, 2022, pp. 1871–1875.
- “PhonMatchNet: Phoneme-Guided Zero-Shot Keyword Spotting for User-Defined Keywords,” in Proc. INTERSPEECH 2023, 2023, pp. 3964–3968.
- “Flexible keyword spotting based on homogeneous audio-text embedding,” arXiv preprint arXiv:2308.06472, 2023.
- “Training keyword spotters with limited and synthesized speech data,” in Proc. ICASSP 2020, 2020, pp. 7474–7478.
- Pete Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
- “Query-by-example on-device keyword spotting,” in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 532–538.
- “MUSAN: A Music, Speech, and Noise Corpus,” arXiv preprint arXiv:1510.08484, 2015.
- “A scalable noisy speech dataset and online subjective test framework,” Proc. Interspeech 2019, pp. 1816–1820, 2019.
- “Pyroomacoustics: A Python package for audio room simulation and array processing algorithms,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 351–355.