Contrastive Learning With Audio Discrimination For Customizable Keyword Spotting In Continuous Speech (2401.06485v1)
Abstract: Customizable keyword spotting (KWS) in continuous speech has attracted increasing attention due to its real-world application potential. While contrastive learning (CL) has been widely used to extract keyword representations, previous CL approaches all operate on pre-segmented isolated words and employ only an audio-text representation matching strategy. However, for KWS in continuous speech, co-articulation and streaming word segmentation can easily yield similar audio patterns for different texts, which may consequently trigger false alarms. To address this issue, we propose a novel CL with Audio Discrimination (CLAD) approach to learning keyword representations with both audio-text matching and audio-audio discrimination ability. Specifically, an InfoNCE loss that considers both audio-audio and audio-text CL data pairs is employed for each sliding window during training. Evaluations on the open-source LibriPhrase dataset show that the sliding-window-level InfoNCE loss yields performance comparable to previous CL approaches. Furthermore, experiments on the continuous speech dataset LibriSpeech demonstrate that, by incorporating audio discrimination, CLAD achieves significant performance gains over CL without audio discrimination. Meanwhile, compared to two-stage KWS approaches, end-to-end KWS with CLAD achieves not only better performance but also a significant speed-up.
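As a rough illustration of the window-level objective described above, the sketch below combines audio-audio and audio-text pairs in a single InfoNCE-style loss over a batch of sliding-window embeddings. This is only a plausible reconstruction under stated assumptions: the function name `clad_infonce_loss`, the label-based choice of positive pairs, and the temperature value are illustrative and not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def clad_infonce_loss(audio_emb, text_emb, labels, tau=0.07):
    """Hypothetical sketch of a sliding-window InfoNCE loss mixing
    audio-audio and audio-text contrastive pairs (CLAD-style).

    audio_emb: (N, D) window-level audio embeddings (one per sliding window)
    text_emb:  (N, D) text/keyword embeddings aligned with each window
    labels:    (N,)   keyword id per window; windows sharing an id are
                      treated as positives of one another (an assumption)
    """
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)

    # Temperature-scaled similarities: audio-audio and audio-text.
    sim_aa = a @ a.t() / tau                                   # (N, N)
    sim_at = a @ t.t() / tau                                   # (N, N)

    # Positive masks from labels; exclude self-pairs on the audio-audio side.
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1))          # (N, N)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_aa = pos & ~eye
    pos_at = pos            # each audio window matches its own keyword text

    # Each window contrasts against all audio windows and all text embeddings.
    logits = torch.cat([sim_aa.masked_fill(eye, float('-inf')), sim_at], dim=1)
    pos_mask = torch.cat([pos_aa, pos_at], dim=1)              # (N, 2N)

    # InfoNCE: average negative log-probability assigned to the positives.
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0)
    loss = -pos_log_prob.sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()
```

In this reading, the audio-text terms give the usual text-matching ability, while the audio-audio terms push apart windows of different keywords that happen to sound alike, which is the "audio discrimination" the abstract credits for the gains on continuous speech.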