Open vocabulary keyword spotting through transfer learning from speech synthesis (2404.03914v2)

Published 5 Apr 2024 in cs.HC, cs.SD, and eess.AS

Abstract: Identifying keywords in an open-vocabulary context is crucial for personalizing interactions with smart devices. Previous approaches to open-vocabulary keyword spotting depend on a shared embedding space created by audio and text encoders. However, these approaches suffer from heterogeneous modality representations (i.e., audio-text mismatch). To address this issue, our proposed framework leverages knowledge acquired from a pre-trained text-to-speech (TTS) system. This knowledge transfer makes the text representations derived from the text encoder aware of audio projections. The performance of the proposed approach is compared with various baseline methods across four different datasets. The robustness of our proposed model is evaluated by assessing its performance across different word lengths and in an out-of-vocabulary (OOV) scenario. Additionally, the effectiveness of transfer learning from the TTS system is investigated by analyzing its different intermediate representations. The experimental results indicate that, on the challenging LibriPhrase Hard dataset, the proposed approach outperformed the cross-modality correspondence detector (CMCD) method, with improvements of 8.22% in area under the curve (AUC) and 12.56% in equal error rate (EER).
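
The abstract reports results as AUC and EER. As a reading aid, the following is a minimal sketch of how these two metrics can be computed from match scores and binary labels. It is not the paper's evaluation code: the function names (`roc_points`, `auc`, `eer`), the toy scores, and the nearest-point EER approximation are all assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (false-positive rate, true-positive rate)


def roc_points(scores: List[float], labels: List[int]) -> List[Point]:
    """Sweep the decision threshold from high to low and collect ROC points."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative trial")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every trial tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: List[Point]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def eer(points: List[Point]) -> float:
    """Approximate EER: take the ROC point where the false-positive rate and
    the false-negative rate (1 - TPR) are closest, and average the two."""
    fpr, tpr = min(points, key=lambda p: abs(p[0] - (1.0 - p[1])))
    return (fpr + (1.0 - tpr)) / 2


# Hypothetical toy data: label 1 means the audio contains the keyword text.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
print("AUC:", auc(curve))  # 8 of 9 positive/negative pairs are ranked correctly
print("EER:", eer(curve))  # FPR and FNR both equal 1/3 at one threshold
```

Higher AUC and lower EER both indicate better separation between matching and non-matching audio-text pairs. With real score distributions the two error rates rarely coincide exactly at a sampled threshold, so evaluation toolkits usually interpolate between neighbouring ROC points rather than picking the nearest one as this sketch does.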
