Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models (2312.03632v1)

Published 6 Dec 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.

References (39)
  1. Siri Team, “Voice trigger system for Siri.” https://machinelearning.apple.com/research/voice-trigger, 2023.
  2. C. Jose, Y. Mishchenko, T. Sénéchal, A. Shah, A. Escott, and S. N. P. Vitaladevuni, “Accurate Detection of Wake Word Start and End Using a CNN,” in Interspeech, 2020.
  3. A. Ghosh, M. Fuhs, D. Bagchi, B. Farahani, and M. Woszczyna, “Low-resource Low-footprint Wake-word Detection using Knowledge Distillation,” in Interspeech, 2022.
  4. S. Sigtia, R. Haynes, H. Richards, E. Marchi, and J. Bridle, “Efficient Voice Trigger Detection for Low Resource Hardware,” in Interspeech, 2018.
  5. S. Sigtia, E. Marchi, S. Kajarekar, D. Naik, and J. Bridle, “Multi-task learning for speaker verification and voice trigger detection,” in ICASSP, 2020.
  6. T. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Interspeech, 2015.
  7. A. H. Michaely, X. Zhang, G. Simko, C. Parada, and P. Aleksic, “Keyword spotting for google assistant using contextual speech recognition,” in ASRU, 2017.
  8. S. Cornell, T. Balestri, and T. Sénéchal, “Implicit acoustic echo cancellation for keyword spotting and device-directed speech detection,” in SLT, 2023.
  9. D. Ng, R. Zhang, J. Q. Yip, C. Zhang, Y. Ma, T. H. Nguyen, C. Ni, E. S. Chng, and B. Ma, “Contrastive speech mixup for low-resource keyword spotting,” in ICASSP, 2023.
  10. E. Shriberg, A. Stolcke, D. Hakkani-Tür, and L. Heck, “Learning when to listen: detecting system-addressed speech in human-human-computer dialog,” in Interspeech, 2012.
  11. S. H. Mallidi, R. Maas, K. Goehner, A. Rastrow, S. Matsoukas, and B. Hoffmeister, “Device-directed Utterance Detection,” in Interspeech, 2018.
  12. V. Garg, O. Rudovic, P. Dighe, A. H. Abdelaziz, E. Marchi, S. Adya, C. Dhir, and A. Tewfik, “Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models,” in Interspeech, 2022.
  13. K. Gillespie, I. C. Konstantakopoulos, X. Guo, V. T. Vasudevan, and A. Sethy, “Improving device directedness classification of utterances with semantic lexical features,” in ICASSP, 2020.
  14. H. Sato, Y. Shinohara, and A. Ogawa, “Multi-modal modeling for device-directed speech detection using acoustic and linguistic cues,” Acoustical Science and Technology, vol. 44, no. 1, pp. 40–43, 2023.
  15. D. Bekal, S. Srinivasan, S. Ronanki, S. Bodapati, and K. Kirchhoff, “Contextual Acoustic Barge-In Classification for Spoken Dialog Systems,” in Interspeech, 2022.
  16. R. Mokady, A. Hertz, and A. H. Bermano, “ClipCap: CLIP prefix for image captioning,” 2021. arXiv:2111.09734.
  17. D. Driess et al., “PaLM-E: An embodied multimodal language model,” 2023. arXiv:2303.03378.
  18. Y. Fathullah et al., “Prompting large language models with speech recognition abilities,” 2023. arXiv:2307.11795.
  19. Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass, “Listen, think, and understand,” 2023. arXiv:2305.10790.
  20. M. Kim, K. Sung-Bin, and T.-H. Oh, “Prefix tuning for automated audio captioning,” in ICASSP, 2023.
  21. S. Deshmukh, B. Elizalde, R. Singh, and H. Wang, “Pengi: An audio language model for audio tasks,” 2023. arXiv:2305.11834.
  22. X. L. Li and P. Liang, “Prefix-tuning: Optimizing continuous prompts for generation,” in ACL-IJCNLP, 2021.
  23. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in ICLR, 2022.
  24. S. Kim, T. Hori, and S. Watanabe, “Joint CTC-attention based end-to-end speech recognition using multi-task learning,” in ICASSP, 2017.
  25. M. Bleeker, P. Swietojanski, S. Braun, and X. Zhuang, “Approximate Nearest Neighbour Phrase Mining for Contextual Speech Recognition,” in Interspeech, 2023.
  26. Y. Miao, M. Gowayyed, and F. Metze, “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding,” in ASRU, 2015.
  27. D. Povey, M. Hannemann, G. Boulianne, L. Burget, A. Ghoshal, M. Janda, M. Karafiát, S. Kombrink, P. Motlíček, Y. Qian, K. Riedhammer, K. Veselý, and N. T. Vu, “Generating exact lattices in the WFST framework,” in ICASSP, 2012.
  28. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” 2022. arXiv:2212.04356.
  29. O. Rudovic, W. Chang, V. Garg, P. Dighe, P. Simha, J. Berkowitz, A. H. Abdelaziz, S. Kajarekar, E. Marchi, and S. Adya, “Less is more: A unified architecture for device-directed speech detection with multiple invocation types,” in ICASSP, 2023.
  30. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in NeurIPS, 2017.
  31. T. B. Brown et al., “Language models are few-shot learners,” 2020. arXiv:2005.14165.
  32. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in NAACL, 2019.
  33. C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” JMLR, vol. 21, no. 140, pp. 1–67, 2020.
  34. E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, E. Goffinet, D. Heslow, J. Launay, Q. Malartic, B. Noune, B. Pannier, and G. Penedo, “Falcon-40B: an open large language model with state-of-the-art performance,” 2023.
  35. Together Computer, “RedPajama: An open source recipe to reproduce LLaMA training dataset,” 2023.
  36. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014.
  37. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020. arXiv:2001.08361.
  38. P. Dighe, P. Nayak, O. Rudovic, E. Marchi, X. Niu, and A. Tewfik, “Audio-to-intent using acoustic-textual subword representations from end-to-end ASR,” in ICASSP, 2023.
  39. O. Rudovic, A. Bindal, V. Garg, P. Simha, P. Dighe, and S. Kajarekar, “Streaming on-device detection of device directed speech from voice and touch-based invocation,” in ICASSP, 2022.
Authors (7)
  1. Dominik Wagner (29 papers)
  2. Alexander Churchill (3 papers)
  3. Siddharth Sigtia (15 papers)
  4. Panayiotis Georgiou (32 papers)
  5. Matt Mirsamadi (2 papers)
  6. Aarshee Mishra (3 papers)
  7. Erik Marchi (18 papers)
Citations (3)
