Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models (2312.03632v1)
Abstract: Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is to determine whether a user addressed the virtual assistant based on signals obtained from the streaming audio recorded by the device microphone. We address this task by combining 1-best hypotheses and decoder signals from an automatic speech recognition system with acoustic representations from an audio encoder as input features to a large language model (LLM). In particular, we are interested in data- and resource-efficient systems that require only a small amount of training data and can operate in scenarios with only a single frozen LLM available on a device. For this reason, our model is trained on 80k or fewer examples of multimodal data using a combination of low-rank adaptation and prefix tuning. We compare the proposed system to unimodal baselines and show that the multimodal approach achieves lower equal error rates (EERs) while using only a fraction of the training data. We also show that low-dimensional specialized audio representations lead to lower EERs than high-dimensional general audio representations.
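The architecture described in the abstract, in which audio-encoder representations are projected into a prefix for a frozen LLM that also receives ASR 1-best tokens, and the model is adapted with low-rank adaptation (LoRA) plus prefix tuning, can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: all module names, dimensions, and the small Transformer stand-in for the frozen LLM (`MultimodalPrefixClassifier`, `audio_to_prefix`, etc.) are assumptions, and `LoRALinear` is shown standalone to illustrate the adapter pattern that would normally wrap the frozen LLM's projection layers.

```python
# Hypothetical sketch of the multimodal prefix + LoRA setup described in the
# abstract. Names and sizes are illustrative; a tiny Transformer encoder
# stands in for the frozen LLM backbone.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update (LoRA)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pretrained weight frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero, so the frozen output is reproduced initially
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


class MultimodalPrefixClassifier(nn.Module):
    """Projects audio features to prefix embeddings, concatenates them with
    ASR-token embeddings, and classifies device-directedness."""

    def __init__(self, d_audio=512, d_model=768, vocab=32000, n_prefix=8):
        super().__init__()
        # Prefix tuning: map pooled audio features to n_prefix soft tokens.
        self.audio_to_prefix = nn.Linear(d_audio, n_prefix * d_model)
        self.token_emb = nn.Embedding(vocab, d_model)  # frozen in practice
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for the frozen LLM
        self.classifier = nn.Linear(d_model, 1)  # directed vs. not directed
        self.n_prefix, self.d_model = n_prefix, d_model

    def forward(self, audio_feats, asr_token_ids):
        # audio_feats: (B, d_audio) pooled audio-encoder output
        #   (ASR decoder signals could be appended to this feature vector)
        # asr_token_ids: (B, T) token ids of the ASR 1-best hypothesis
        prefix = self.audio_to_prefix(audio_feats).view(-1, self.n_prefix, self.d_model)
        text = self.token_emb(asr_token_ids)
        hidden = self.backbone(torch.cat([prefix, text], dim=1))
        return self.classifier(hidden[:, 0])  # logit for "device-directed"


# Toy usage: batch of 2 utterances with 10 ASR tokens each.
model = MultimodalPrefixClassifier()
logits = model(torch.randn(2, 512), torch.randint(0, 32000, (2, 10)))
print(logits.shape)  # torch.Size([2, 1])
```

In this sketch only the prefix projection, LoRA factors, and classification head would be trainable; the backbone and token embeddings stay frozen, which is what keeps the approach data- and resource-efficient in the sense described above.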
Authors: Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi