A Multimodal Approach to Device-Directed Speech Detection with Large Language Models (2403.14438v2)
Abstract: Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to a large language model (LLM). Finally, we explore a multimodal system that combines acoustic features, lexical features, and ASR decoder signals in an LLM. Using multimodal information yields relative equal-error-rate (EER) improvements over text-only and audio-only models of up to 39% and 61%, respectively. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.
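The sketch below is a minimal PyTorch illustration of the multimodal idea described in the abstract, not the authors' implementation: non-text inputs are projected into the LLM's token-embedding space and prepended to the tokenized 1-best hypothesis, the backbone is frozen, and only low-rank (LoRA-style) adapters plus small projections are trained. The toy Transformer backbone, the pooled `audio_emb`, the utterance-level `decoder_sig` vector, and all dimensions are illustrative assumptions; the paper uses a pretrained LLM and dedicated acoustic/ASR front-ends.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pretrained weights fixed
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # low-rank update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora(module: nn.Module, rank: int = 8) -> None:
    """Recursively wrap plain nn.Linear layers with LoRA adapters."""
    for name, child in module.named_children():
        if isinstance(child, nn.MultiheadAttention):
            continue  # fused projections; left frozen in this sketch
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank))
        else:
            add_lora(child, rank)


class Block(nn.Module):
    """Minimal pre-norm Transformer block standing in for one LLM layer."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff1 = nn.Linear(d_model, 4 * d_model)
        self.ff2 = nn.Linear(4 * d_model, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.ff2(torch.relu(self.ff1(self.norm2(x))))


class DirectednessClassifier(nn.Module):
    """Fuses acoustic, lexical, and ASR decoder signals in one backbone."""

    def __init__(self, vocab_size=32000, d_model=256, audio_dim=512,
                 decoder_dim=4, n_prefix=4, n_layers=2, n_heads=4):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Project the pooled audio embedding to a few soft "prefix tokens".
        self.audio_proj = nn.Linear(audio_dim, n_prefix * d_model)
        # Project utterance-level decoder signals to one extra token.
        self.decoder_proj = nn.Linear(decoder_dim, d_model)
        self.blocks = nn.Sequential(*[Block(d_model, n_heads)
                                      for _ in range(n_layers)])
        self.head = nn.Linear(d_model, 1)  # logit: device-directed or not
        self.n_prefix = n_prefix

    def forward(self, audio_emb, decoder_sig, token_ids):
        # audio_emb:   (B, audio_dim)   pooled acoustic representation
        # decoder_sig: (B, decoder_dim) e.g. ASR confidence-style scores
        # token_ids:   (B, T)           tokenized 1-best ASR hypothesis
        bsz = audio_emb.size(0)
        prefix = self.audio_proj(audio_emb).view(bsz, self.n_prefix, -1)
        dec = self.decoder_proj(decoder_sig).unsqueeze(1)
        text = self.token_emb(token_ids)
        h = self.blocks(torch.cat([prefix, dec, text], dim=1))
        return self.head(h.mean(dim=1)).squeeze(-1)


model = DirectednessClassifier()
for p in model.blocks.parameters():  # freeze the "pretrained" backbone...
    p.requires_grad = False
add_lora(model.blocks)               # ...then attach trainable adapters
logits = model(torch.randn(2, 512), torch.randn(2, 4),
               torch.randint(0, 32000, (2, 12)))
print(logits.shape)                  # torch.Size([2])
```

Freezing the backbone and training only the low-rank adapters (plus the small projections and head) is what keeps adapting a large LLM cheap; mapping non-text inputs to soft prefix tokens is one common way to feed such signals into a text-only backbone.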
Authors: Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi, Dominik Wagner