A Multimodal Approach to Device-Directed Speech Detection with Large Language Models (2403.14438v2)

Published 21 Mar 2024 in cs.CL, cs.LG, and eess.AS

Abstract: Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to an LLM. Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals, in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.

Authors (7)
  1. Alexander Churchill (3 papers)
  2. Siddharth Sigtia (15 papers)
  3. Panayiotis Georgiou (32 papers)
  4. Matt Mirsamadi (2 papers)
  5. Aarshee Mishra (3 papers)
  6. Erik Marchi (18 papers)
  7. Dominik Wagner (29 papers)

Summary

A Multi-signal LLM for Device-Directedness Detection

Introduction

The paper introduces an architecture for device-directedness detection, a key component in making interactions between humans and virtual assistants more natural. Traditional systems rely on trigger phrases or button presses to determine when a user is addressing the device; the proposed model instead aims to identify device-directed speech without these explicit cues, enabling a more seamless user experience.

Methodology

The authors propose a multi-modal approach that combines acoustic, lexical, and confidence information derived from an automatic speech recognition (ASR) system. This combination leverages the strengths of each signal to compensate for the others' limitations, such as background noise degrading the acoustic evidence or ambiguous phrasing weakening the lexical evidence. At its core, a pre-trained LLM decides whether an utterance is directed at the device, conditioned on a sequence of continuous embeddings from an audio encoder that is prepended to the lexical input as prefix tokens.
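To make the prefix idea concrete, below is a minimal sketch of this style of prefix conditioning with GPT-2. The pooled 1280-dimensional audio features, the number of prefix tokens, and the " yes"/" no" decision tokens are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of audio-prefix conditioning for an LLM-based
# directedness decision (illustrative; not the authors' code).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
llm = GPT2LMHeadModel.from_pretrained("gpt2")


class AudioPrefixMapper(nn.Module):
    """Maps pooled audio-encoder features to a fixed number of LLM prefix embeddings."""

    def __init__(self, audio_dim=1280, n_prefix=8, llm_dim=768):
        super().__init__()
        self.n_prefix, self.llm_dim = n_prefix, llm_dim
        self.proj = nn.Linear(audio_dim, n_prefix * llm_dim)

    def forward(self, audio_feats):                  # (batch, audio_dim)
        prefix = self.proj(audio_feats)              # (batch, n_prefix * llm_dim)
        return prefix.view(-1, self.n_prefix, self.llm_dim)


mapper = AudioPrefixMapper()

# Hypothetical inputs: pooled audio-encoder features and a 1-best ASR hypothesis.
audio_feats = torch.randn(1, 1280)
hypothesis = "what's the weather like today"

token_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
text_emb = llm.transformer.wte(token_ids)             # (1, T, 768)
prefix_emb = mapper(audio_feats)                      # (1, 8, 768)
inputs_embeds = torch.cat([prefix_emb, text_emb], dim=1)

# Score two candidate continuation tokens: " yes" (directed) vs. " no" (not).
with torch.no_grad():
    logits = llm(inputs_embeds=inputs_embeds).logits[:, -1, :]
yes_id = tokenizer(" yes").input_ids[0]
no_id = tokenizer(" no").input_ids[0]
score = torch.softmax(logits[0, [yes_id, no_id]], dim=-1)[0]
print(f"device-directedness score: {score.item():.3f}")
```

In this framing, the directedness score is simply the probability the LLM assigns to the affirmative token given the audio prefix and the ASR hypothesis.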

Data

The training dataset comprises approximately 40k directed and 40k non-directed utterances, enriched with about 3 million additional utterances of transcribed speech for textual data augmentation. The evaluation leveraged three datasets to assess performance comprehensively.

Technical Details

The architecture integrates three components:

  • Audio Encoder: Uses Whisper, a large-scale speech recognition model, to map the audio input into a compact latent representation.
  • Decoder Signals: Extracts additional contextual cues from the ASR's lattice decoder, including graph and acoustic costs and confidence scores (a brief sketch of this path follows the list below).
  • LLM: Employs GPT-2, adapted to the device-directedness detection task by fine-tuning on the combined acoustic, lexical, and decoder-signal inputs.
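
As a rough illustration of the decoder-signal path, the sketch below projects a handful of per-utterance lattice statistics into the LLM embedding space so they can be concatenated with the audio prefix and token embeddings. The specific feature set, its normalization, and the MLP shape are assumptions made for illustration, not the paper's exact design.

```python
# Illustrative sketch of encoding ASR decoder signals as an extra LLM prefix
# vector (assumed feature set and network shape; not the paper's implementation).
import torch
import torch.nn as nn


class DecoderSignalEncoder(nn.Module):
    """Projects utterance-level lattice statistics into the LLM embedding space."""

    def __init__(self, n_features=3, llm_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Linear(128, llm_dim),
        )

    def forward(self, signals):                 # (batch, n_features)
        return self.mlp(signals).unsqueeze(1)   # (batch, 1, llm_dim)


# Hypothetical, pre-normalized per-utterance signals from the 1-best lattice path:
# [graph cost, acoustic cost, ASR confidence].
signals = torch.tensor([[0.42, -1.30, 0.91]])
signal_prefix = DecoderSignalEncoder()(signals)
print(signal_prefix.shape)  # torch.Size([1, 1, 768]) -> concatenated with the
                            # audio prefix and token embeddings before the LLM
```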

Experiments and Results

The results underscore the effectiveness of the proposed multi-modal approach, showing substantial improvements over unimodal (text-only or audio-only) models. Specifically, the model achieved:

  • A 38.9% relative improvement in Equal Error Rate (EER) over the text-only model.
  • A 20.5% improvement over the best-performing audio-only model.

These improvements indicate that the model exploits complementary information across modalities, yielding a more accurate and reliable detector of device-directed speech.
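
For readers less familiar with the metric: EER is the operating point at which the false-accept and false-reject rates coincide. A generic way to estimate it from detection scores and labels (a standard recipe, not code from the paper) is shown below.

```python
# Generic EER computation from detection scores (standard recipe, not tied to
# the paper's tooling).
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(labels, scores):
    """Return the EER: the point where false-positive and false-negative rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0


labels = np.array([1, 1, 0, 0, 1, 0])                 # 1 = device-directed
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.55])    # toy model scores
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```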

Implications and Future Directions

This research has significant theoretical and practical implications:

  • Theoretical: Demonstrates the viability of treating device-directedness detection as a text-generation problem and confirms the potential of multi-modal learning in LLMs.
  • Practical: Offers an avenue for more natural and efficient interactions between users and virtual assistants by reducing the reliance on explicit activation cues.

The authors suggest future work could explore extending the model to additional tasks relevant to virtual assistants, such as audio captioning or acoustic scene classification. This direction could further enhance the utility and applicability of LLMs in human-computer interaction contexts.

Conclusion

The paper presents a compelling approach to device-directedness detection, advancing the field with its multi-modal, text-generation framework. By combining the strengths of acoustic, lexical, and ASR decoder information, the model supports more natural interactions with virtual assistants and points toward more intuitive, trigger-free communication with such systems.