A Multimodal Approach to Device-Directed Speech Detection with Large Language Models (2403.14438v2)

Published 21 Mar 2024 in cs.CL, cs.LG, and eess.AS

Abstract: Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users must begin each command with a trigger phrase. We explore this task in three ways: First, we train classifiers using only acoustic information obtained from the audio waveform. Second, we take the decoder outputs of an automatic speech recognition (ASR) system, such as 1-best hypotheses, as input features to an LLM. Finally, we explore a multimodal system that combines acoustic and lexical features, as well as ASR decoder signals, in an LLM. Using multimodal information yields relative equal-error-rate improvements over text-only and audio-only models of up to 39% and 61%. Increasing the size of the LLM and training with low-rank adaptation leads to further relative EER reductions of up to 18% on our dataset.

Authors (7)
  1. Alexander Churchill (3 papers)
  2. Siddharth Sigtia (15 papers)
  3. Panayiotis Georgiou (32 papers)
  4. Matt Mirsamadi (2 papers)
  5. Aarshee Mishra (3 papers)
  6. Erik Marchi (18 papers)
  7. Dominik Wagner (29 papers)

Summary

A Multi-signal LLM for Device-Directedness Detection

Introduction

The paper introduces an architecture for device-directedness detection, a key component in making interactions between humans and virtual assistants more natural. Traditional systems rely on trigger phrases or button presses to determine when a user is addressing the device; the proposed model instead aims to identify device-directed speech without these explicit cues, enabling a more seamless user experience.

Methodology

The authors propose a multi-modal approach that combines acoustic, lexical, and confidence information derived from an automatic speech recognition (ASR) system. This combination leverages the strengths of each signal to compensate for the others' limitations, such as background noise degrading the acoustic evidence or ambiguous phrasing weakening the lexical evidence. At its core, a pre-trained LLM decides whether an utterance is directed at the device, conditioned on a sequence of continuous embeddings from an audio encoder that is prepended to the lexical input as prefix tokens.
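To make the prefix idea concrete, below is a minimal sketch of this style of prefix conditioning with GPT-2. The pooled 1280-dimensional audio features, the number of prefix tokens, and the " yes"/" no" decision tokens are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of audio-prefix conditioning for an LLM-based
# directedness decision (illustrative; not the authors' code).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
llm = GPT2LMHeadModel.from_pretrained("gpt2")


class AudioPrefixMapper(nn.Module):
    """Maps pooled audio-encoder features to a fixed number of LLM prefix embeddings."""

    def __init__(self, audio_dim=1280, n_prefix=8, llm_dim=768):
        super().__init__()
        self.n_prefix, self.llm_dim = n_prefix, llm_dim
        self.proj = nn.Linear(audio_dim, n_prefix * llm_dim)

    def forward(self, audio_feats):                  # (batch, audio_dim)
        prefix = self.proj(audio_feats)              # (batch, n_prefix * llm_dim)
        return prefix.view(-1, self.n_prefix, self.llm_dim)


mapper = AudioPrefixMapper()

# Hypothetical inputs: pooled audio-encoder features and a 1-best ASR hypothesis.
audio_feats = torch.randn(1, 1280)
hypothesis = "what's the weather like today"

token_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
text_emb = llm.transformer.wte(token_ids)             # (1, T, 768)
prefix_emb = mapper(audio_feats)                      # (1, 8, 768)
inputs_embeds = torch.cat([prefix_emb, text_emb], dim=1)

# Score two candidate continuation tokens: " yes" (directed) vs. " no" (not).
with torch.no_grad():
    logits = llm(inputs_embeds=inputs_embeds).logits[:, -1, :]
yes_id = tokenizer(" yes").input_ids[0]
no_id = tokenizer(" no").input_ids[0]
score = torch.softmax(logits[0, [yes_id, no_id]], dim=-1)[0]
print(f"device-directedness score: {score.item():.3f}")
```

In this framing, the directedness score is simply the probability the LLM assigns to the affirmative token given the audio prefix and the ASR hypothesis.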

Data

The training dataset comprises approximately 40k directed and 40k non-directed utterances, enriched with about 3 million additional utterances of transcribed speech for textual data augmentation. The evaluation leveraged three datasets to assess performance comprehensively.

Technical Details

The architecture integrates three components:

  • Audio Encoder: Uses Whisper, a large-scale speech recognition model, to map the audio input into a compact latent representation.
  • Decoder Signals: Extracts additional contextual cues from the ASR's lattice decoder, including graph and acoustic costs and confidence scores (a brief sketch of this path follows the list below).
  • LLM: Employs GPT-2, adapted to the device-directedness detection task by fine-tuning on the combined acoustic, lexical, and decoder-signal inputs.
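
As a rough illustration of the decoder-signal path, the sketch below projects a handful of per-utterance lattice statistics into the LLM embedding space so they can be concatenated with the audio prefix and token embeddings. The specific feature set, its normalization, and the MLP shape are assumptions made for illustration, not the paper's exact design.

```python
# Illustrative sketch of encoding ASR decoder signals as an extra LLM prefix
# vector (assumed feature set and network shape; not the paper's implementation).
import torch
import torch.nn as nn


class DecoderSignalEncoder(nn.Module):
    """Projects utterance-level lattice statistics into the LLM embedding space."""

    def __init__(self, n_features=3, llm_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, 128),
            nn.ReLU(),
            nn.Linear(128, llm_dim),
        )

    def forward(self, signals):                 # (batch, n_features)
        return self.mlp(signals).unsqueeze(1)   # (batch, 1, llm_dim)


# Hypothetical, pre-normalized per-utterance signals from the 1-best lattice path:
# [graph cost, acoustic cost, ASR confidence].
signals = torch.tensor([[0.42, -1.30, 0.91]])
signal_prefix = DecoderSignalEncoder()(signals)
print(signal_prefix.shape)  # torch.Size([1, 1, 768]) -> concatenated with the
                            # audio prefix and token embeddings before the LLM
```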

Experiments and Results

The results underscore the effectiveness of the proposed multi-modal approach, showing substantial improvements over unimodal (text-only or audio-only) models. Specifically, the model achieved:

  • A 38.9% relative improvement in Equal Error Rate (EER) over the text-only model.
  • A 20.5% improvement over the best-performing audio-only model.

These improvements indicate that the model exploits complementary information across modalities, yielding a more accurate and reliable detector of device-directed speech.
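
For readers less familiar with the metric: EER is the operating point at which the false-accept and false-reject rates coincide. A generic way to estimate it from detection scores and labels (a standard recipe, not code from the paper) is shown below.

```python
# Generic EER computation from detection scores (standard recipe, not tied to
# the paper's tooling).
import numpy as np
from sklearn.metrics import roc_curve


def equal_error_rate(labels, scores):
    """Return the EER: the point where false-positive and false-negative rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))
    return (fpr[idx] + fnr[idx]) / 2.0


labels = np.array([1, 1, 0, 0, 1, 0])                 # 1 = device-directed
scores = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.55])    # toy model scores
print(f"EER = {equal_error_rate(labels, scores):.3f}")
```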

Implications and Future Directions

This research has significant theoretical and practical implications:

  • Theoretical: Demonstrates the viability of treating device-directedness detection as a text-generation problem and confirms the potential of multi-modal learning in LLMs.
  • Practical: Offers an avenue for more natural and efficient interactions between users and virtual assistants by reducing the reliance on explicit activation cues.

The authors suggest future work could explore extending the model to additional tasks relevant to virtual assistants, such as audio captioning or acoustic scene classification. This direction could further enhance the utility and applicability of LLMs in human-computer interaction contexts.

Conclusion

The paper presents a compelling approach to device-directedness detection, advancing the field with its multi-modal, text-generation framework. By combining the strengths of acoustic, lexical, and ASR decoder information, the model supports more natural interactions with virtual assistants and points toward more intuitive, trigger-free communication with such systems.