A Multi-signal LLM for Device-Directedness Detection
Introduction
The paper introduces a novel architecture for device-directedness detection, a key component for natural interaction between humans and virtual assistants. Traditional systems rely on trigger phrases or button presses to determine when a user is addressing the device; the proposed model identifies device-directed speech without these explicit cues, enabling a more seamless user experience.
Methodology
The authors propose a multi-modal approach that combines acoustic, lexical, and confidence information derived from an automatic speech recognition (ASR) system. The combination lets each signal type compensate for the limitations of the others, such as background noise degrading the acoustic signal or ambiguous phrasing weakening the lexical signal. At the core of the method, a pre-trained LLM receives a sequence of continuous embeddings from an audio encoder as a prefix, followed by the token embeddings of the ASR hypothesis, and generates text indicating whether the utterance is directed at the device.
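The mechanics of this prefix scheme can be sketched with PyTorch and Hugging Face transformers. This is a minimal illustration, not the paper's configuration: the model choice, the projection layer, the example prompt, and the tensor shapes below are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
llm = GPT2LMHeadModel.from_pretrained("gpt2")

# Hypothetical audio-encoder output: 50 frames of 512-dim features.
audio_features = torch.randn(1, 50, 512)
project = nn.Linear(512, llm.config.n_embd)   # map encoder frames into GPT-2's embedding space
audio_prefix = project(audio_features)        # (1, 50, 768)

# Embed the ASR hypothesis text and append it after the audio prefix.
text_ids = tokenizer("play some music", return_tensors="pt").input_ids
text_embeds = llm.transformer.wte(text_ids)   # (1, T, 768)
inputs_embeds = torch.cat([audio_prefix, text_embeds], dim=1)

# The decision is read off as ordinary next-token prediction; after fine-tuning,
# this token would correspond to a directedness label. Without fine-tuning the
# output here is meaningless and only demonstrates the data flow.
with torch.no_grad():
    logits = llm(inputs_embeds=inputs_embeds).logits
next_id = logits[0, -1].argmax()
print(tokenizer.decode([next_id.item()]))
```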
Data
The training dataset comprises approximately 40k directed and 40k non-directed utterances, supplemented by about 3 million additional transcribed utterances used for textual data augmentation. Evaluation was performed on three datasets.
Technical Details
The architecture integrates three signal streams (a fusion sketch follows this list):
- Audio Encoder: Uses Whisper, a robust speech-to-text model, to map the audio input into a compact latent representation.
- Decoder Signals: Extracts additional contextual cues from the ASR's lattice decoder, including graph and acoustic costs, and confidence scores.
- LLM: Employs GPT-2, adapted to the device-directedness detection task by fine-tuning with both acoustic and lexical information.
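One plausible way to combine these streams into a single LLM prefix is sketched below. The pooling strategy, prefix length, and layer sizes are illustrative assumptions, not the paper's exact fusion scheme.

```python
import torch
import torch.nn as nn

class MultiSignalPrefix(nn.Module):
    """Fuses audio-encoder frames and per-token ASR decoder signals into LLM prefix embeddings."""
    def __init__(self, audio_dim=512, n_decoder_feats=3, llm_dim=768, prefix_len=10):
        super().__init__()
        self.audio_pool = nn.AdaptiveAvgPool1d(prefix_len)        # compress variable-length audio to a fixed prefix
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.decoder_proj = nn.Linear(n_decoder_feats, llm_dim)   # graph cost, acoustic cost, confidence

    def forward(self, audio_feats, decoder_feats):
        # audio_feats: (B, T_audio, audio_dim); decoder_feats: (B, T_tokens, n_decoder_feats)
        pooled = self.audio_pool(audio_feats.transpose(1, 2)).transpose(1, 2)  # (B, prefix_len, audio_dim)
        audio_prefix = self.audio_proj(pooled)                                 # (B, prefix_len, llm_dim)
        decoder_prefix = self.decoder_proj(decoder_feats)                      # (B, T_tokens, llm_dim)
        # Text-token embeddings from the LLM would be appended after this prefix.
        return torch.cat([audio_prefix, decoder_prefix], dim=1)

prefix = MultiSignalPrefix()(torch.randn(2, 300, 512), torch.randn(2, 6, 3))
print(prefix.shape)  # torch.Size([2, 16, 768])
```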
Experiments and Results
The results underscore the effectiveness of the proposed multi-modal approach, showing substantial improvements over unimodal (text-only or audio-only) models. Specifically, the model achieved (relative figures; see the note after this list):
- A 38.9% improvement in Equal Error Rate (EER) over text-only models.
- A 20.5% improvement over the best-performing audio-only model.
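As a note on how such figures are read (assuming the standard convention for relative improvement, which the numbers here appear to follow):

```latex
\text{relative EER improvement} \;=\; \frac{\mathrm{EER}_{\text{baseline}} - \mathrm{EER}_{\text{multi-modal}}}{\mathrm{EER}_{\text{baseline}}} \times 100\%
```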
These gains indicate that the acoustic and lexical signals are complementary, yielding a more accurate and reliable detector of device-directed speech.
Implications and Future Directions
This research has significant theoretical and practical implications:
- Theoretical: Demonstrates the viability of treating device-directedness detection as a text-generation problem and confirms the potential of multi-modal learning in LLMs.
- Practical: Offers an avenue for more natural and efficient interactions between users and virtual assistants by reducing the reliance on explicit activation cues.
The authors suggest future work could explore extending the model to additional tasks relevant to virtual assistants, such as audio captioning or acoustic scene classification. This direction could further enhance the utility and applicability of LLMs in human-computer interaction contexts.
Conclusion
The paper presents a compelling approach to device-directedness detection, advancing the field with its multi-modal, text-generation framework. By combining the strengths of acoustic and lexical information, the model improves substantially on unimodal baselines and points toward more natural, intuitive interactions with virtual assistants.