- The paper presents DiVA, a model that uses cross-modal distillation from text-only LLM outputs to train an efficient voice assistant without instruction data.
- The methodology combines a Whisper encoder with a Querying Transformer (Q-Former), aligning the audio-conditioned token embeddings and hidden states with their text-conditioned counterparts via L2 losses to preserve the base LLM's behavior.
- Results show a 72% win rate in user preference for spoken QA, alongside effective emotion recognition and speech translation, indicating strong multimodal capabilities.
Distilling an End-to-End Voice Assistant Without Instruction Training Data
The paper "Distilling an End-to-End Voice Assistant Without Instruction Training Data" presents a novel approach to training Speech LLMs without relying on instruction data. The authors focus on overcoming challenges associated with the traditional methods of finetuning such models, which can lead to increased complexity and a loss of capabilities when transitioning from text-only LLMs to those that handle speech.
Key Contributions and Methodology
The authors introduce the Distilled Voice Assistant (DiVA), which is trained with self-supervision from a text-only LLM responding to transcripts rather than with annotated instruction data. In this cross-modal distillation setup, the model learns to respond to speech by aligning its audio-conditioned behavior with the outputs the text LLM produces for the corresponding transcripts.
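A minimal sketch of what one training step could look like under this setup follows. It assumes a Hugging-Face-style causal LLM interface (`get_input_embeddings`, `inputs_embeds`, `output_hidden_states`) and a frozen LLM; `audio_frontend`, `loss_fn`, and the batch keys are hypothetical placeholders, not the paper's implementation. The point it illustrates is that only paired (audio, transcript) ASR data is required.

```python
import torch

def context_distillation_step(batch, audio_frontend, llm, loss_fn, optimizer):
    """One step of cross-modal context distillation (illustrative sketch).

    Supervision comes entirely from the frozen text-only LLM reading the
    transcript; no instruction-response annotations are involved.
    """
    audio, transcript_ids = batch["audio"], batch["transcript_ids"]

    # Teacher pass: the text-only LLM conditioned on the transcript.
    with torch.no_grad():
        text_embeds = llm.get_input_embeddings()(transcript_ids)   # (B, Q, d)
        teacher_out = llm(inputs_embeds=text_embeds, output_hidden_states=True)

    # Student pass: the same LLM conditioned on learned audio embeddings from
    # the trainable audio front-end (Whisper encoder + Q-Former in DiVA).
    audio_embeds = audio_frontend(audio)                            # (B, Q, d)
    student_out = llm(inputs_embeds=audio_embeds, output_hidden_states=True)

    # Align the audio-conditioned behavior with the text-conditioned behavior;
    # the concrete loss terms are sketched in the Architecture section below.
    # (This toy example assumes transcripts are padded or truncated to the
    # same number of positions Q as the audio soft tokens.)
    loss = loss_fn(audio_embeds, text_embeds,
                   student_out.hidden_states[-1][:, -1],            # (B, d)
                   teacher_out.hidden_states[-1][:, -1])            # (B, d)

    loss.backward()   # with the LLM frozen, gradients reach only the front-end
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```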
Architecture
- Audio Feature Extraction: DiVA processes audio with the Whisper encoder, preserving the encoder's structure for efficiency, and adds a Querying Transformer (Q-Former) initialized from Whisper's decoder to map audio features into the LLM's embedding space.
- Cross-Modal Alignment: Knowledge is distilled from the text modality to the speech modality through a combination of a token alignment loss and a hidden-state alignment loss. Rather than computing a KL-divergence term directly, the approach minimizes the L2 distance between the audio- and text-conditioned hidden states, keeping the model's audio-conditioned outputs close to the text teacher's distribution (see the loss sketch after this list).
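For concreteness, the sketch below combines the two alignment terms described above. The function name, tensor shapes, and weights are illustrative assumptions rather than the paper's exact recipe; it plugs into the earlier training-step sketch as `loss_fn`.

```python
import torch.nn.functional as F

def diva_style_alignment_loss(audio_embeds, text_embeds,
                              student_hidden, teacher_hidden,
                              token_weight=1.0, distill_weight=1.0):
    """Combined alignment loss (illustrative sketch, not the paper's recipe).

    audio_embeds:   (B, Q, d) Q-Former outputs fed to the LLM as soft tokens
    text_embeds:    (B, Q, d) transcript token embeddings (assumed padded or
                    truncated to the same Q positions for this toy example)
    student_hidden: (B, d) final LLM hidden state when conditioned on audio
    teacher_hidden: (B, d) final LLM hidden state when conditioned on text
    """
    # Token alignment: pull the learned audio embeddings toward the
    # transcript's token embeddings.
    token_align = F.mse_loss(audio_embeds, text_embeds)

    # Hidden-state distillation: L2 distance between the final hidden states.
    # Because both passes share the same LM head, keeping these states close
    # keeps the audio- and text-conditioned output distributions close,
    # standing in for an explicit KL-divergence term.
    hidden_distill = F.mse_loss(student_hidden, teacher_hidden)

    return token_weight * token_align + distill_weight * hidden_distill
```

The design point this illustrates is that matching final hidden states with an L2 term is far cheaper than a full-vocabulary KL divergence, while still constraining the output distribution through the shared LM head.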
Results and Evaluation
The evaluation covers spoken question answering, speech classification, and speech translation. DiVA demonstrates strong performance across these benchmarks, notably achieving a 72% win rate in a user preference study against state-of-the-art models such as Qwen 2 Audio.
- Spoken Question Answering: DiVA improves accuracy over the baselines, highlighting its ability to generalize to spoken queries without explicit task-specific instruction training.
- Speech Classification: Despite limited supervision on tone, DiVA handles tasks such as emotion recognition effectively, suggesting that context distillation captures relevant sociophonetic cues.
- Speech Translation: The model performs competently across typologically diverse languages, although uneven results on specific languages point to room for improvement.
Implications and Future Directions
This research shifts the paradigm by using cross-modal distillation to train Speech LLMs efficiently, reducing reliance on extensive labeled data and computational resources. It opens pathways for more sustainable development of multimodal models, leveraging existing LLMs' capabilities without exhaustive retraining.
Future work could expand this approach to other modalities, exploring the scalability of context distillation in more complex environments. Additionally, addressing the nuances of tone and cultural context in language processing remains a critical area for development.
In conclusion, this work represents a significant step in efficiently bridging audio and text modalities, demonstrating that innovative training methodologies can yield robust, user-preferred voice assistants with minimized resource expenditure.