- The paper presents DiVA, a model that uses cross-modal distillation from text-only LLM outputs to train an efficient voice assistant without instruction data.
- The methodology combines a Whisper encoder with a Querying Transformer (Q-Former), aligning the audio-conditioned token embeddings and hidden states with their text-conditioned counterparts via L2 losses to preserve the base LLM's behavior.
- Results show a 72% win rate in user preference for spoken QA, alongside effective emotion recognition and speech translation, indicating strong multimodal capabilities.
Distilling an End-to-End Voice Assistant Without Instruction Training Data
The paper "Distilling an End-to-End Voice Assistant Without Instruction Training Data" presents a novel approach to training Speech LLMs without relying on instruction data. The authors focus on overcoming challenges associated with the traditional methods of finetuning such models, which can lead to increased complexity and a loss of capabilities when transitioning from text-only LLMs to those that handle speech.
Key Contributions and Methodology
The authors introduce the Distilled Voice Assistant (DiVA), which is trained with self-supervision from a text-only LLM responding to transcripts rather than with annotated instruction data. In this cross-modal distillation setup, the model learns to respond to speech by aligning its audio-conditioned behavior with the outputs the text LLM produces for the corresponding transcripts.
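A minimal sketch of what one training step could look like under this setup follows. It assumes a Hugging-Face-style causal LLM interface (`get_input_embeddings`, `inputs_embeds`, `output_hidden_states`) and a frozen LLM; `audio_frontend`, `loss_fn`, and the batch keys are hypothetical placeholders, not the paper's implementation. The point it illustrates is that only paired (audio, transcript) ASR data is required.

```python
import torch

def context_distillation_step(batch, audio_frontend, llm, loss_fn, optimizer):
    """One step of cross-modal context distillation (illustrative sketch).

    Supervision comes entirely from the frozen text-only LLM reading the
    transcript; no instruction-response annotations are involved.
    """
    audio, transcript_ids = batch["audio"], batch["transcript_ids"]

    # Teacher pass: the text-only LLM conditioned on the transcript.
    with torch.no_grad():
        text_embeds = llm.get_input_embeddings()(transcript_ids)   # (B, Q, d)
        teacher_out = llm(inputs_embeds=text_embeds, output_hidden_states=True)

    # Student pass: the same LLM conditioned on learned audio embeddings from
    # the trainable audio front-end (Whisper encoder + Q-Former in DiVA).
    audio_embeds = audio_frontend(audio)                            # (B, Q, d)
    student_out = llm(inputs_embeds=audio_embeds, output_hidden_states=True)

    # Align the audio-conditioned behavior with the text-conditioned behavior;
    # the concrete loss terms are sketched in the Architecture section below.
    # (This toy example assumes transcripts are padded or truncated to the
    # same number of positions Q as the audio soft tokens.)
    loss = loss_fn(audio_embeds, text_embeds,
                   student_out.hidden_states[-1][:, -1],            # (B, d)
                   teacher_out.hidden_states[-1][:, -1])            # (B, d)

    loss.backward()   # with the LLM frozen, gradients reach only the front-end
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```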
Architecture
- Audio Feature Extraction: DiVA processes audio with the Whisper encoder, preserving the encoder's structure for efficiency, and adds a Querying Transformer (Q-Former) initialized from Whisper's decoder to map audio features into the LLM's embedding space.
- Cross-Modal Alignment: Knowledge is distilled from the text modality to the speech modality through a combination of a token alignment loss and a hidden-state alignment loss. Rather than computing a KL-divergence term directly, the approach minimizes the L2 distance between the audio- and text-conditioned hidden states, keeping the model's audio-conditioned outputs close to the text teacher's distribution (see the loss sketch after this list).
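For concreteness, the sketch below combines the two alignment terms described above. The function name, tensor shapes, and weights are illustrative assumptions rather than the paper's exact recipe; it plugs into the earlier training-step sketch as `loss_fn`.

```python
import torch.nn.functional as F

def diva_style_alignment_loss(audio_embeds, text_embeds,
                              student_hidden, teacher_hidden,
                              token_weight=1.0, distill_weight=1.0):
    """Combined alignment loss (illustrative sketch, not the paper's recipe).

    audio_embeds:   (B, Q, d) Q-Former outputs fed to the LLM as soft tokens
    text_embeds:    (B, Q, d) transcript token embeddings (assumed padded or
                    truncated to the same Q positions for this toy example)
    student_hidden: (B, d) final LLM hidden state when conditioned on audio
    teacher_hidden: (B, d) final LLM hidden state when conditioned on text
    """
    # Token alignment: pull the learned audio embeddings toward the
    # transcript's token embeddings.
    token_align = F.mse_loss(audio_embeds, text_embeds)

    # Hidden-state distillation: L2 distance between the final hidden states.
    # Because both passes share the same LM head, keeping these states close
    # keeps the audio- and text-conditioned output distributions close,
    # standing in for an explicit KL-divergence term.
    hidden_distill = F.mse_loss(student_hidden, teacher_hidden)

    return token_weight * token_align + distill_weight * hidden_distill
```

The design point this illustrates is that matching final hidden states with an L2 term is far cheaper than a full-vocabulary KL divergence, while still constraining the output distribution through the shared LM head.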
Results and Evaluation
The evaluation covers spoken question answering, speech classification, and speech translation. DiVA demonstrates strong performance across these benchmarks, notably achieving a 72% win rate in a user preference study against state-of-the-art models such as Qwen 2 Audio.
- Spoken Question Answering: DiVA improves accuracy over the baselines, highlighting its ability to generalize to spoken queries without explicit task-specific instruction training.
- Speech Classification: Despite limited supervision on tone, DiVA handles tasks such as emotion recognition effectively, suggesting that context distillation captures relevant sociophonetic cues.
- Speech Translation: The model performs competently across typologically diverse languages, although uneven results on specific languages point to room for improvement.
Implications and Future Directions
This research shifts the paradigm by using cross-modal distillation to train Speech LLMs efficiently, reducing reliance on extensive labeled data and computational resources. It opens pathways for more sustainable development of multimodal models, leveraging existing LLMs' capabilities without exhaustive retraining.
Future work could expand this approach to other modalities, exploring the scalability of context distillation in more complex environments. Additionally, addressing the nuances of tone and cultural context in language processing remains a critical area for development.
In conclusion, this work represents a significant step in efficiently bridging audio and text modalities, demonstrating that innovative training methodologies can yield robust, user-preferred voice assistants with minimized resource expenditure.