Technical Overview of Qwen2-Audio: A Large-Scale Audio-Language Model
The paper introduces Qwen2-Audio, a large-scale audio-language model developed for audio analysis and voice interaction. The model processes varied audio inputs alongside text, simplifies the pre-training pipeline, and is tuned for stronger instruction following. The research focuses on scaling Qwen2-Audio's instruction-following ability without relying on complex hierarchical tags, thereby simplifying the pre-training phase with natural language prompts and expanding the dataset size for more comprehensive learning.
Model Design and Training Methodology
Qwen2-Audio's architecture couples an audio encoder with a large language model. The audio encoder is initialized from Whisper-large-v3 and pre-processes audio by converting raw waveforms into mel-spectrograms. The language model component is built on Qwen-7B, giving an overall model size of 8.2 billion parameters. Training proceeds in three stages: pre-training with natural language prompts, supervised fine-tuning on instruction data, and Direct Preference Optimization (DPO) to align model behavior with human preferences.
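To make the front end concrete, the sketch below shows how a Whisper-style log-mel representation can be computed with torchaudio. The parameter values (16 kHz sampling, 25 ms window, 10 ms hop, 128 mel bins) follow Whisper-large-v3 conventions and are assumptions of this sketch rather than code taken from the Qwen2-Audio release.

```python
# Sketch of a Whisper-style log-mel front end (an assumption of this overview,
# not code from the Qwen2-Audio release). Parameters follow Whisper-large-v3
# conventions: 16 kHz audio, 25 ms window, 10 ms hop, 128 mel bins.
import torch
import torchaudio

def log_mel_spectrogram(path: str) -> torch.Tensor:
    waveform, sr = torchaudio.load(path)           # (channels, samples)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != 16_000:
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)

    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16_000,
        n_fft=400,        # 25 ms analysis window at 16 kHz
        hop_length=160,   # 10 ms hop, i.e. 100 frames per second
        n_mels=128,       # Whisper-large-v3 uses a 128-bin mel filterbank
    )(waveform)

    # Log compression and dynamic-range clamping, as in the Whisper pipeline.
    log_mel = torch.clamp(mel, min=1e-10).log10()
    log_mel = torch.maximum(log_mel, log_mel.max() - 8.0)
    return (log_mel + 4.0) / 4.0
```

Keeping the front end identical to Whisper's is what lets the encoder reuse Whisper-large-v3's pre-trained weights directly.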
During pre-training, hierarchical tags were replaced with natural language prompts, which improved generalization. The model supports two interaction modes: Audio Analysis, for offline examination of audio files, and Voice Chat, for real-time spoken interaction. It switches between these modes based on the input itself, without explicit user switching, and can interpret audio and text inputs together.
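As an illustration of mixed audio and text input, the sketch below uses the Hugging Face transformers integration. The class and checkpoint names (Qwen2AudioForConditionalGeneration, AutoProcessor, Qwen/Qwen2-Audio-7B-Instruct) reflect the public release, but exact argument names can vary across transformers versions, and the file path is a placeholder.

```python
# Minimal sketch of the Audio Analysis mode via the Hugging Face integration.
# Class and checkpoint names reflect the public release; argument names may
# differ between transformers versions, and "speech.wav" is a placeholder path.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id)

conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "speech.wav"},
        {"type": "text", "text": "What is the speaker saying, and what emotion do you hear?"},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the waveform at the sampling rate the feature extractor expects (16 kHz).
audio, _ = librosa.load("speech.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True)
output_ids = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(response)
```

Voice Chat works the same way except that the user turn contains only audio; the model infers the intended mode from the input rather than from an explicit switch.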
Evaluation and Performance Metrics
The paper details an extensive evaluation, showing Qwen2-Audio's strong performance across datasets and tasks including Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Recognition (SER), and Vocal Sound Classification (VSC). Measured by metrics such as Word Error Rate (WER) for ASR and BLEU for translation, Qwen2-Audio consistently outperformed previous models, setting new benchmarks in understanding and interpreting diverse audio signals.
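For reference, the two headline metrics can be computed with standard tooling. The snippet below is a generic sketch using the jiwer and sacrebleu packages with toy reference and hypothesis strings, not data or scripts from the paper.

```python
# Generic metric sketch (toy strings, not data from the paper):
# WER for ASR via jiwer, corpus-level BLEU for S2TT via sacrebleu.
import jiwer
import sacrebleu

# ASR: WER = (substitutions + deletions + insertions) / reference word count.
references = ["the cat sat on the mat"]
hypotheses = ["the cat sat on a mat"]
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.1%}")   # 1 substitution over 6 words -> 16.7%

# S2TT: corpus BLEU over detokenized outputs; refs is a list of reference streams.
refs = [["Der Hund läuft im Park."]]
hyps = ["Der Hund rennt im Park."]
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU: {bleu.score:.1f}")
```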
Key results include WERs of 1.6% and 3.6% on the LibriSpeech test-clean and test-other sets, respectively, and leading BLEU scores in Speech-to-Text Translation across multiple language pairs. The model also shows substantial gains on SER and VSC over existing counterparts, and its state-of-the-art standing is further supported by objective evaluations on AIR-Bench.
Implications and Future Directions
Qwen2-Audio's advancements mark a substantial step in multi-modal language modeling, enabling more natural and open-ended audio interaction without the limitations of traditional tagging schemes. Its capabilities suggest applications in multimedia analysis, intelligent voice assistants, and automated audio transcription, offering pathways for richer human-computer interaction.
The paper implicitly suggests directions for future work, notably further scaling of model parameters and dataset size, and integrating broader language understanding to support more complex audio-visual task scenarios. The open-source release of Qwen2-Audio invites community contributions, potentially accelerating innovation in multi-modal and audio-analysis research. As the field moves toward more holistic integration of language and audio, models like Qwen2-Audio set foundational benchmarks for subsequent advances.