Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
The paper introduces Qwen-Audio, a large-scale audio-language model (audio-LLM) designed to broaden the audio-processing capabilities of LLM-based systems. Qwen-Audio addresses a key gap: existing instruction-following audio-LLMs cover only a narrow range of tasks and audio types. The central contribution is a multi-task pre-training setup spanning more than 30 tasks and diverse audio types, enabling universal audio understanding.
Key Contributions
Qwen-Audio uses a single audio encoder for diverse inputs, including human speech, natural sounds, and music. To handle interference caused by the varying textual labels across datasets, it introduces a multi-task training framework built on hierarchical tags: shared tags encourage knowledge sharing among related tasks, while task-specific tags prevent the one-to-many interference that arises from differing text structures and annotation granularity (a minimal sketch of such a tag scheme follows).
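To make the hierarchical tagging idea concrete, the sketch below shows one way such a conditioning prefix could be assembled before the target text. This is a minimal illustration: the token names (`<|startoftranscripts|>`, `<|transcribe|>`, `<|notimestamps|>`, and so on) are placeholders patterned after Whisper-style special tokens, not necessarily the exact tags used by Qwen-Audio.

```python
# Illustrative sketch of a hierarchical multi-task tag prefix.
# Token names are placeholders, not the paper's exact vocabulary.

def build_task_prefix(
    audio_language: str,   # e.g. "en", "zh", or "unknown" for non-speech audio
    task: str,             # e.g. "transcribe", "translate", "caption"
    text_language: str,    # language of the expected output text
    use_timestamps: bool,  # whether timestamps should be predicted
) -> str:
    """Compose the conditioning prefix placed before the target text."""
    tags = [
        "<|startoftranscripts|>",                      # top-level tag shared by all tasks
        f"<|{audio_language}|>",                       # audio language (outer hierarchy level)
        f"<|{task}|>",                                 # task tag (inner hierarchy level)
        f"<|{text_language}|>",                        # output language
        "<|timestamps|>" if use_timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)

# Example: English ASR without timestamps vs. audio captioning.
print(build_task_prefix("en", "transcribe", "en", use_timestamps=False))
print(build_task_prefix("unknown", "caption", "en", use_timestamps=False))
```

Tasks that share an outer tag receive shared supervision at that level, while the inner tags keep their output formats and annotation granularity separate.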
Results indicate that Qwen-Audio consistently surpasses prior models across a range of benchmarks without any task-specific fine-tuning. Notable results include strong performance on ASR benchmarks such as LibriSpeech and Aishell, speech-to-text translation on CoVoST 2, and further audio-understanding tasks including Acoustic Scene Classification, Speech Emotion Recognition, and Audio Question Answering.
Building on Qwen-Audio, the authors introduce Qwen-Audio-Chat, which supports multi-turn dialogue centered on audio. It accepts flexible combinations of audio and text inputs, enabling more natural interaction with human users.
Methodology
- Architecture: Qwen-Audio pairs a Whisper-based audio encoder with the Qwen-7B LLM, separating audio encoding from language modeling so that new tasks can be added without architectural changes (see the coupling sketch after this list).
- Multi-task Framework: A shared training format that conditions the decoder on transcription tags, language-identification tags, task-specific tags, timestamp flags, and output instructions. This lets many tasks be trained jointly while mitigating one-to-many interference.
- Supervised Fine-tuning: Qwen-Audio-Chat is obtained via instruction-based fine-tuning, aligning the model with human dialogue and enabling it to follow complex interactions that mix audio and text (an illustrative chat-prompt format also appears below).
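As referenced in the Architecture bullet, the following PyTorch sketch illustrates the general encoder-to-LLM coupling: frame-level audio features are mapped into the LLM's embedding space and prepended to the text embeddings. The linear projection, the dimensions, and the stand-in modules are illustrative assumptions, not the paper's exact configuration.

```python
# Schematic sketch of coupling an audio encoder to a decoder-only LLM.
# Sizes and the adapter layer are assumptions for illustration.

import torch
import torch.nn as nn


class AudioLLMSketch(nn.Module):
    def __init__(self, audio_encoder: nn.Module, llm: nn.Module,
                 audio_dim: int = 1280, llm_dim: int = 4096):
        super().__init__()
        self.audio_encoder = audio_encoder          # e.g. a Whisper-style encoder
        self.proj = nn.Linear(audio_dim, llm_dim)   # assumed adapter into the LLM space
        self.llm = llm                              # decoder-only LLM (e.g. Qwen-7B)

    def forward(self, mel: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Encode audio into frame-level features: (batch, frames, audio_dim).
        audio_feats = self.audio_encoder(mel)
        # Project into the LLM embedding space and prepend to the text embeddings,
        # so the LLM attends over audio and text jointly.
        audio_embeds = self.proj(audio_feats)
        inputs_embeds = torch.cat([audio_embeds, text_embeds], dim=1)
        return self.llm(inputs_embeds)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; real models would replace these.
    toy_encoder = nn.Sequential(nn.Linear(80, 1280))   # mel bins -> encoder features
    toy_llm = nn.Sequential(nn.Linear(4096, 4096))     # stands in for the LLM
    model = AudioLLMSketch(toy_encoder, toy_llm)

    mel = torch.randn(2, 100, 80)        # (batch, audio frames, mel bins)
    text = torch.randn(2, 16, 4096)      # (batch, text tokens, llm dim)
    out = model(mel, text)
    print(out.shape)                     # torch.Size([2, 116, 4096])
```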
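For the supervised fine-tuning stage, a dialogue with interleaved audio and text has to be serialized into a single training prompt. The sketch below shows one plausible way to do this; the ChatML-style role markers and the `<audio>...</audio>` wrapper, as well as the sample dialogue itself, are assumptions for illustration rather than the exact template used by Qwen-Audio-Chat.

```python
# Illustrative serialization of a multi-turn, audio-plus-text dialogue.
# Role markers and audio wrapper tags are assumed, not taken from the paper.

def format_chat(turns: list[dict]) -> str:
    """Serialize a list of {'role', 'text', 'audio'} turns into one prompt string."""
    pieces = []
    for turn in turns:
        content = turn["text"]
        if turn.get("audio"):
            # Audio inputs are referenced inline so they can be interleaved
            # freely with text across turns.
            content = f"<audio>{turn['audio']}</audio>\n" + content
        pieces.append(f"<|im_start|>{turn['role']}\n{content}<|im_end|>")
    return "\n".join(pieces)


dialogue = [
    {"role": "user", "audio": "meeting.wav",
     "text": "What language is being spoken, and what is the main topic?"},
    {"role": "assistant",
     "text": "The speaker is using English and is summarizing quarterly sales."},
    {"role": "user", "text": "Translate the first sentence into French."},
]
print(format_chat(dialogue))
```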
Implications
The implications are twofold. Practically, Qwen-Audio paves the way for more robust audio-language integration in AI applications, extending beyond conventional speech recognition to complex auditory scene analysis and multi-modal interaction. Theoretically, it deepens understanding of how large-scale multi-task training can fuse distinct modalities, informing future cross-modal systems.
Future Directions
Future work may extend models like Qwen-Audio to broader sets of modalities, refining task integration while further reducing interference. Combining audio processing with visual modalities could lead to comprehensive multimedia models, expanding AI's interpretive capacity in real-world environments.