Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models (2311.07919v2)

Published 14 Nov 2023 in eess.AS, cs.CL, and cs.LG

Abstract: Recently, instruction-following audio-LLMs have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

The paper introduces Qwen-Audio, a large-scale audio-language model intended to broaden audio understanding in AI systems. Qwen-Audio addresses a critical gap: existing instruction-following audio-language models support only a limited range of tasks and audio types. The primary contribution is a unified pre-training framework covering more than 30 tasks across diverse audio types, enabling universal audio understanding.

Key Contributions

Qwen-Audio uses a single audio encoder for diverse inputs, including human speech, natural sounds, music, and songs. To co-train on all tasks and datasets without the one-to-many interference caused by differences in task focus, language, annotation granularity, and text structure, the authors design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags: shared tags encourage knowledge transfer among related tasks, while task-specific tags keep dissimilar label formats from interfering with one another. A minimal sketch of this tag conditioning appears below.
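
The following sketch, assuming Whisper-style special-token names (the paper's exact tag vocabulary may differ), illustrates how such a hierarchical prefix could be assembled before decoding.

# Minimal sketch of hierarchical tag conditioning, assuming Whisper-style
# special tokens; the released model's exact tag vocabulary may differ.

def build_task_prefix(audio_kind: str, language: str, task: str,
                      use_timestamps: bool) -> str:
    """Assemble the tag sequence that conditions the decoder.

    Shared tags (e.g. the language tag) promote knowledge transfer across
    datasets, while task-specific tags separate dissimilar label formats.
    """
    tags = [
        "<|startoftranscript|>" if audio_kind == "speech" else "<|startofanalysis|>",
        f"<|{language}|>",          # e.g. en, zh, or unknown
        f"<|{task}|>",              # e.g. transcribe, translate, caption, qa
        "<|timestamps|>" if use_timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)

# Example: an English ASR sample decoded without timestamp prediction.
print(build_task_prefix("speech", "en", "transcribe", use_timestamps=False))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>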

Results indicate that Qwen-Audio surpasses existing models across diverse benchmarks without any task-specific fine-tuning. Notable results include strong performance on ASR benchmarks such as LibriSpeech and AISHELL, on speech-to-text translation on CoVoST 2, and on non-speech audio understanding tasks such as acoustic scene classification, speech emotion recognition, and audio question answering.

Building on Qwen-Audio, the authors introduce Qwen-Audio-Chat, which supports multi-turn dialogue over mixed audio and text inputs, enabling audio-centric interaction with humans; one possible serialization of such a dialogue is sketched below.
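
As an illustration only, a multi-turn audio-plus-text exchange could be serialized as a ChatML-style transcript; the role delimiters and the <audio> markup here are assumptions for this sketch, not the released prompt format.

# Hypothetical ChatML-style serialization of a multi-turn, audio-plus-text
# dialogue; the role delimiters and <audio> markup are illustrative only.

turns = [
    {"role": "user",
     "content": "Audio 1: <audio>meeting.wav</audio>\nSummarize what is said."},
    {"role": "assistant",
     "content": "The speaker lists three action items for the coming week."},
    {"role": "user",
     "content": "What emotion does the speaker convey?"},
]

def to_chatml(dialogue):
    """Render each turn between <|im_start|> and <|im_end|> delimiters."""
    return "".join(
        f"<|im_start|>{t['role']}\n{t['content']}<|im_end|>\n" for t in dialogue
    )

print(to_chatml(turns))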

Methodology

  1. Architecture: Qwen-Audio pairs a Whisper-based audio encoder with the Qwen-7B LLM, separating audio encoding from language modeling so that new tasks can be supported without architectural changes (a minimal sketch of this wiring follows the list).
  2. Multi-task framework: the decoder is conditioned on a training format that combines transcription tags, language identification, task-specific tags, timestamp prediction flags, and output instructions, enabling effective task execution while mitigating one-to-many interference.
  3. Supervised fine-tuning: Qwen-Audio-Chat is obtained through instruction-based fine-tuning that aligns the model with human dialogue and with interleaved audio and text inputs.
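
Below is a minimal PyTorch sketch of the encoder-plus-LLM pairing from step 1, using toy dimensions and a learned projection as stand-ins; the released model's adapter, sizes, and attention masking are not reproduced here.

# Toy sketch of an audio encoder feeding an LLM: audio embeddings are
# projected into the text embedding space and prepended to the text tokens.
# Dimensions are deliberately small (the real Whisper-large encoder uses
# ~1280-dim states and Qwen-7B ~4096-dim hidden states).

import torch
import torch.nn as nn

class AudioLanguageModel(nn.Module):
    def __init__(self, n_mels=80, audio_dim=256, hidden_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for the Whisper-style encoder (mel features -> states).
        self.audio_encoder = nn.Sequential(nn.Linear(n_mels, audio_dim), nn.GELU())
        # Projects audio states into the LLM embedding space (an assumption here).
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Stand-in for the Qwen-7B decoder; a real LLM is causal and much deeper.
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, mel, text_ids):
        audio_states = self.audio_proj(self.audio_encoder(mel))   # (B, Ta, H)
        text_states = self.token_emb(text_ids)                    # (B, Tt, H)
        states = torch.cat([audio_states, text_states], dim=1)    # audio prefix
        return self.lm_head(self.backbone(states))                # next-token logits

model = AudioLanguageModel()
logits = model(torch.randn(1, 100, 80), torch.randint(0, 1000, (1, 16)))
print(logits.shape)  # torch.Size([1, 116, 1000])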

Implications

The implications of Qwen-Audio are twofold. Practically, it enables more robust audio-language integration in AI applications, extending beyond traditional speech recognition to complex auditory scene analysis and multi-modal interaction. Theoretically, it demonstrates how large-scale multi-task learning can fuse distinct modalities, informing the design of cross-modal AI systems.

Future Directions

Future work may extend models like Qwen-Audio to a broader set of modalities, refining task integration while further reducing interference. Combining audio processing with visual modalities could yield comprehensive multimedia models, expanding AI's interpretive capacity in real-world environments.

Authors (8)
  1. Yunfei Chu (15 papers)
  2. Jin Xu (131 papers)
  3. Xiaohuan Zhou (13 papers)
  4. Qian Yang (146 papers)
  5. Shiliang Zhang (132 papers)
  6. Zhijie Yan (33 papers)
  7. Chang Zhou (105 papers)
  8. Jingren Zhou (198 papers)
Citations (180)