Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models (2311.07919v2)

Published 14 Nov 2023 in eess.AS, cs.CL, and cs.LG

Abstract: Recently, instruction-following audio-LLMs have received broad attention for audio interaction with humans. However, the absence of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. Consequently, most existing works have only been able to support a limited range of interaction capabilities. In this paper, we develop the Qwen-Audio model and address this limitation by scaling up audio-language pre-training to cover over 30 tasks and various audio types, such as human speech, natural sounds, music, and songs, to facilitate universal audio understanding abilities. However, directly co-training all tasks and datasets can lead to interference issues, as the textual labels associated with different datasets exhibit considerable variations due to differences in task focus, language, granularity of annotation, and text structure. To overcome the one-to-many interference, we carefully design a multi-task training framework by conditioning on a sequence of hierarchical tags to the decoder for encouraging knowledge sharing and avoiding interference through shared and specified tags respectively. Remarkably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Building upon the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which allows for input from various audios and text inputs, enabling multi-turn dialogues and supporting various audio-central scenarios.

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

The paper introduces Qwen-Audio, a large-scale audio-language model intended to broaden audio understanding in AI systems. Qwen-Audio addresses a critical gap: existing instruction-following audio-language models support only a limited range of tasks and audio types. The primary contribution is a unified pre-training framework covering more than 30 tasks across diverse audio types, enabling universal audio understanding.

Key Contributions

Qwen-Audio uses a single audio encoder for diverse inputs, including human speech, natural sounds, music, and songs. To co-train on all tasks and datasets without the one-to-many interference caused by differences in task focus, language, annotation granularity, and text structure, the authors design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags: shared tags encourage knowledge transfer among related tasks, while task-specific tags keep dissimilar label formats from interfering with one another. A minimal sketch of this tag conditioning appears below.
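
The following sketch, assuming Whisper-style special-token names (the paper's exact tag vocabulary may differ), illustrates how such a hierarchical prefix could be assembled before decoding.

# Minimal sketch of hierarchical tag conditioning, assuming Whisper-style
# special tokens; the released model's exact tag vocabulary may differ.

def build_task_prefix(audio_kind: str, language: str, task: str,
                      use_timestamps: bool) -> str:
    """Assemble the tag sequence that conditions the decoder.

    Shared tags (e.g. the language tag) promote knowledge transfer across
    datasets, while task-specific tags separate dissimilar label formats.
    """
    tags = [
        "<|startoftranscript|>" if audio_kind == "speech" else "<|startofanalysis|>",
        f"<|{language}|>",          # e.g. en, zh, or unknown
        f"<|{task}|>",              # e.g. transcribe, translate, caption, qa
        "<|timestamps|>" if use_timestamps else "<|notimestamps|>",
    ]
    return "".join(tags)

# Example: an English ASR sample decoded without timestamp prediction.
print(build_task_prefix("speech", "en", "transcribe", use_timestamps=False))
# <|startoftranscript|><|en|><|transcribe|><|notimestamps|>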

Results indicate that Qwen-Audio surpasses existing models across diverse benchmarks without any task-specific fine-tuning. Notable results include strong performance on ASR benchmarks such as LibriSpeech and AISHELL, on speech-to-text translation on CoVoST 2, and on non-speech audio understanding tasks such as acoustic scene classification, speech emotion recognition, and audio question answering.

Building on Qwen-Audio, the authors introduce Qwen-Audio-Chat, which supports multi-turn dialogue over mixed audio and text inputs, enabling audio-centric interaction with humans; one possible serialization of such a dialogue is sketched below.
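
As an illustration only, a multi-turn audio-plus-text exchange could be serialized as a ChatML-style transcript; the role delimiters and the <audio> markup here are assumptions for this sketch, not the released prompt format.

# Hypothetical ChatML-style serialization of a multi-turn, audio-plus-text
# dialogue; the role delimiters and <audio> markup are illustrative only.

turns = [
    {"role": "user",
     "content": "Audio 1: <audio>meeting.wav</audio>\nSummarize what is said."},
    {"role": "assistant",
     "content": "The speaker lists three action items for the coming week."},
    {"role": "user",
     "content": "What emotion does the speaker convey?"},
]

def to_chatml(dialogue):
    """Render each turn between <|im_start|> and <|im_end|> delimiters."""
    return "".join(
        f"<|im_start|>{t['role']}\n{t['content']}<|im_end|>\n" for t in dialogue
    )

print(to_chatml(turns))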

Methodology

  1. Architecture: Qwen-Audio pairs a Whisper-based audio encoder with the Qwen-7B LLM, separating audio encoding from language modeling so that new tasks can be supported without architectural changes (a minimal sketch of this wiring follows the list).
  2. Multi-task framework: the decoder is conditioned on a training format that combines transcription tags, language identification, task-specific tags, timestamp prediction flags, and output instructions, enabling effective task execution while mitigating one-to-many interference.
  3. Supervised fine-tuning: Qwen-Audio-Chat is obtained through instruction-based fine-tuning that aligns the model with human dialogue and with interleaved audio and text inputs.
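
Below is a minimal PyTorch sketch of the encoder-plus-LLM pairing from step 1, using toy dimensions and a learned projection as stand-ins; the released model's adapter, sizes, and attention masking are not reproduced here.

# Toy sketch of an audio encoder feeding an LLM: audio embeddings are
# projected into the text embedding space and prepended to the text tokens.
# Dimensions are deliberately small (the real Whisper-large encoder uses
# ~1280-dim states and Qwen-7B ~4096-dim hidden states).

import torch
import torch.nn as nn

class AudioLanguageModel(nn.Module):
    def __init__(self, n_mels=80, audio_dim=256, hidden_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-in for the Whisper-style encoder (mel features -> states).
        self.audio_encoder = nn.Sequential(nn.Linear(n_mels, audio_dim), nn.GELU())
        # Projects audio states into the LLM embedding space (an assumption here).
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # Stand-in for the Qwen-7B decoder; a real LLM is causal and much deeper.
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, mel, text_ids):
        audio_states = self.audio_proj(self.audio_encoder(mel))   # (B, Ta, H)
        text_states = self.token_emb(text_ids)                    # (B, Tt, H)
        states = torch.cat([audio_states, text_states], dim=1)    # audio prefix
        return self.lm_head(self.backbone(states))                # next-token logits

model = AudioLanguageModel()
logits = model(torch.randn(1, 100, 80), torch.randint(0, 1000, (1, 16)))
print(logits.shape)  # torch.Size([1, 116, 1000])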

Implications

The implications of Qwen-Audio are twofold. Practically, it enables more robust audio-language integration in AI applications, extending beyond traditional speech recognition to complex auditory scene analysis and multi-modal interaction. Theoretically, it demonstrates how large-scale multi-task learning can fuse distinct modalities, informing the design of cross-modal AI systems.

Future Directions

Future work may extend models like Qwen-Audio to a broader set of modalities, refining task integration while further reducing interference. Combining audio processing with visual modalities could yield comprehensive multimedia models, expanding AI's interpretive capacity in real-world environments.

Authors (8)
  1. Yunfei Chu (15 papers)
  2. Jin Xu (131 papers)
  3. Xiaohuan Zhou (13 papers)
  4. Qian Yang (146 papers)
  5. Shiliang Zhang (132 papers)
  6. Zhijie Yan (33 papers)
  7. Chang Zhou (105 papers)
  8. Jingren Zhou (198 papers)
Citations (180)