UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2310.00704v6)

Published 1 Oct 2023 in cs.SD and eess.AS

Abstract: Large language models (LLMs) have demonstrated the capability to handle a variety of generative tasks. This paper presents the UniAudio system, which, unlike prior task-specific approaches, leverages LLM techniques to generate multiple types of audio (including speech, sounds, music, and singing) with given input conditions. UniAudio 1) first tokenizes all types of target audio along with other condition modalities, 2) concatenates source-target pair as a single sequence, and 3) performs next-token prediction using LLM. Also, a multi-scale Transformer model is proposed to handle the overly long sequences caused by the residual vector quantization based neural codec in tokenization. Training of UniAudio is scaled up to 165K hours of audio and 1B parameters, based on all generative tasks, aiming to obtain sufficient prior knowledge not only in the intrinsic properties of audio but also the inter-relationship between audio and other modalities. Therefore, the trained UniAudio model has the potential to become a foundation model for universal audio generation: it shows strong capability in all trained tasks and can seamlessly support new audio generation tasks after simple fine-tuning. Experiments demonstrate that UniAudio achieves state-of-the-art or at least competitive results on most of the 11 tasks. Demo and code are released at https://github.com/yangdongchao/UniAudio

Overview of UniAudio: A Universal Audio Foundation Model

The paper presents UniAudio, a model designed to achieve universal audio generation by leveraging techniques from LLMs. UniAudio uniquely positions itself within the generative AI landscape by enabling multi-modal audio generation tasks, including speech, sounds, music, and singing, under a unified framework. This model capitalizes on generative knowledge across diverse audio types, conditioned on inputs like phoneme sequences, textual descriptions, and other audio modalities.

Methodology

UniAudio's approach can be summarized through three key innovations:

  1. Universal Tokenization: All input modalities are tokenized into discrete sequences. A universal neural codec model maps the different audio types into a shared discrete latent space using residual vector quantization (RVQ). RVQ keeps the per-frame representation compact, but the multiple codebook levels per frame still produce long token sequences, which motivates the multi-scale Transformer described next.
  2. Multi-scale Transformer Architecture: To manage these lengthy token sequences, UniAudio uses a global-local Transformer design. The global Transformer models inter-frame correlations, while a smaller local Transformer handles intra-frame dependencies among the RVQ codes, reducing computational cost without giving up sequence modeling capacity; a sketch of this layout follows the list.
  3. Unified Task Formulation: The model handles multiple audio generation tasks by concatenating tokenized source (condition) and target sequences into a single token stream and performing next-token prediction. This uniform formulation lets one model train and infer across all tasks; a sketch of the sequence construction is also shown after the list.
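
To make the global-local division concrete, below is a minimal sketch under stated assumptions, not the paper's released implementation: the embeddings of the n_q RVQ codes in each frame are summed into one frame embedding, a global Transformer models context across frames, and a small local Transformer refines the codes within each frame. Layer counts, dimensions, and the omission of causal masks are illustrative choices.

```python
import torch
import torch.nn as nn

class GlobalLocalSketch(nn.Module):
    """Illustrative global-local Transformer over RVQ token frames (hypothetical)."""

    def __init__(self, vocab: int, d_model: int = 512, n_q: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        g_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        l_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.global_tf = nn.TransformerEncoder(g_layer, num_layers=6)  # inter-frame context
        self.local_tf = nn.TransformerEncoder(l_layer, num_layers=2)   # intra-frame codes
        self.head = nn.Linear(d_model, vocab)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, T, n_q) integer RVQ indices, one row per audio frame
        B, T, Q = codes.shape
        tok_emb = self.embed(codes)                      # (B, T, Q, d)
        frame_emb = tok_emb.sum(dim=2)                   # (B, T, d): merge codes per frame
        ctx = self.global_tf(frame_emb)                  # (B, T, d): frame-level context
        # broadcast each frame's context to its Q code positions, then refine locally
        local_in = (tok_emb + ctx.unsqueeze(2)).view(B * T, Q, -1)
        local_out = self.local_tf(local_in)              # (B*T, Q, d)
        return self.head(local_out).view(B, T, Q, -1)    # per-code logits (causal masks omitted)
```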
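
The unified task formulation from item 3 can likewise be sketched as plain sequence concatenation plus a causal language-modeling loss. The special tokens, helper names, and the `model` callable below are hypothetical and intended only to show the shape of the idea.

```python
import torch
import torch.nn.functional as F

def build_sequence(task_id: int, cond_tokens: torch.Tensor,
                   target_tokens: torch.Tensor, bos: int, eos: int) -> torch.Tensor:
    """Concatenate [task] <condition> [bos] <target> [eos] into one 1-D token sequence."""
    return torch.cat([
        torch.tensor([task_id]),   # special token identifying the task (e.g. TTS, VC, TTA)
        cond_tokens.flatten(),     # tokenized condition: phonemes, text description, or audio
        torch.tensor([bos]),
        target_tokens.flatten(),   # flattened RVQ codes of the target audio
        torch.tensor([eos]),
    ])

def next_token_loss(model, seq: torch.Tensor) -> torch.Tensor:
    """Standard next-token prediction: predict token t+1 from tokens up to t."""
    logits = model(seq[:-1].unsqueeze(0))                # (1, L-1, vocab)
    return F.cross_entropy(logits.squeeze(0), seq[1:])
```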

Experiments and Results

UniAudio was trained on roughly 165,000 hours of audio covering a broad set of tasks, with the model scaled to about 1 billion parameters. It was evaluated on 11 tasks across the training and fine-tuning stages, achieving state-of-the-art or competitive results on most of them. Notably, joint training across tasks yielded mutual benefits, improving performance relative to task-specific models.

Key experimental outcomes included:

  • Text-to-Speech and Voice Conversion: UniAudio achieved results superior or comparable to advanced models such as VALL-E and NaturalSpeech 2 in terms of word error rate (WER) and speaker similarity.
  • Speech Enhancement and Target Speaker Extraction: UniAudio attained high DNSMOS scores, while traditional signal-level metrics were less favorable, reflecting that such metrics do not fully capture the behavior of generative, LLM-based approaches compared with conventional regression-style methods.
  • Text-to-Sound and Text-to-Music Generation: The model exhibited competitive performance on these tasks as well, and its ability to take on newly introduced tasks through simple fine-tuning demonstrated its scalability and adaptability.

Implications and Future Research

The development of UniAudio suggests significant implications for future audio generation models. It illustrates a shift toward universal foundation models, where emerging audio generation needs can be supported seamlessly through simple fine-tuning. The unified approach also enables broader and more efficient data utilization, offering insight into jointly training diverse tasks within a single framework.

Looking ahead, extensions of UniAudio could incorporate unlabeled data to improve learning robustness, scale the foundation model to an even wider range of audio generation tasks, and integrate with domain-specific models for application-specific refinements.

In conclusion, UniAudio represents a pivotal advance in audio foundation models, showcasing the power and potential of LLM techniques extended beyond text to the evolving domain of audio. Its release, along with accompanying code and demonstrations, aims to seed further advancements in universal audio generation research.

References (79)
  1. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
  2. Seamlessm4t-massively multilingual & multimodal machine translation. arXiv preprint arXiv:2308.11596, 2023.
  3. Audiolm: a language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  4. Yourtts: Towards zero-shot multi-speaker tts and zero-shot voice conversion for everyone. In International Conference on Machine Learning, pp. 2709–2720. PMLR, 2022.
  5. Fullsubnet+: Channel attention fullsubnet with complex spectrograms for speech enhancement. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  7857–7861. IEEE, 2022.
  6. A survey on recent deep learning-driven singing voice synthesis systems. In 2021 IEEE International Conference on Artificial Intelligence and Virtual Reality (AIVR), pp.  319–323. IEEE, 2021.
  7. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979, 2020.
  8. Simple and controllable music generation. arXiv preprint arXiv:2306.05284, 2023.
  9. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
  10. Clotho: An audio captioning dataset. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  736–740. IEEE, 2020.
  11. Tokensplit: Using discrete speech representations for direct, refined, and transcript-conditioned speech separation and recognition. arXiv preprint arXiv:2308.10415, 2023.
  12. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.  776–780. IEEE, 2017.
  13. Signal estimation from modified short-time fourier transform. IEEE Transactions on acoustics, speech, and signal processing, 32(2):236–243, 1984.
  14. Complex neural spatial filter: Enhancing multi-channel target speech separation in complex domain. IEEE Signal Processing Letters, 28:1370–1374, 2021.
  15. Prompttts: Controllable text-to-speech with text descriptions. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  1–5. IEEE, 2023.
  16. Fullsubnet: A full-band and sub-band fusion model for real-time single-channel speech enhancement. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  6633–6637. IEEE, 2021.
  17. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012. doi: 10.1109/MSP.2012.2205597.
  18. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  19. Multi-singer: Fast multi-singer singing voice vocoder with a large-scale corpus. In Proceedings of the 29th ACM International Conference on Multimedia, pp.  3945–3954, 2021a.
  20. Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023a.
  21. Make-a-voice: Unified voice synthesis with discrete representation. arXiv preprint arXiv:2305.19269, 2023b.
  22. How far are we from robust voice conversion: A survey. In 2021 IEEE Spoken Language Technology Workshop (SLT), pp. 514–521. IEEE, 2021b.
  23. Mega-tts: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv preprint arXiv:2306.03509, 2023.
  24. Libri-light: A benchmark for asr with limited or no supervision. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  7669–7673. IEEE, 2020.
  25. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540, 2023.
  26. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp.  119–132, 2019.
  27. A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5220–5224. IEEE, 2017.
  28. Audiogen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022.
  29. High-fidelity audio compression with improved RVQGAN. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=qjnl1QUnFA.
  30. Voicebox: Text-guided multilingual universal speech generation at scale. 2023. URL https://dl.fbaipublicfiles.com/voicebox/paper.pdf.
  31. Audioldm: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023a.
  32. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023b.
  33. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pp.  11020–11028, 2022.
  34. Any-to-many voice conversion with location-relative sequence-to-sequence modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:1717–1728, 2021.
  35. Conditional diffusion probabilistic model for speech enhancement. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  7402–7406. IEEE, 2022.
  36. Yi Luo and Nima Mesgarani. Conv-tasnet: Surpassing ideal time–frequency magnitude masking for speech separation. IEEE/ACM transactions on audio, speech, and language processing, 27(8):1256–1266, 2019.
  37. The million song dataset challenge. In Proceedings of the 21st International Conference on World Wide Web, pp.  909–916, 2012.
  38. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv preprint arXiv:2303.17395, 2023.
  39. Dcase 2017 challenge setup: Tasks, datasets and baseline system. In DCASE 2017-Workshop on Detection and Classification of Acoustic Scenes and Events, 2017.
  40. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  41. OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  42. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp.  5206–5210. IEEE, 2015.
  43. Specaugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
  44. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020.
  45. Improving language understanding by generative pre-training. 2018.
  46. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  47. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  48. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558, 2020.
  49. Speech enhancement and dereverberation with diffusion-based generative models. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.
  50. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  51. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925, 2023.
  52. Prospects for articulatory synthesis: A position paper. In 4th ISCA tutorial and research workshop (ITRW) on speech synthesis, 2001.
  53. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023.
  54. Aishell-3: A multi-speaker mandarin tts corpus and the baselines. arXiv preprint arXiv:2010.11567, 2020.
  55. Editts: Score-based editing for controllable text-to-speech. arXiv preprint arXiv:2110.02584, 2021.
  56. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561, 2021.
  57. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  58. Cstr vctk corpus: English multi-speaker corpus for cstr voice cloning toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR), 6:15, 2017.
  59. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023a.
  60. Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking. arXiv preprint arXiv:1810.04826, 2018.
  61. Nadiffuse: Noise-aware diffusion-based model for speech enhancement. arXiv preprint arXiv:2309.01212, 2023b.
  62. First step towards end-to-end parametric tts synthesis: Generating spectral parameters with neural attention. In Interspeech, pp.  2243–2247, 2016.
  63. Speechx: Neural codec language model as a versatile speech transformer. arXiv preprint arXiv:2308.06873, 2023c.
  64. Opencpop: A high-quality open source chinese popular song corpus for singing voice synthesis. arXiv preprint arXiv:2201.07429, 2022.
  65. Audit: Audio editing by following instructions with latent diffusion models. arXiv preprint arXiv:2304.00830, 2023d.
  66. Lm-vc: Zero-shot voice conversion via speech generation based on language models. arXiv preprint arXiv:2306.10521, 2023e.
  67. A reverberation-time-aware approach to speech dereverberation based on deep neural networks. IEEE/ACM transactions on audio, speech, and language processing, 25(1):102–111, 2016.
  68. Instructtts: Modelling expressive tts in discrete latent space with natural language style prompt. arXiv preprint arXiv:2301.13662, 2023a.
  69. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. arXiv preprint arXiv:2305.02765, 2023b.
  70. Diffsound: Discrete diffusion model for text-to-sound generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023c.
  71. Permutation invariant training of deep models for speaker-independent multi-talker speech separation. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  241–245. IEEE, 2017.
  72. Megabyte: Predicting million-byte sequences with multiscale transformers. arXiv preprint arXiv:2305.07185, 2023.
  73. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
  74. Statistical parametric speech synthesis using deep neural networks. In 2013 ieee international conference on acoustics, speech and signal processing, pp.  7962–7966. IEEE, 2013.
  75. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019.
  76. M4singer: A multi-style, multi-singer and musical score provided mandarin singing corpus. Advances in Neural Information Processing Systems, 35:6914–6926, 2022.
  77. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926, 2023.
  78. Speakerbeam: Speaker aware neural network for target speaker extraction in speech mixtures. IEEE Journal of Selected Topics in Signal Processing, 13(4):800–814, 2019.
  79. Neural target speech extraction: An overview. IEEE Signal Processing Magazine, 40(3):8–29, 2023.
Authors (12)
  1. Dongchao Yang (51 papers)
  2. Jinchuan Tian (33 papers)
  3. Xu Tan (164 papers)
  4. Rongjie Huang (62 papers)
  5. Songxiang Liu (28 papers)
  6. Xuankai Chang (61 papers)
  7. Jiatong Shi (82 papers)
  8. Sheng Zhao (75 papers)
  9. Jiang Bian (229 papers)
  10. Xixin Wu (85 papers)
  11. Zhou Zhao (218 papers)
  12. Helen Meng (204 papers)
Citations (90)