Audiobox: Unified Audio Generation with Natural Language Prompts (2312.15821v1)
Abstract: Audio is an essential part of our lives, but creating it often requires expertise and is time-consuming. Over the past year, research communities have made great progress in advancing the performance of large-scale audio generative models for a single modality (speech, sound, or music) by adopting more powerful generative models and scaling data. However, these models lack controllability in several respects: speech generation models cannot synthesize novel styles from a text description and are limited in domain coverage (e.g., outdoor environments); sound generation models offer only coarse-grained control based on descriptions like "a person speaking" and would only generate mumbled human voices. This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities. We design description-based and example-based prompting to enhance controllability and unify the speech and sound generation paradigms. We allow transcript, vocal, and other audio styles to be controlled independently when generating speech. To improve model generalization with limited labels, we adapt a self-supervised infilling objective to pre-train on large quantities of unlabeled audio. Audiobox sets new benchmarks for speech and sound generation (0.745 similarity on Librispeech for zero-shot TTS; 0.77 FAD on AudioCaps for text-to-sound) and unlocks new methods for generating audio with novel vocal and acoustic styles. We further integrate Bespoke Solvers, which speed up generation by over 25 times compared to the default ODE solver for flow-matching, without loss of performance on several tasks. Our demo is available at https://audiobox.metademolab.com/
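The abstract's core technique, flow-matching, trains a model to predict the velocity field that transports noise to data along a simple probability path; generation then integrates that field with an ODE solver (the step Bespoke Solvers accelerate). As a rough illustration only, not the paper's implementation, here is a minimal sketch of the conditional flow-matching training loss with the optimal-transport path of Lipman et al. (2023); the function and variable names are illustrative assumptions:

```python
import numpy as np

def cfm_loss(model, x1, rng, sigma_min=1e-5):
    """One conditional flow-matching loss evaluation (OT path).

    model(x_t, t) predicts a velocity field; x1 is a (batch, dim) array of
    target samples (e.g. audio feature frames). All names are illustrative.
    """
    x0 = rng.standard_normal(x1.shape)             # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))         # per-sample time in [0, 1]
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1  # point on the straight OT path
    v_target = x1 - (1 - sigma_min) * x0           # constant target velocity
    v_pred = model(x_t, t)
    return np.mean((v_pred - v_target) ** 2)       # regression onto the velocity

# Toy usage with a trivial stand-in model that always predicts zero velocity.
rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))
loss = cfm_loss(lambda x, t: np.zeros_like(x), x1, rng)
```

At sampling time, one would start from noise and integrate `model(x_t, t)` from t=0 to t=1 with an ODE solver; reducing the number of solver steps without quality loss is exactly what the paper's Bespoke Solvers integration targets.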
Authors: Apoorv Vyas, Bowen Shi, Matthew Le, Andros Tjandra, Yi-Chiao Wu, Baishan Guo, Jiemin Zhang, Xinyue Zhang, Robert Adkins, William Ngan, Jeff Wang, Ivan Cruz, Bapi Akula, Akinniyi Akinyemi, Brian Ellis, Rashel Moritz, Yael Yungster, Alice Rakotoarison, Liang Tan, Chris Summers