Masked Audio Generation using a Single Non-Autoregressive Transformer (2401.04577v2)
Abstract: We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT comprises a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence over several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which are then used in subsequent decoding steps. Lastly, we explore a hybrid version of MAGNeT that fuses autoregressive and non-autoregressive models: the first few seconds are generated autoregressively while the rest of the sequence is decoded in parallel. We demonstrate the efficiency of MAGNeT on text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines while being significantly faster (7x faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each component of MAGNeT, and we point to the trade-offs between autoregressive and non-autoregressive modeling in terms of latency, throughput, and generation quality. Samples are available on our demo page: https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.
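The iterative decoding loop described in the abstract (predict all masked positions, keep the most confident predictions, re-mask the rest according to a scheduler) can be sketched as follows. This is a minimal, token-level illustration in the spirit of MaskGIT-style masked decoding, not the paper's implementation: `toy_model`, the vocabulary size, and the cosine scheduler details are illustrative assumptions, and MAGNeT itself masks *spans* of tokens across several codebook streams and applies rescoring with an external model.

```python
import math
import random

MASK = -1  # hypothetical mask token id (illustrative)

def cosine_schedule(step, total_steps):
    # Fraction of positions left masked after this step (cosine scheduler,
    # as in MaskGIT-style decoding; MAGNeT's exact scheduler may differ).
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

def toy_model(tokens, vocab_size):
    # Stand-in for the non-autoregressive transformer: returns a
    # (token, confidence) pair per position. A real model would produce
    # logits over the vocabulary conditioned on the unmasked context.
    return [(random.randrange(vocab_size), random.random()) for _ in tokens]

def iterative_decode(seq_len, vocab_size, steps=8, seed=0):
    random.seed(seed)
    tokens = [MASK] * seq_len
    for step in range(steps):
        preds = toy_model(tokens, vocab_size)
        # Fill masked positions with predictions; already-decoded tokens
        # are kept with confidence 1.0 so they are never re-masked.
        cand = [p if t == MASK else (t, 1.0) for t, p in zip(tokens, preds)]
        # How many positions stay masked going into the next step.
        n_masked = int(cosine_schedule(step, steps) * seq_len)
        # Re-mask the lowest-confidence predictions.
        order = sorted(range(seq_len), key=lambda i: cand[i][1])
        remask = set(order[:n_masked])
        tokens = [MASK if i in remask else cand[i][0] for i in range(seq_len)]
    return tokens

# After the final step the schedule reaches zero, so no positions remain masked.
decoded = iterative_decode(seq_len=16, vocab_size=100)
```

Because the whole sequence is predicted at every step, the number of model calls is fixed by `steps` rather than by sequence length, which is the source of the speedup over autoregressive decoding reported in the abstract.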
- MusicLM: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
- Automatic speech recognition: A deep learning approach, 2008.
- AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023a.
- SoundStorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636, 2023b.
- MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Muse: Text-to-image generation via masked generative transformers. arXiv preprint arXiv:2301.00704, 2023.
- Simple and controllable music generation. arXiv preprint arXiv:2306.05284, 2023.
- FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
- Learning-rate-free learning by d-adaptation. arXiv preprint arXiv:2301.07733, 2023.
- Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
- Singsong: Generating musical accompaniments from singing. arXiv preprint arXiv:2301.12662, 2023.
- High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
- S. Forsgren and H. Martiros. Riffusion: Stable diffusion for real-time music generation. URL https://riffusion.com/about, 2022.
- Foley music: Learning to generate music from videos. In Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI. Springer, 2020.
- VampNet: Music generation via masked acoustic token modeling. arXiv preprint arXiv:2307.04686, 2023.
- Augmentation invariant discrete representation for generative spoken language modeling. In IWSLT, 2023.
- Mask-predict: Parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324, 2019.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
- The curious case of neural text degeneration. In International Conference on Learning Representations, 2020.
- HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- MuLan: A joint embedding of music audio and natural language. arXiv preprint arXiv:2208.12415, 2022.
- Noise2music: Text-conditioned music generation with diffusion models. arXiv preprint arXiv:2302.03917, 2023a.
- Make-an-audio: Text-to-audio generation with prompt-enhanced diffusion models. arXiv preprint arXiv:2301.12661, 2023b.
- Text-free prosody-aware generative spoken language modeling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.
- Fréchet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018.
- Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
- Efficient training of audio transformers with patchout. arXiv preprint arXiv:2110.05069, 2021.
- AudioGen: Textually guided audio generation. arXiv preprint arXiv:2209.15352, 2022a.
- Audio language modeling using perceptually-guided discrete representations. arXiv preprint arXiv:2211.01223, 2022b.
- On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9, 2021.
- Efficient neural music generation. arXiv preprint arXiv:2305.15719, 2023.
- BigVGAN: A universal neural vocoder with large-scale training. arXiv preprint arXiv:2206.04658, 2022.
- xFormers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.
- Improved masked image generation with token-critic. In European Conference on Computer Vision. Springer, 2022.
- Jen-1: Text-guided universal music generation with omnidirectional diffusion models. arXiv preprint arXiv:2308.04729, 2023.
- Rethinking evaluation in asr: Are our models robust enough? arXiv preprint arXiv:2010.11745, 2020.
- AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503, 2023a.
- AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023b.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Speaking style conversion with discrete self-supervised units. arXiv preprint arXiv:2212.09730, 2022.
- Kinyugo Maina. Msanii: High fidelity music synthesis on a shoestring budget. arXiv preprint arXiv:2301.06468, 2023.
- Speech resynthesis from discrete disentangled self-supervised representations. arXiv preprint arXiv:2104.00355, 2021.
- Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 2021.
- Do transformers need deep long-range memory? arXiv preprint arXiv:2007.03356, 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020.
- CrowdMOS: An approach for crowdsourcing mean opinion score studies. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011.
- High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- Efficient content-based sparse attention with routing transformers. Transactions of the Association for Computational Linguistics, 2021.
- Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Clip-sculptor: Zero-shot generation of high-fidelity and diverse shapes from natural language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023.
- I hear your true colors: Image guided audio generation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
- Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023.
- Diffsound: Discrete diffusion model for text-to-sound generation. arXiv preprint arXiv:2207.09983, 2022.
- SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926, 2023.