Masked Audio Generation using a Single Non-Autoregressive Transformer (2401.04577v2)

Published 9 Jan 2024 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which, we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which will be then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse between autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is being decoded in parallel. We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines, while being significantly faster (x7 faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.

Summary

  • The paper presents MAGNeT, a novel transformer model that uses iterative masked prediction to achieve a 7x faster inference speed than traditional autoregressive methods.
  • It employs a multi-stream discrete representation via EnCodec and incorporates restricted contextual attention along with an external rescoring mechanism to enhance audio quality.
  • Experimental results on text-to-music and text-to-audio tasks demonstrate that MAGNeT matches the quality of autoregressive models while significantly reducing latency for real-time applications.

Overview of "Masked Audio Generation using a Single Non-Autoregressive Transformer"

Introduction

The paper "Masked Audio Generation using a Single Non-Autoregressive Transformer" introduces MAGNeT—Masked Audio Generation using a Non-autoregressive Transformer. MAGNeT leverages advancements in self-supervised representation learning, sequence modeling, and audio synthesis to enhance high-quality conditional audio generation tasks, specifically text-to-music and text-to-audio generation. The key innovation lies in its non-autoregressive nature, significantly reducing inference time while maintaining competitive quality compared to autoregressive baselines.

Methodology

MAGNeT first transforms audio signals into a multi-stream discrete token representation via EnCodec; these tokens form the foundation of the generative process. A single non-autoregressive transformer is then trained with masked prediction over spans of tokens, and at inference time the output sequence is constructed iteratively: masked positions are predicted, the most confident predictions are committed, and the rest are re-masked over several decoding steps until the sequence is complete.
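
To make the decoding procedure concrete, the snippet below is a minimal sketch of MaskGIT-style iterative masked decoding for a single codebook stream, with a cosine masking schedule and a comment marking where rescoring would slot in. The `model` interface, number of decoding steps, and temperature are illustrative assumptions rather than the authors' implementation (MAGNeT additionally decodes the codebook streams sequentially and masks whole spans rather than individual tokens).

```python
import math
import torch

def masked_decode(model, cond, seq_len, vocab_size, num_steps=20, temperature=1.0):
    """Minimal MaskGIT-style iterative decoding sketch for one codebook stream.

    Assumption (not the paper's code): model(tokens, cond) returns logits of
    shape [seq_len, vocab_size] for a partially masked token sequence.
    """
    mask_id = vocab_size                                             # sentinel id for masked positions
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)       # start fully masked
    for step in range(num_steps):
        logits = model(tokens, cond)                                 # [seq_len, vocab_size]
        probs = torch.softmax(logits / temperature, dim=-1)
        sampled = torch.multinomial(probs, 1).squeeze(-1)            # sample every position
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)   # confidence per position
        # (The paper's rescoring step would blend `conf` with scores from an
        #  external pre-trained model here; omitted in this sketch.)

        is_masked = tokens == mask_id
        sampled = torch.where(is_masked, sampled, tokens)            # keep already-committed tokens
        conf = torch.where(is_masked, conf, torch.ones_like(conf))

        # Cosine schedule: fraction of positions left masked for the next step.
        mask_ratio = math.cos(math.pi / 2.0 * (step + 1) / num_steps)
        num_to_mask = int(seq_len * mask_ratio)
        if num_to_mask == 0:
            return sampled                                           # everything committed
        remask = conf.argsort()[:num_to_mask]                        # lowest-confidence positions
        tokens = sampled.clone()
        tokens[remask] = mask_id
    return tokens
```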

Key Features and Model Enhancements

  1. Non-Autoregressive Architecture: MAGNeT operates over several streams of audio tokens in parallel, unlike conventional autoregressive models. This substantially increases generation speed, making the model suitable for low-latency applications.
  2. Span Masking: Because adjacent audio tokens share information, the model uses spans of tokens, rather than individual tokens, as the atomic unit for masking. The optimal span length was empirically determined to be approximately 60 milliseconds.
  3. Restricted Contextual Attention: The self-attention of the higher codebooks is restricted to a local temporal window, reducing unnecessary contextual dependencies. This restriction matches the local nature of the quantization error encoded by successive residual vector quantization (RVQ) stages; a sketch of both the span mask and this local attention mask follows this list.
  4. Rescoring Mechanism: During inference, an external pre-trained model recalibrates the confidence scores of MAGNeT's token predictions, improving the quality of the generated audio.
  5. Hybrid-MAGNeT: A hybrid variant generates the first few seconds of the sequence autoregressively and then switches to non-autoregressive decoding for the remainder, balancing the quality-latency trade-off.
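
The two structural ingredients above, span masking and the restricted attention window, can be illustrated with a short sketch. The span length, masking ratio, and window size below are illustrative placeholders, not the paper's exact settings; the paper's ablations settle on spans of roughly 60 ms, i.e. a few tokens at the codec's frame rate.

```python
import torch

def sample_span_mask(seq_len, mask_ratio, span_len=3):
    """Mask whole spans of span_len consecutive tokens rather than i.i.d. tokens.

    Simplified sketch: spans are placed uniformly at random until roughly
    mask_ratio of the positions are covered (the paper instead derives the
    number of spans analytically, accounting for overlaps).
    """
    mask = torch.zeros(seq_len, dtype=torch.bool)
    target = min(int(mask_ratio * seq_len), seq_len)
    while int(mask.sum()) < target:
        start = int(torch.randint(0, max(seq_len - span_len, 1), (1,)))
        mask[start:start + span_len] = True
    return mask  # True = masked position

def local_attention_mask(seq_len, window=5):
    """Boolean [seq_len, seq_len] mask where True means "may attend".

    Restricts each position to a +-window temporal neighbourhood; the paper
    applies such a restriction to the higher RVQ codebook levels only.
    """
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window
```

In practice the attention restriction would be applied as an additive bias inside the transformer's self-attention layers (e.g. negative infinity wherever the mask is False).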

Experimental Evaluation

The effectiveness of MAGNeT is validated through extensive empirical evaluation on both text-to-music and text-to-audio generation tasks. The evaluation encompasses objective metrics (Fréchet Audio Distance (FAD), KL divergence, and CLAP score) as well as subjective human studies.
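
As a reference for the first of these metrics, the Fréchet Audio Distance compares Gaussian statistics of embeddings extracted from reference and generated audio. A minimal sketch, assuming the embeddings (e.g. from a VGGish-style classifier) have already been computed:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two embedding sets (rows = examples, columns = features)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                 # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```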

Results

  • Performance: MAGNeT achieves objective scores comparable to, or slightly below, those of autoregressive models such as MusicGen and AudioGen, while drastically reducing inference time: it is approximately 7 times faster than the autoregressive baseline, making it well suited to real-time applications (a back-of-envelope step count follows this list).

  • Human Studies: Subjective evaluations indicate that MAGNeT's output is on par with autoregressive models in terms of overall quality and text relevance.

  • Ablation Studies: The paper quantifies the impact of span masking, restricted context, and classifier-free guidance annealing on model performance; span masking and the attention restriction each contribute significantly to audio generation quality.
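
The rough origin of the speed-up can be seen from a back-of-envelope count of model invocations, using illustrative settings (not the paper's exact configuration): a 50 Hz token rate, 4 RVQ codebooks, and an assumed decoding schedule of about 20 steps for the first codebook and 10 for each of the rest.

```python
# Back-of-envelope count of sequential model invocations for ~10 s of audio.
# All numbers are illustrative assumptions, not the paper's exact configuration.
frame_rate_hz, duration_s, n_codebooks = 50, 10, 4

ar_steps = frame_rate_hz * duration_s              # ~500 one-token-at-a-time steps
magnet_steps = 20 + (n_codebooks - 1) * 10         # ~50 parallel decoding steps

print(ar_steps, magnet_steps)                      # 500 vs 50 forward passes
# Each non-autoregressive pass re-encodes the whole sequence (no key-value cache),
# so the measured wall-clock speed-up (~7x) is smaller than the raw step ratio.
```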

Discussion and Future Work

The paper discusses the broader implications of adopting a non-autoregressive approach. The primary advantage is a substantial reduction in latency, which is crucial for interactive applications such as digital audio workstations. However, because the entire sequence is re-encoded at each decoding step, throughput is limited at large batch sizes. Future research directions could involve exploring caching strategies for non-autoregressive models to further optimize performance.

Conclusion

MAGNeT represents a significant stride in non-autoregressive audio generation. By combining span masking, restricted attention, and a rescoring mechanism, the model achieves high-quality audio generation with a notable reduction in latency. The hybrid variant further demonstrates flexibility in balancing the quality-efficiency trade-off. While some challenges remain, particularly around scaling to higher throughput, MAGNeT presents a promising approach for real-time audio generation and paves the way for further advances in non-autoregressive (non-left-to-right) decoding.

Contributions and Implications

The contributions of this paper are manifold:

  1. Introduction of the MAGNeT model, which replaces slow autoregressive decoding with a markedly more efficient non-autoregressive approach.
  2. Empirical validation across diverse audio generation tasks, substantiating the model's efficacy and potential.
  3. An exploration of the trade-offs between autoregressive and non-autoregressive modeling, laying a foundation for future research on optimizing non-autoregressive models.

Future Developments

Further investigations might consider advanced sampling and rescoring techniques, more efficient architectures, and extending the hybrid model’s capabilities. By continually refining these methods, the field of AI-driven audio generation stands to achieve new benchmarks in quality, efficiency, and practicality.

