Masked Audio Generation using a Single Non-Autoregressive Transformer (2401.04577v2)

Published 9 Jan 2024 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. To further enhance the quality of the generated audio, we introduce a novel rescoring method in which, we leverage an external pre-trained model to rescore and rank predictions from MAGNeT, which will be then used for later decoding steps. Lastly, we explore a hybrid version of MAGNeT, in which we fuse between autoregressive and non-autoregressive models to generate the first few seconds in an autoregressive manner while the rest of the sequence is being decoded in parallel. We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation and conduct an extensive empirical evaluation, considering both objective metrics and human studies. The proposed approach is comparable to the evaluated baselines, while being significantly faster (x7 faster than the autoregressive baseline). Through ablation studies and analysis, we shed light on the importance of each of the components comprising MAGNeT, together with pointing to the trade-offs between autoregressive and non-autoregressive modeling, considering latency, throughput, and generation quality. Samples are available on our demo page https://pages.cs.huji.ac.il/adiyoss-lab/MAGNeT.

Summary

  • The paper presents MAGNeT, a novel transformer model that uses iterative masked prediction to achieve a 7x faster inference speed than traditional autoregressive methods.
  • It employs a multi-stream discrete representation via EnCodec and incorporates restricted contextual attention along with an external rescoring mechanism to enhance audio quality.
  • Experimental results on text-to-music and text-to-audio tasks demonstrate that MAGNeT matches the quality of autoregressive models while significantly reducing latency for real-time applications.

Overview of "Masked Audio Generation using a Single Non-Autoregressive Transformer"

Introduction

The paper "Masked Audio Generation using a Single Non-Autoregressive Transformer" introduces MAGNeT—Masked Audio Generation using a Non-autoregressive Transformer. MAGNeT leverages advancements in self-supervised representation learning, sequence modeling, and audio synthesis to enhance high-quality conditional audio generation tasks, specifically text-to-music and text-to-audio generation. The key innovation lies in its non-autoregressive nature, significantly reducing inference time while maintaining competitive quality compared to autoregressive baselines.

Methodology

MAGNeT first transforms audio signals into a multi-stream discrete token representation via EnCodec; these tokens form the foundation of the generative process. A single non-autoregressive transformer is then trained with masked prediction over spans of tokens, and at inference time the output sequence is constructed iteratively: masked positions are predicted, the most confident predictions are committed, and the rest are re-masked over several decoding steps until the sequence is complete.
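
To make the decoding procedure concrete, the snippet below is a minimal sketch of MaskGIT-style iterative masked decoding for a single codebook stream, with a cosine masking schedule and a comment marking where rescoring would slot in. The `model` interface, number of decoding steps, and temperature are illustrative assumptions rather than the authors' implementation (MAGNeT additionally decodes the codebook streams sequentially and masks whole spans rather than individual tokens).

```python
import math
import torch

def masked_decode(model, cond, seq_len, vocab_size, num_steps=20, temperature=1.0):
    """Minimal MaskGIT-style iterative decoding sketch for one codebook stream.

    Assumption (not the paper's code): model(tokens, cond) returns logits of
    shape [seq_len, vocab_size] for a partially masked token sequence.
    """
    mask_id = vocab_size                                             # sentinel id for masked positions
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)       # start fully masked
    for step in range(num_steps):
        logits = model(tokens, cond)                                 # [seq_len, vocab_size]
        probs = torch.softmax(logits / temperature, dim=-1)
        sampled = torch.multinomial(probs, 1).squeeze(-1)            # sample every position
        conf = probs.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)   # confidence per position
        # (The paper's rescoring step would blend `conf` with scores from an
        #  external pre-trained model here; omitted in this sketch.)

        is_masked = tokens == mask_id
        sampled = torch.where(is_masked, sampled, tokens)            # keep already-committed tokens
        conf = torch.where(is_masked, conf, torch.ones_like(conf))

        # Cosine schedule: fraction of positions left masked for the next step.
        mask_ratio = math.cos(math.pi / 2.0 * (step + 1) / num_steps)
        num_to_mask = int(seq_len * mask_ratio)
        if num_to_mask == 0:
            return sampled                                           # everything committed
        remask = conf.argsort()[:num_to_mask]                        # lowest-confidence positions
        tokens = sampled.clone()
        tokens[remask] = mask_id
    return tokens
```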

Key Features and Model Enhancements

  1. Non-Autoregressive Architecture: MAGNeT operates over several streams of audio tokens in parallel, unlike conventional autoregressive models. This substantially increases generation speed, making the model suitable for low-latency applications.
  2. Span Masking: Because adjacent audio tokens share information, the model uses spans of tokens, rather than individual tokens, as the atomic unit for masking. The optimal span length was empirically determined to be approximately 60 milliseconds.
  3. Restricted Contextual Attention: The self-attention of the higher codebooks is restricted to a local temporal window, reducing unnecessary contextual dependencies. This restriction matches the local nature of the quantization error encoded by successive residual vector quantization (RVQ) stages; a sketch of both the span mask and this local attention mask follows this list.
  4. Rescoring Mechanism: During inference, an external pre-trained model recalibrates the confidence scores of MAGNeT's token predictions, improving the quality of the generated audio.
  5. Hybrid-MAGNeT: A hybrid variant generates the first few seconds of the sequence autoregressively and then switches to non-autoregressive decoding for the remainder, balancing the quality-latency trade-off.
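
The two structural ingredients above, span masking and the restricted attention window, can be illustrated with a short sketch. The span length, masking ratio, and window size below are illustrative placeholders, not the paper's exact settings; the paper's ablations settle on spans of roughly 60 ms, i.e. a few tokens at the codec's frame rate.

```python
import torch

def sample_span_mask(seq_len, mask_ratio, span_len=3):
    """Mask whole spans of span_len consecutive tokens rather than i.i.d. tokens.

    Simplified sketch: spans are placed uniformly at random until roughly
    mask_ratio of the positions are covered (the paper instead derives the
    number of spans analytically, accounting for overlaps).
    """
    mask = torch.zeros(seq_len, dtype=torch.bool)
    target = min(int(mask_ratio * seq_len), seq_len)
    while int(mask.sum()) < target:
        start = int(torch.randint(0, max(seq_len - span_len, 1), (1,)))
        mask[start:start + span_len] = True
    return mask  # True = masked position

def local_attention_mask(seq_len, window=5):
    """Boolean [seq_len, seq_len] mask where True means "may attend".

    Restricts each position to a +-window temporal neighbourhood; the paper
    applies such a restriction to the higher RVQ codebook levels only.
    """
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window
```

In practice the attention restriction would be applied as an additive bias inside the transformer's self-attention layers (e.g. negative infinity wherever the mask is False).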

Experimental Evaluation

The effectiveness of MAGNeT is validated through extensive empirical evaluation on both text-to-music and text-to-audio generation tasks. The evaluation encompasses objective metrics (Fréchet Audio Distance (FAD), KL divergence, and CLAP score) as well as subjective human studies.
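
As a reference for the first of these metrics, the Fréchet Audio Distance compares Gaussian statistics of embeddings extracted from reference and generated audio. A minimal sketch, assuming the embeddings (e.g. from a VGGish-style classifier) have already been computed:

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """FAD between two embedding sets (rows = examples, columns = features)."""
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(ref_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):                 # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```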

Results

  • Performance: MAGNeT achieves objective scores comparable to, or slightly below, those of autoregressive models such as MusicGen and AudioGen, while drastically reducing inference time: it is approximately 7 times faster than the autoregressive baseline, making it well suited to real-time applications (a back-of-envelope step count follows this list).

  • Human Studies: Subjective evaluations indicate that MAGNeT's output is on par with autoregressive models in terms of overall quality and text relevance.

  • Ablation Studies: The paper quantifies the impact of span masking, restricted context, and classifier-free guidance annealing on model performance; span masking and the attention restriction each contribute significantly to audio generation quality.
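
The rough origin of the speed-up can be seen from a back-of-envelope count of model invocations, using illustrative settings (not the paper's exact configuration): a 50 Hz token rate, 4 RVQ codebooks, and an assumed decoding schedule of about 20 steps for the first codebook and 10 for each of the rest.

```python
# Back-of-envelope count of sequential model invocations for ~10 s of audio.
# All numbers are illustrative assumptions, not the paper's exact configuration.
frame_rate_hz, duration_s, n_codebooks = 50, 10, 4

ar_steps = frame_rate_hz * duration_s              # ~500 one-token-at-a-time steps
magnet_steps = 20 + (n_codebooks - 1) * 10         # ~50 parallel decoding steps

print(ar_steps, magnet_steps)                      # 500 vs 50 forward passes
# Each non-autoregressive pass re-encodes the whole sequence (no key-value cache),
# so the measured wall-clock speed-up (~7x) is smaller than the raw step ratio.
```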

Discussion and Future Work

The paper discusses the broader implications of adopting a non-autoregressive approach. The primary advantage is a substantial reduction in latency, which is crucial for interactive applications such as digital audio workstations. However, because the entire sequence is re-encoded at each decoding step, throughput is limited at large batch sizes. Future research directions could involve exploring caching strategies for non-autoregressive models to further optimize performance.

Conclusion

MAGNeT represents a significant stride in non-autoregressive audio generation. By combining span masking, restricted attention, and a rescoring mechanism, the model achieves high-quality audio generation with a notable reduction in latency. The hybrid variant further demonstrates flexibility in balancing the quality-efficiency trade-off. While some challenges remain, particularly around scaling to higher throughput, MAGNeT presents a promising approach for real-time audio generation and paves the way for further advances in non-autoregressive (non-left-to-right) decoding.

Contributions and Implications

The contributions of this paper are manifold:

  1. Introduction of the MAGNeT model, which replaces slow autoregressive decoding with a markedly more efficient non-autoregressive approach.
  2. Empirical validation across diverse audio generation tasks, substantiating the model's efficacy and potential.
  3. An exploration of the trade-offs between autoregressive and non-autoregressive modeling, laying a foundation for future research on optimizing non-autoregressive models.

Future Developments

Further investigations might consider advanced sampling and rescoring techniques, more efficient architectures, and extending the hybrid model’s capabilities. By continually refining these methods, the field of AI-driven audio generation stands to achieve new benchmarks in quality, efficiency, and practicality.

