- The paper introduces SaShiMi, an S4-based architecture whose modified parameterization guarantees stable autoregressive generation.
- It employs a multi-scale design with pooling layers to capture long-range dependencies and improve computational efficiency.
- The architecture achieves state-of-the-art results in music and speech generation, outperforming traditional models like WaveNet.
Overview of "It's Raw! Audio Generation with State-Space Models"
This paper addresses the challenge of modeling raw audio waveforms, where high sampling rates and very long sequences demand both computational efficiency and global coherence in generation. Traditional architectures such as CNNs and RNNs are either inefficient to train or limited in the amount of context they can capture at these sequence lengths. The authors introduce SaShiMi, a new architecture that leverages recent advances in state-space models (SSMs) to overcome these limitations.
SaShiMi Architecture
SaShiMi is built on the S4 model, known for its effectiveness in long-sequence modeling. The architecture uses a multi-scale design that progressively downsamples input sequences through pooling layers and restores them at the output, so that features are modeled coherently across multiple resolutions. The authors also modify S4's parameterization so that it remains numerically stable when the model is run in its recurrent mode for autoregressive (AR) waveform generation.
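As a concrete illustration, the following PyTorch sketch shows the shape of the multi-scale backbone: stacks of residual blocks at each resolution, a pooling layer that folds consecutive timesteps into the channel dimension on the way down, and an inverse up-pooling plus a skip connection on the way back up. This is a minimal sketch rather than the authors' implementation: `S4Block` is only a placeholder (a causal depthwise convolution stands in for a real S4 layer), and all sizes are illustrative rather than the paper's settings.

```python
# Minimal sketch of a SaShiMi-style multi-scale backbone (illustrative only).
import torch
import torch.nn as nn


class S4Block(nn.Module):
    """Placeholder residual block; a real SaShiMi uses an S4 SSM layer here."""
    def __init__(self, d_model):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Causal depthwise conv stands in for the S4 sequence mixer.
        self.mixer = nn.Conv1d(d_model, d_model, kernel_size=3,
                               padding=2, groups=d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.GELU(),
                                nn.Linear(2 * d_model, d_model))

    def forward(self, x):                       # x: (batch, length, d_model)
        y = self.norm(x)
        y = self.mixer(y.transpose(1, 2))[..., :x.size(1)].transpose(1, 2)
        return x + self.ff(y)                   # residual connection


class DownPool(nn.Module):
    """Pool by folding `pool` consecutive timesteps into channels, then project."""
    def __init__(self, d_in, d_out, pool=4):
        super().__init__()
        self.pool = pool
        self.proj = nn.Linear(d_in * pool, d_out)

    def forward(self, x):                       # (B, L, d_in) -> (B, L/pool, d_out)
        B, L, D = x.shape                       # assumes L divisible by pool
        return self.proj(x.reshape(B, L // self.pool, D * self.pool))


class UpPool(nn.Module):
    """Inverse of DownPool: project, then unfold channels back into time."""
    def __init__(self, d_in, d_out, pool=4):
        super().__init__()
        self.pool = pool
        self.proj = nn.Linear(d_in, d_out * pool)

    def forward(self, x):                       # (B, L, d_in) -> (B, L*pool, d_out)
        B, L, _ = x.shape
        return self.proj(x).reshape(B, L * self.pool, -1)


class Sashimi(nn.Module):
    """Two-level multi-scale stack: blocks at each resolution, down-pooling
    on the way in, up-pooling plus a skip connection on the way out."""
    def __init__(self, d_model=64, n_blocks=2, pool=4, expand=2):
        super().__init__()
        d_mid = d_model * expand
        self.down_blocks = nn.ModuleList(S4Block(d_model) for _ in range(n_blocks))
        self.down_pool = DownPool(d_model, d_mid, pool)
        self.center_blocks = nn.ModuleList(S4Block(d_mid) for _ in range(n_blocks))
        self.up_pool = UpPool(d_mid, d_model, pool)
        self.up_blocks = nn.ModuleList(S4Block(d_model) for _ in range(n_blocks))

    def forward(self, x):                       # x: (batch, length, d_model)
        for blk in self.down_blocks:
            x = blk(x)
        skip = x                                # skip at the fine resolution
        z = self.down_pool(x)
        for blk in self.center_blocks:
            z = blk(z)
        x = self.up_pool(z) + skip              # restore resolution, add skip
        for blk in self.up_blocks:
            x = blk(x)
        return x


x = torch.randn(2, 1024, 64)                    # (batch, samples, channels)
print(Sashimi()(x).shape)                       # torch.Size([2, 1024, 64])
```

Because the coarse levels operate on pooled sequences, most computation happens at reduced length, which is where the efficiency gains of the multi-scale design come from.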
Key Contributions
- Stability Enhancement: S4's parameterization is refined so that the state matrix is provably Hurwitz (all eigenvalues have negative real part), which keeps the recurrence stable during autoregressive generation; see the sketch after this list.
- Multi-Scale Framework: SaShiMi includes pooling layers to capture information across different time scales, improving both performance and efficiency.
- AR and Non-AR Versatility: While excelling in AR tasks, SaShiMi also proves effective as a backbone for diffusion models, showcasing its adaptability.
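Below is a minimal sketch (in PyTorch, with illustrative names; not the authors' code) of the stability idea behind the first contribution: store the real part of the diagonal term in log-space so it is always negative, and tie the two low-rank factors (q = p) so the low-rank correction is negative semidefinite. Together these guarantee that every eigenvalue of A = Λ - pp* has negative real part, i.e. A is Hurwitz.

```python
# Sketch of a Hurwitz-constrained diagonal-plus-low-rank state matrix.
import torch
import torch.nn as nn


class HurwitzDPLR(nn.Module):
    def __init__(self, state_size):
        super().__init__()
        # Real part stored in log-space: Re(Lambda) = -exp(log_neg_real) < 0.
        self.log_neg_real = nn.Parameter(torch.zeros(state_size))
        self.imag = nn.Parameter(torch.randn(state_size))
        # Single low-rank factor p (tied: q = p), so the correction -p p* is
        # negative semidefinite and cannot push eigenvalues into the right half-plane.
        self.p = nn.Parameter(torch.randn(state_size, dtype=torch.cfloat))

    def state_matrix(self):
        lam = -torch.exp(self.log_neg_real) + 1j * self.imag   # Re(lam) < 0
        return torch.diag(lam) - torch.outer(self.p, self.p.conj())


A = HurwitzDPLR(8).state_matrix()
print(torch.linalg.eigvals(A).real.max())       # strictly negative -> stable
```

A real S4 layer never materializes this dense matrix; the sketch only illustrates why the constraint keeps the recurrent (autoregressive) mode from diverging.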
Performance Evaluation
SaShiMi achieves state-of-the-art results in unconditional waveform generation, with clear improvements in human-rated musicality and coherence over prior architectures such as WaveNet. Quantitatively, the model attains better negative log-likelihood (NLL) while using fewer parameters, and it trains and generates faster.
- Music Generation: On the Beethoven and YouTubeMix datasets, SaShiMi outperforms the baselines, markedly improving mean opinion scores (MOS) for musicality and converging faster during training.
- Speech Generation: On the SC09 dataset, SaShiMi generates intelligible and coherent speech. Used as the backbone of the DiffWave diffusion model, it sets new state-of-the-art scores on several metrics, showing that it can also improve existing non-autoregressive models.
Implications and Future Directions
The proposed architecture is a promising alternative to conventional models like WaveNet, offering efficient training and generation. SaShiMi's ability to model audio across extended contexts could influence various applications in audio synthesis, speech recognition, and generative music. Future explorations could involve broader applications of SSMs in multimodal tasks or more complex generative scenarios.
This paper advances the state-of-the-art in raw audio modeling by introducing a stable, efficient, and versatile architecture, with SaShiMi potentially setting a new standard for audio waveform generation across research areas.