- The paper introduces a continuous autoregressive VAE with a GMM latent space, reducing parameter count while enhancing synthesis quality.
- It integrates a stochastic monotonic alignment mechanism that ensures precise temporal coherence between encoding and decoding steps.
- Empirical results show higher MOS and lower WER than state-of-the-art models, demonstrating efficient and robust performance.
Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis
The paper presents a novel approach to speech synthesis through autoregressive modeling, introducing several advances over existing methods. The authors combine a variational autoencoder (VAE) with a Gaussian mixture model (GMM) latent space and an autoregressive model for generating speech. This approach diverges from the traditional reliance on vector quantization, leveraging continuous latent representations instead. The aim is to streamline the synthesis process by reducing both complexity and computational demand while maintaining or improving output quality.
Methodological Advancements
The key innovation in this work is the replacement of residual vector quantization with a continuous speech representation learned by a VAE constrained by a GMM prior. The GMM-VAE compresses speech into multi-modal latent distributions, enabling efficient compression without the quantization artifacts common in previous models. The autoregressive framework, whose output distribution is also a GMM, captures the sequential dependencies in the latent space to synthesize high-quality speech from text inputs.
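To make the GMM output distribution concrete, here is a minimal sketch of evaluating the negative log-likelihood of a continuous latent vector under a diagonal-covariance Gaussian mixture — the kind of density an autoregressive GMM head would place over the next latent frame. The function name and the diagonal-covariance assumption are illustrative, not taken from the paper.

```python
import numpy as np

def gmm_nll(z, weights, means, log_stds):
    """Negative log-likelihood of a latent vector z under a
    diagonal-covariance Gaussian mixture.

    weights: (K,) mixture weights summing to 1
    means, log_stds: (K, D) per-component parameters
    """
    z = np.asarray(z, dtype=float)
    # Per-component diagonal Gaussian log-density, summed over dimensions.
    log_probs = -0.5 * (((z - means) / np.exp(log_stds)) ** 2
                        + 2 * log_stds + np.log(2 * np.pi)).sum(axis=1)
    # Numerically stable log-sum-exp over the K components.
    a = np.log(weights) + log_probs
    m = a.max()
    return -(m + np.log(np.exp(a - m).sum()))
```

Training the autoregressive model then amounts to minimizing this quantity over predicted mixture parameters at each step, rather than a cross-entropy over discrete codebook indices.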
A significant aspect of this paper is the introduction of a stochastic monotonic alignment mechanism. This mechanism enforces a strict monotonic correspondence between encoding and decoding steps, which is essential for maintaining temporal coherence in speech synthesis. Alignment decisions are sampled from a Bernoulli distribution, with a Gumbel-Softmax relaxation keeping the sampling differentiable, which promotes stable training and robust performance across diverse datasets.
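The mechanism can be illustrated with a toy sketch: at each output frame, a relaxed Bernoulli sample (the binary Gumbel-Softmax, or "Concrete", relaxation) decides whether the read position advances to the next text token, which keeps the alignment monotonic while remaining differentiable during training. The function names and the greedy hard-thresholding below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def relaxed_bernoulli(p, tau=0.5):
    """Binary Gumbel-Softmax (Concrete) relaxation of Bernoulli(p):
    sigmoid((logit(p) + logistic noise) / tau). As tau -> 0 the sample
    approaches a hard 0/1 decision while staying differentiable in p."""
    u = rng.uniform(1e-6, 1 - 1e-6)
    logistic_noise = np.log(u) - np.log(1 - u)
    logit = np.log(p) - np.log(1 - p)
    return 1.0 / (1.0 + np.exp(-(logit + logistic_noise) / tau))

def monotonic_alignment(advance_probs, n_text, n_frames, tau=0.5):
    """Greedy monotonic pass: at each output frame a relaxed Bernoulli
    decides whether the read head advances to the next text token.
    Hard thresholding is used here for illustration; during training the
    soft sample would carry gradients instead."""
    pos, path = 0, []
    for t in range(n_frames):
        if pos < n_text - 1 and relaxed_bernoulli(advance_probs[t], tau) > 0.5:
            pos += 1  # advance to the next text token
        path.append(pos)  # the head never moves backward
    return path
```

Because the read position can only stay or advance by one, any sampled path is monotonic by construction, which is what ties each decoded frame to a coherent span of the input text.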
Numerical Results and Implications
Empirically, the model surpasses the performance of the state-of-the-art autoregressive model VALL-E in both subjective (MOS scores) and objective (WER) evaluations. Notably, these improvements are achieved using only 10.3% of VALL-E's parameters, underscoring the computational efficiency of the method. The reduction in parameters, accompanied by improved performance metrics, indicates potential for wider application, particularly in resource-constrained environments where computational efficiency is paramount.
The paper's claims have substantial implications for the field of speech synthesis. By demonstrating superior speech quality with fewer resources, the proposed model challenges the established norm that increased complexity and parameter count are necessary for better performance. Furthermore, abandoning vector quantization alleviates the need for extensive codebook management, simplifying the implementation and scaling process.
Future Directions and Theoretical Implications
The development of continuous latent-space models for speech synthesis suggests promising directions for future research. By designing well-structured latent spaces and modeling sequence dependencies efficiently, further reductions in model size and computational cost can be pursued without compromising output fidelity. Such models could also be extended beyond speech synthesis to other domains requiring nuanced sequence generation, such as music composition and video synthesis.
Theoretically, this work contributes to the literature on autoregressive modeling by offering a robust alternative to discrete token sequences, which become increasingly difficult to model as codebooks and sequence lengths grow. Continuous models also hold potential for integration with emerging generative frameworks, such as diffusion models and flow-based approaches, to further enhance automated content creation systems.
In summary, the paper provides a compelling case for re-evaluating the architectural choices in speech synthesis, particularly emphasizing the benefits of continuous autoregressive modeling and stochastic monotonic alignment. As AI continues to expand its presence in creative and communicative technologies, innovations like those presented here will be critical in shaping future developments.