Overview of "MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer"
In the field of text-to-speech (TTS) system design, the paper "MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer" presents a novel model architecture termed MaskGCT. This fully non-autoregressive TTS model eliminates the need for explicit text-speech alignment supervision and phone-level duration prediction, requirements that often limit the robustness and naturalness of prior systems. The research introduces a two-stage framework trained with a mask-and-predict paradigm: the model learns to predict masked tokens conditioned on the unmasked context, and at inference it fills in tokens of a specified length in parallel, enabling high-quality zero-shot speech synthesis.
Model Architecture and Methodology
MaskGCT applies masked generative transformers to the TTS domain. The system comprises two principal stages:
- Text-to-Semantic (T2S) Conversion: The first-stage model predicts semantic tokens from input text. These tokens are derived by quantizing features from a self-supervised learning (SSL) speech model and carry rich, content-bearing information. The T2S model generates semantic tokens of a specified total length through iterative parallel decoding, conditioned on both the text sequence and prompt semantic tokens used for in-context learning.
- Semantic-to-Acoustic (S2A) Conversion: The second-stage model predicts acoustic tokens conditioned on the semantic tokens and an acoustic prompt. Because the speech codec produces multi-layer (residual vector-quantized) token sequences, the S2A model generates the layers one at a time to preserve fine-grained speech characteristics. This stage bridges semantic representations and concrete acoustic outputs without explicit duration supervision; a sketch of the full pipeline follows this list.
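To make the two-stage flow concrete, here is a minimal sketch of inference under this design. All names (`t2s_model`, `s2a_model`, `codec_decoder`, and their `generate`/`decode` signatures) are hypothetical placeholders, not the paper's actual API:

```python
# Illustrative sketch of a MaskGCT-style two-stage inference pipeline.
# Every class and method name below is a hypothetical placeholder.

def synthesize(text_tokens, prompt_semantic, prompt_acoustic,
               t2s_model, s2a_model, codec_decoder, target_len):
    # Stage 1 (T2S): fill in `target_len` masked semantic positions,
    # conditioned on the text and the prompt's semantic tokens.
    # The total length is specified up front, so no phone-level
    # duration predictor is needed.
    semantic = t2s_model.generate(
        text=text_tokens,
        prompt=prompt_semantic,
        num_tokens=target_len,
    )

    # Stage 2 (S2A): generate the multi-layer acoustic token sequence
    # layer by layer, each layer conditioned on the semantic tokens,
    # the acoustic prompt, and the previously generated layers.
    acoustic = s2a_model.generate(
        semantic=semantic,
        prompt=prompt_acoustic,
    )

    # A codec decoder reconstructs the waveform from acoustic tokens.
    return codec_decoder.decode(acoustic)
```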
A distinguishing factor of MaskGCT is its adaptation of the non-autoregressive masked generative paradigm, popularized in image and video generation (e.g., MaskGIT), to the audio domain and specifically to TTS. This sidesteps common autoregressive drawbacks such as error accumulation, robustness issues like word skipping or repetition, and slow token-by-token inference.
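The mask-and-predict decoding loop at the core of this paradigm can be sketched as follows. This is a minimal MaskGIT-style illustration, not the paper's exact procedure: the `model` interface, `mask_id`, greedy sampling, and cosine schedule are all assumptions made for readability.

```python
import math
import torch

def iterative_parallel_decode(model, cond, length, steps=20, mask_id=0):
    """MaskGIT-style mask-and-predict decoding (illustrative sketch).

    Assumes `model(tokens, cond)` returns per-position logits of
    shape (length, vocab_size) and that `mask_id` marks masked slots.
    """
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for step in range(steps):
        masked = tokens == mask_id                 # positions still open
        probs = model(tokens, cond).softmax(dim=-1)
        conf, pred = probs.max(dim=-1)             # greedy, for brevity

        # Cosine schedule: how many positions stay masked after this step.
        keep_masked = int(length * math.cos(math.pi / 2 * (step + 1) / steps))

        # Commit a prediction into every currently masked slot ...
        tokens = torch.where(masked, pred, tokens)
        if keep_masked > 0:
            # ... then re-mask the least confident of the newly filled
            # slots; already committed tokens are never revisited.
            conf = conf.masked_fill(~masked, float("inf"))
            remask = conf.topk(keep_masked, largest=False).indices
            tokens[remask] = mask_id
    return tokens
```

In a two-stage system of this kind, such a loop would run once for the T2S stage and once per codec layer in the S2A stage; practical variants typically sample tokens and add noise to the confidence scores rather than taking the greedy argmax shown here.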
Experimental Evaluation
Notably, MaskGCT is trained on 100K hours of in-the-wild speech data and evaluated against state-of-the-art zero-shot TTS systems, achieving superior performance. The evaluation covers speech quality, similarity to the prompt speech, and intelligibility, using both objective and subjective metrics. Highlights include:
- Achieving human-level similarity and naturalness, outperforming other models on multilingual benchmarks.
- Demonstrating adaptability to speech translation, content editing, and style-imitation tasks, further underscoring MaskGCT's versatility as a foundational model.
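As a concrete example of the objective side of such evaluations, similarity to the prompt speaker is commonly scored as the cosine similarity between speaker embeddings of the prompt and the synthesized speech. The sketch below assumes a generic pretrained speaker-embedding extractor; the specific scoring model is a placeholder, not one fixed by the paper summary above:

```python
import torch.nn.functional as F

def speaker_similarity(prompt_wave, generated_wave, embed_fn):
    """Cosine similarity between speaker embeddings, a common
    objective proxy for similarity to the prompt speaker.

    `embed_fn` stands in for any pretrained speaker-verification
    embedding model; its choice here is an assumption.
    """
    e_prompt = embed_fn(prompt_wave)     # (D,) prompt embedding
    e_output = embed_fn(generated_wave)  # (D,) synthesized-speech embedding
    return F.cosine_similarity(e_prompt, e_output, dim=0).item()
```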
Key Implications
From a theoretical vantage, the research challenges the conventional reliance on explicit alignment supervision and autoregressive modeling in TTS. Practically, it points toward more efficient, robust, and versatile TTS systems capable of zero-shot adaptation across diverse linguistic and emotive contexts.
Furthermore, by training on multilingual data, MaskGCT broadens potential applications in cross-lingual speech translation and other emerging speech-based AI tasks. This adaptability marks a step toward more comprehensive audio AI models.
Future Prospects
The exploration of masked generative transformers within TTS opens pathways for future research, particularly in improving model efficiency and broadening the diversity of speech outputs toward finer-grained voice control. Potential directions include refining in-context learning capacities, scaling training data and model parameters, and developing more advanced speech-editing capabilities.
In conclusion, MaskGCT represents a significant stride in TTS research, providing a non-autoregressive alternative that addresses previous limitations and expands applicability within AI-driven speech synthesis.