VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild (2403.16973v3)

Published 25 Mar 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SotA models including VALL-E and the popular commercial model XTTS-v2. Crucially, the models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high quality, challenging, and realistic dataset named RealEdit. We encourage readers to listen to the demos at https://jasonppy.github.io/VoiceCraft_web.

References (64)
  1. CM3: A causal masked multimodal model of the internet. arXiv, abs/2201.07520.
  2. MusicLM: Generating music from text. arXiv, abs/2301.11325.
  3. A3T: Alignment-aware acoustic and text pretraining for speech synthesis and editing. In International Conference on Machine Learning.
  4. Mathieu Bernard and Hadrien Titeux. 2021. Phonemizer: Text to phones transcription for multiple languages in Python. Journal of Open Source Software, 6(68):3958.
  5. AudioLM: A language modeling approach to audio generation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 31:2523–2533.
  6. SpeechPainter: Text-conditioned speech inpainting. In Interspeech.
  7. SoundStorm: Efficient parallel audio generation. arXiv, abs/2305.09636.
  8. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In International Conference on Machine Learning.
  9. WavMark: Watermarking for audio generation. arXiv, abs/2308.12770.
  10. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Proc. Interspeech 2021.
  11. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16:1505–1518.
  12. 100,000 podcasts: A spoken English document corpus. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5903–5917, Barcelona, Spain (Online). International Committee on Computational Linguistics.
  13. Simple and controllable music generation. arXiv, abs/2306.05284.
  14. Coqui. 2023. XTTS-v2. https://huggingface.co/coqui/XTTS-v2.
  15. High fidelity neural audio compression. arXiv, abs/2210.13438.
  16. SingSong: Generating musical accompaniments from singing. arXiv, abs/2301.12662.
  17. UniCATS: A unified context-aware text-to-speech framework with contextual VQ-diffusion and vocoding. In AAAI.
  18. VALL-T: Decoder-only generative transducer for robust and decoding-controllable text-to-speech.
  19. VampNet: Music generation via masked acoustic token modeling. arXiv, abs/2307.04686.
  20. PromptTTS: Controllable text-to-speech with text descriptions. In ICASSP 2023, pages 1–5.
  21. The curious case of neural text degeneration. In International Conference on Learning Representations.
  22. Text-free image-to-speech synthesis using learned segmental units. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021.
  23. Keith Ito and Linda Johnson. 2017. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/.
  24. TextrolSpeech: A text style control speech corpus with codec language text-to-speech models. arXiv, abs/2308.14430.
  25. Mega-TTS: Zero-shot text-to-speech at scale with intrinsic inductive bias. arXiv, abs/2306.03509.
  26. FluentSpeech: Stutter-oriented automatic speech editing with context-aware diffusion models. In Annual Meeting of the Association for Computational Linguistics.
  27. VoCo: Text-based insertion and replacement in audio narration. In International Conference on Computer Graphics and Interactive Techniques.
  28. Text-free prosody-aware generative spoken language modeling. arXiv, abs/2109.03264.
  29. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. Transactions of the Association for Computational Linguistics, 11:1703–1718.
  30. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. arXiv, abs/2106.06103.
  31. Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
  32. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv, abs/2010.05646.
  33. AudioGen: Textually guided audio generation. arXiv, abs/2209.15352.
  34. Robert F. Kubichek. 1993. Mel-cepstral distance measure for objective speech quality assessment. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, 1:125–128.
  35. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
  36. Voicebox: Text-guided multilingual universal speech generation at scale. arXiv, abs/2306.15687.
  37. Feiteng Li. 2023. An unofficial PyTorch implementation of VALL-E. https://github.com/lifeiteng/vall-e.
  38. PromptStyle: Controllable style transfer for text-to-speech with natural language descriptions. arXiv, abs/2305.19522.
  39. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In International Conference on Learning Representations.
  40. Daniel Lyth and Simon King. 2024. Natural language guidance of high-fidelity text-to-speech with synthetic annotations. arXiv, abs/2402.01912.
  41. Matthias Mauch and Simon Dixon. 2014. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In ICASSP 2014, pages 659–663.
  42. librosa: Audio and music signal analysis in Python. In SciPy.
  43. Context-aware prosody correction for text-based speech editing. In ICASSP 2021, pages 7038–7042.
  44. Generative spoken dialogue language modeling. Transactions of the Association for Computational Linguistics, 11:250–266.
  45. Robust speech recognition via large-scale weak supervision. arXiv, abs/2212.04356.
  46. Language models are unsupervised multitask learners.
  47. Proactive detection of voice cloning with localized watermarking. arXiv, abs/2401.17264.
  48. ELLA-V: Stable neural codec language modeling with alignment-guided sequence reordering.
  49. EditSpeech: A text based speech editing system using partial inference and bidirectional fusion. In IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) 2021, pages 626–633.
  50. Neural discrete representation learning. arXiv, abs/1711.00937.
  51. Attention is all you need. In Neural Information Processing Systems.
  52. Neural codec language models are zero-shot text to speech synthesizers. arXiv, abs/2301.02111.
  53. Context-aware mask prediction network for end-to-end text-based speech editing. In ICASSP 2022, pages 6082–6086.
  54. VioLA: Unified codec language models for speech recognition, synthesis, and translation. arXiv, abs/2305.16107.
  55. SpeechX: Neural codec language model as a versatile speech transformer. arXiv, abs/2308.06873.
  56. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92).
  57. ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection. arXiv, abs/2109.00537.
  58. InstructTTS: Modelling expressive TTS in discrete latent space with natural language style prompt. arXiv, abs/2301.13662.
  59. Zipformer: A faster and better encoder for automatic speech recognition. In ICLR.
  60. RetrieverTTS: Modeling decomposed factors for text-based speech insertion. In Interspeech.
  61. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507.
  62. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In Interspeech.
  63. One-class learning towards synthetic voice spoofing detection. IEEE Signal Processing Letters, 28:937–941.
  64. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv, abs/2303.03926.

Summary

  • The paper introduces VoiceCraft, a novel token infilling neural codec language model for zero-shot speech editing and TTS.
  • It employs a Transformer decoder architecture with a token rearrangement procedure, combining causal masking and delayed stacking, that enables generation within an existing sequence.
  • VoiceCraft achieves state-of-the-art results on the RealEdit speech editing dataset: listeners prefer its edited speech over the original recordings 48% of the time, and it attains lower WER than competing models.

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

The paper "VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild" focuses on advancing neural codec LLMs (NCLMs) for performing state-of-the-art speech editing and zero-shot text-to-speech (TTS) in a variety of challenging real-world conditions. It specifically targets applications involving diverse accents, speaking styles, and intricate background conditions, utilizing a novel token infilling model to achieve its objectives.

Model Architecture and Methodology

VoiceCraft is a token infilling neural codec language model built on a Transformer decoder architecture. Its core innovation is a token rearrangement procedure comprising two steps: causal masking and delayed stacking.

  1. Causal Masking: Spans to be edited are replaced with mask tokens and moved to the end of the sequence, so that a strictly left-to-right autoregressive decoder can condition on both the left and right context of the edit before generating the masked content. This rearrangement is what makes infilling possible for speech editing and zero-shot TTS.

    Figure 1: An example of the token rearrangement procedure and modeling framework. The rearrangement procedure involves two steps: (1) Causal masking, where masked spans are replaced with mask tokens and moved to the end, and (2) Delayed stacking, where tokens are shifted in the time dimension based on their codebook index.

  2. Delayed Stacking: Each time step carries one token per codebook, and tokens from codebook k are shifted k steps later in the time dimension, so that the prediction of codebook k at a given step can condition on codebook k-1 generated at the previous step. A toy implementation of both steps is sketched below.
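
The following is a minimal NumPy sketch of the two rearrangement steps on a toy token grid with T time steps and K codebooks. The PAD filler and the integer ids standing in for the mask and end-of-sequence tokens are placeholders for illustration, not values from the paper.

```python
import numpy as np

PAD, MASK, EOS = -1, 1000, 1001  # placeholder ids; the paper uses learned special tokens

def causal_mask_rearrange(tokens, span):
    """Step 1 (causal masking): mark the edited span with a mask token and
    move its contents to the end, so a left-to-right decoder sees the full
    surrounding context before infilling. tokens: (T, K); span: (start, end)."""
    s, e = span
    mask_row = np.full((1, tokens.shape[1]), MASK)
    eos_row = np.full((1, tokens.shape[1]), EOS)
    context = np.concatenate([tokens[:s], mask_row, tokens[e:]])  # span removed
    infill = np.concatenate([mask_row, tokens[s:e], eos_row])     # span at the end
    return np.concatenate([context, infill])

def delay_stack(tokens):
    """Step 2 (delayed stacking): shift codebook k down by k time steps, so
    codebook k at one output step follows codebook k-1 of the previous step."""
    T, K = tokens.shape
    out = np.full((T + K - 1, K), PAD)
    for k in range(K):
        out[k:k + T, k] = tokens[:, k]
    return out

# Example: 6 time steps, 4 codebooks; edit time steps 2 and 3.
tokens = np.arange(24).reshape(6, 4)
rearranged = delay_stack(causal_mask_rearrange(tokens, (2, 4)))
```

Chaining the two functions, as in the last line, produces the rearranged grid that an autoregressive decoder would then consume step by step.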

The model is trained autoregressively on the rearranged sequences, using a multi-codebook cross-entropy loss that weights the earlier codebooks, which encode the coarser structure of the signal, more heavily than the later ones.
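
As a concrete illustration of that weighting, here is a minimal PyTorch sketch of a weighted multi-codebook loss. The specific weight values below are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn.functional as F

# Illustrative per-codebook weights (earlier codebooks weighted more
# heavily); the exact values are an assumption, not the paper's.
CODEBOOK_WEIGHTS = (1.0, 0.8, 0.6, 0.4)

def multi_codebook_loss(logits, targets):
    """logits: list of K tensors of shape (B, T, V), one per codebook.
    targets: (B, T, K) integer codec-token ids."""
    loss = torch.tensor(0.0)
    for k, w in enumerate(CODEBOOK_WEIGHTS):
        # cross_entropy expects the class dimension second: (B, V, T)
        loss = loss + w * F.cross_entropy(logits[k].transpose(1, 2), targets[..., k])
    return loss
```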

Evaluation and Dataset

The paper introduces a novel dataset called RealEdit, designed to evaluate the practicality and robustness of speech editing models. RealEdit includes 310 speech editing examples constructed from a diverse range of sources such as audiobooks, YouTube videos, and Spotify podcasts, making it a highly representative and challenging dataset.

Speech Editing Performance

On RealEdit, VoiceCraft improves significantly over prior state-of-the-art models. Human evaluations show its edited speech is nearly indistinguishable from the original recordings in naturalness: in side-by-side comparisons, listeners prefer VoiceCraft's edited speech over the original unedited speech 48% of the time.

Figure 2: Speech editing with VoiceCraft. Human listeners prefer VoiceCraft edited speech over the original real recording 48% of the time in side-by-side naturalness comparison.

Zero-Shot Text-to-Speech Synthesis

VoiceCraft's capability extends to zero-shot TTS, where it outperforms competing state-of-the-art models such as VALL-E and the commercial XTTS-v2. The model demonstrates superior performance on objective metrics such as WER, as well as on human-rated naturalness and intelligibility, without any fine-tuning for voices unseen during training.
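
For reference, WER is the word-level edit distance between the ASR transcript of the generated speech and the target text, normalized by the reference length: WER = (S + D + I) / N. A self-contained implementation of the standard metric (not the paper's evaluation code) looks like this:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate via word-level Levenshtein distance:
    (substitutions + deletions + insertions) / reference word count."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(r)][len(h)] / len(r)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ≈ 0.167
```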

Experimental Results and Comparative Analysis

The paper presents thorough experimental results highlighting VoiceCraft's strengths in both speech editing and zero-shot TTS. Mean Opinion Score (MOS) evaluations, alongside automatic metrics, underline the superiority of VoiceCraft across scenarios and editing types.

Figure 3: Breakdown of side-by-side human preference on naturalness comparing VoiceCraft and FluentSpeech on speech editing, grouped by edit type (left) and edit span length (right).

Implementation Considerations and Challenges

The computational demands of training such a model are managed with optimized training schedules and loss-weighting strategies that balance intelligibility against prosody. Failure modes such as occasional long silences and misalignments between generated speech and target text are addressed, but the paper acknowledges the need for further refinement in these areas.

Conclusion

VoiceCraft advances the field of NCLMs by providing a robust, high-quality solution for speech editing and zero-shot TTS, validated by extensive empirical data on diverse datasets. Its innovative token rearrangement methodology offers a potent toolset for enhancing speech synthesis technologies across numerous applications.

Implications and Future Directions

The paper provides a critical foundation for future research to explore more seamless integrations of speech synthesis and editing technologies while stressing the importance of ethical considerations, such as misuse through voice cloning. Future developments could focus on refining VoiceCraft's inference capabilities and expanding its applicability to additional languages and styles.
