F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (2410.06885v2)

Published 9 Oct 2024 in eess.AS and cs.SD

Abstract: This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as duration model, text encoder, and phoneme alignment, the text input is simply padded with filler tokens to the same length as input speech, and then the denoising is performed for speech generation, which was originally proved feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow step can be easily applied to existing flow matching based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, which is greatly improved compared to state-of-the-art diffusion-based TTS models. Trained on a public 100K hours multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at https://SWivid.github.io/F5-TTS. We release all code and checkpoints to promote community development.

Authors (8)

Yushen Chen
Zhikang Niu
Ziyang Ma
Keqi Deng
Chunhui Wang
Jian Zhao
Kai Yu
Xie Chen

Summary

The paper introduces a novel non-autoregressive TTS system that bypasses traditional duration models by using flow matching and denoising techniques.
It uses ConvNeXt to refine the text representation, which eases alignment with speech and speeds up training, and an inference-time Sway Sampling strategy that improves performance and efficiency without retraining.
Experimental results demonstrate lower WER and high naturalness, indicating the model's potential for real-time, multilingual speech applications.

Overview of F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

The paper "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching" presents a fully non-autoregressive text-to-speech (TTS) system that combines flow matching with a Diffusion Transformer (DiT). It targets two weaknesses of its predecessor, E2 TTS: slow convergence and low robustness.

Methodology

F5-TTS is fully non-autoregressive, which allows faster inference than autoregressive models. It needs no duration model, text encoder, or phoneme alignment. Instead, the text input is padded with filler tokens to the same length as the input speech, and speech is then generated by denoising. E2 TTS first showed that this design is feasible, but it suffered from slow convergence and low robustness.
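The padding step can be illustrated with a minimal sketch. The filler symbol `<F>`, the function name `pad_text_to_frames`, and the choice to reject text longer than the frame sequence are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from typing import List

FILLER = "<F>"  # hypothetical filler symbol, chosen only for this sketch


def pad_text_to_frames(text_tokens: List[str], n_frames: int,
                       filler: str = FILLER) -> List[str]:
    """Pad a token sequence with filler tokens up to the number of speech frames."""
    if len(text_tokens) > n_frames:
        # Assumption: overly long text is treated as an error in this sketch.
        raise ValueError("text is longer than the speech frame sequence")
    return list(text_tokens) + [filler] * (n_frames - len(text_tokens))


if __name__ == "__main__":
    chars = list("hi there")                 # character-level tokens
    padded = pad_text_to_frames(chars, 12)   # 12 speech frames
    print(padded)
```

After padding, text and speech have the same sequence length, so the model can learn the alignment itself instead of relying on a separate duration model or aligner.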

To tackle these issues, the paper combines a standard flow matching objective with two contributions of its own:

Input Modeling with ConvNeXt: The text representation is refined with ConvNeXt before it is combined with the speech input, which makes the text easier to align with the speech.
Inference-Time Sway Sampling: A sampling strategy for flow steps that improves both performance and efficiency, and that can be applied to existing flow matching models without retraining.
Flow Matching Objective: The model is trained with Conditional Flow Matching (CFM) along Optimal Transport paths, learning a vector field that transports samples from a simple initial distribution to the distribution of speech.
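The flow matching and sampling ideas can be sketched on toy vectors. This sketch assumes the Optimal Transport path is the linear interpolation x_t = (1 - t) x_0 + t x_1 with target velocity x_1 - x_0, and that Sway Sampling maps a uniform step u in [0, 1] to t = u + s (cos(pi u / 2) - 1 + u). Both forms should be checked against the paper. The function names and the value s = -1 are illustrative choices, and the "model" here is simply the exact target velocity rather than a trained network.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def ot_path(x0: Sequence[float], x1: Sequence[float],
            t: float) -> Tuple[Vector, Vector]:
    """Return the point x_t on the linear path and the target velocity x1 - x0."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def sway(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step (assumed form)."""
    return u + s * (math.cos(math.pi / 2.0 * u) - 1.0 + u)


def sway_schedule(n_steps: int, s: float = -1.0) -> List[float]:
    """Time points from 0 to 1; with s < 0 they are denser near t = 0."""
    return [sway(i / n_steps, s) for i in range(n_steps + 1)]


def euler_integrate(x0: Sequence[float],
                    velocity_fn: Callable[[Vector, float], Vector],
                    times: Sequence[float]) -> Vector:
    """Integrate dx/dt = velocity_fn(x, t) with Euler steps over the given times."""
    x = list(x0)
    for t_cur, t_next in zip(times, times[1:]):
        v = velocity_fn(x, t_cur)
        x = [xi + (t_next - t_cur) * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise, target = [0.0, 1.0], [2.0, -1.0]
    times = sway_schedule(8)
    # Stand-in for a trained model: the exact target velocity of the path.
    result = euler_integrate(noise, lambda x, t: ot_path(noise, target, t)[1], times)
    print([round(t, 3) for t in times])
    print(result)
```

Because the schedule only changes which time points the ODE solver visits, it touches nothing in training, which is consistent with the claim that the strategy can be used without retraining.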

F5-TTS reaches an inference Real-Time Factor (RTF) of 0.15, which the authors describe as a large improvement over state-of-the-art diffusion-based TTS models. It also synthesizes natural and expressive speech in zero-shot scenarios.
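As a worked example of what an RTF of 0.15 means, the sketch below uses the usual definition of RTF as synthesis time divided by the duration of the generated audio. The function name and the 10-second clip are illustrative assumptions, not measurements from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical clip: 10 s of audio generated in 1.5 s of compute.
    print(real_time_factor(1.5, 10.0))
```

Values below 1 mean generation is faster than playback. At an RTF of 0.15, producing a clip takes about 15% of the clip's duration.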

Results and Discussion

F5-TTS achieved a lower Word Error Rate (WER) than the models it was compared against and performed robustly across test sets, with strong naturalness and speaker similarity. Lower WER indicates that the synthesized speech stays faithful to the input text, while naturalness and speaker similarity indicate how closely it resembles human speech and the target voice. Together these results suggest the system suits real-world settings where both quality and speed matter.
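For readers unfamiliar with the metric, WER is commonly computed as the word-level edit distance between a reference text and a transcript of the synthesized speech, divided by the number of reference words. The sketch below is a generic implementation under that definition. It is not the paper's evaluation pipeline, which would also involve a speech recognizer and text normalization, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = cur[j - 1] + 1   # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one insertion (the) over 5 reference words.
    print(word_error_rate("the cat sat on mat", "the cat sit on the mat"))
```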

Implications and Future Directions

The methods in F5-TTS have both theoretical and practical implications. Sway Sampling can be applied to other flow matching models without retraining, and ConvNeXt-based text modeling shows that refining the text input alone can ease alignment in non-autoregressive TTS. The model's seamless code-switching also extends its usefulness in multilingual settings.

Future research could explore further optimization of the system's efficiency and robustness, investigating alternative architectures or hybrid models that leverage the strengths of both autoregressive and non-autoregressive approaches. Additionally, extending this framework to other generative tasks, such as music or image generation, could reveal broader applications of the underlying flow matching techniques.

In conclusion, F5-TTS advances TTS by balancing high-quality synthesis with computational efficiency, which makes it a viable option for practical applications.
