F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, Xie Chen
2024-10-10

Summary
This paper introduces F5-TTS, a new text-to-speech system that generates fluent and natural-sounding speech using a method called flow matching with a Diffusion Transformer.
What's the problem?
Traditional text-to-speech systems often require complex components such as duration models and phoneme alignment, which make them difficult to implement and less efficient. Additionally, existing simplified models can suffer from slow convergence and low robustness, leading to lower-quality speech generation.
What's the solution?
F5-TTS simplifies the pipeline with a non-autoregressive approach: the text input is padded with filler tokens to the same length as the speech input, so no duration model or phoneme alignment is needed. It also introduces Sway Sampling, an inference-time strategy for choosing flow-matching steps that improves both quality and efficiency. Trained on a large multilingual dataset, the system produces high-quality speech quickly and reliably without complicated components.
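The filler-token idea can be sketched in a few lines. This is a hypothetical illustration (the function name, filler value, and frame counts are assumptions, not the paper's code): because text and speech sequences are forced to the same length, the model needs no duration predictor or phoneme aligner.

```python
def pad_text_to_speech_len(text_tokens, n_speech_frames, filler=0):
    """Pad a text-token sequence with filler tokens so it matches the
    number of speech (mel) frames.

    Hypothetical sketch of the padding scheme described in the summary:
    text and speech become equal-length sequences, so the model can be
    trained to denoise speech conditioned on text with no explicit
    alignment step.
    """
    if len(text_tokens) > n_speech_frames:
        raise ValueError("text is longer than the target speech")
    return text_tokens + [filler] * (n_speech_frames - len(text_tokens))

# A 3-token text padded to 8 speech frames:
padded = pad_text_to_speech_len([7, 4, 9], 8)  # -> [7, 4, 9, 0, 0, 0, 0, 0]
```

In practice the filler would be a dedicated vocabulary entry rather than 0; the point is only that sequence lengths match by construction.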
Why it matters?
This research is significant because it improves how AI can convert text into speech, making it faster and more reliable. By streamlining the technology behind text-to-speech systems, F5-TTS can be used in various applications such as virtual assistants, audiobooks, and language learning tools, ultimately making communication with machines more natural and accessible.
Abstract
This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with Diffusion Transformer (DiT). Without requiring complex designs such as a duration model, text encoder, or phoneme alignment, the text input is simply padded with filler tokens to the same length as the input speech, and denoising is then performed for speech generation, an approach originally proven feasible by E2 TTS. However, the original design of E2 TTS makes it hard to follow due to its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model's performance and efficiency. This sampling strategy for flow steps can be easily applied to existing flow-matching-based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, a large improvement over state-of-the-art diffusion-based TTS models. Trained on a public 100K-hour multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and speed control efficiency. Demo samples can be found at https://SWivid.github.io/F5-TTS. We release all code and checkpoints to promote community development.
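The abstract notes that Sway Sampling reschedules flow steps at inference time and can be applied to existing flow-matching models without retraining. A minimal sketch follows, assuming the sway function takes the form t = u + s * (cos(pi/2 * u) - 1 + u) reported for the paper; the function names and the choice s = -1.0 are illustrative assumptions, not the official implementation.

```python
import math

def sway_sample(u, s=-1.0):
    """Map a uniform flow step u in [0, 1] to a swayed step.

    Assumed form of the Sway Sampling schedule:
        t = u + s * (cos(pi/2 * u) - 1 + u)
    A negative coefficient s concentrates steps near t = 0 (the early,
    noisier part of the flow); s = 0 recovers the uniform schedule.
    The mapping fixes the endpoints: t(0) = 0 and t(1) = 1.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)

def schedule(n_steps, s=-1.0):
    """Swayed replacement for a uniform n-step inference schedule."""
    return [sway_sample(i / n_steps, s) for i in range(n_steps + 1)]

uniform = schedule(8, s=0.0)   # evenly spaced steps 0.0 ... 1.0
swayed = schedule(8, s=-1.0)   # steps bunch toward the start of the flow
```

Because the change is purely a re-spacing of the ODE solver's time grid, it can be dropped into an existing flow-matching sampler without touching the trained model, which is what makes it retraining-free.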