E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

2024-07-02


Summary

This paper introduces E2 TTS, a simple text-to-speech system that produces very natural-sounding speech and can mimic a new speaker's voice without extra training for that speaker. It aims to make text-to-speech technology easier to build and more flexible to use.

What's the problem?

Many existing zero-shot text-to-speech (TTS) systems are complex. They often rely on extra components, such as a duration model or a grapheme-to-phoneme converter, or on techniques such as monotonic alignment search, to make the generated speech sound good and match the speaker's voice. This makes these systems harder to implement and less flexible.

What's the solution?

E2 TTS is fully non-autoregressive and zero-shot, meaning it can generate speech in a speaker's voice without prior training on that specific voice. The system converts the text into a character sequence with filler tokens, then trains a flow-matching model to generate a mel spectrogram through an audio infilling task. This approach removes the need for extra components such as a duration model or grapheme-to-phoneme conversion, and it allows flexibility in how the input is represented. Despite its simplicity, E2 TTS achieves results that are comparable to or better than those of more complex systems such as Voicebox and NaturalSpeech 3.
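The text-preparation step can be illustrated with a minimal sketch. It assumes the filler tokens are appended after the characters until the sequence is as long as the target number of mel spectrogram frames; the token symbol `<F>` and the function name `pad_with_filler` are hypothetical and not taken from the summary above.

```python
FILLER = "<F>"  # hypothetical symbol for the filler token


def pad_with_filler(text: str, num_frames: int) -> list[str]:
    """Split text into characters and append filler tokens up to num_frames.

    Assumption: the padded sequence should have one entry per audio frame.
    """
    chars = list(text)
    if len(chars) > num_frames:
        raise ValueError("text is longer than the target number of frames")
    return chars + [FILLER] * (num_frames - len(chars))


# Example: 8 characters padded to a 12-frame target.
tokens = pad_with_filler("hi there", 12)
```

Because the padded sequence has the same length as the audio, no separate duration model is needed in this sketch to decide how long each character lasts.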

Why does it matter?

This research is important because it makes text-to-speech technology more accessible and practical for various applications. By simplifying the process and reducing the need for extensive training, E2 TTS can be used in many areas, such as virtual assistants, audiobooks, and any situation where natural-sounding speech is needed quickly. This advancement could lead to broader use of TTS technology in everyday life.

Abstract

This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the audio infilling task. Unlike many previous works, it does not require additional components (e.g., duration model, grapheme-to-phoneme) or complex techniques (e.g., monotonic alignment search). Despite its simplicity, E2 TTS achieves state-of-the-art zero-shot TTS capabilities that are comparable to or surpass previous works, including Voicebox and NaturalSpeech 3. The simplicity of E2 TTS also allows for flexibility in the input representation. We propose several variants of E2 TTS to improve usability during inference. See https://aka.ms/e2tts/ for demo samples.
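To make the training idea in the abstract more concrete, here is a rough sketch of how one flow-matching training example for audio infilling might be built. It is not the authors' implementation. It assumes the commonly used straight-line (optimal-transport) probability path, a small `sigma_min`, zeroed-out masked frames as the conditioning context, and a loss computed only on masked frames; all function names are hypothetical, and a constant zero prediction stands in for the real network.

```python
import numpy as np


def flow_matching_example(x1, mask, rng, sigma_min=1e-5):
    """Build one training example on a straight-line probability path.

    x1:   target mel spectrogram, shape (frames, mel_bins)
    mask: boolean array, shape (frames,), True where audio must be infilled
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # flow step in [0, 1)
    # Point on the path between noise (t=0) and data (t=1).
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    # Velocity the model is asked to predict at x_t.
    target = x1 - (1.0 - sigma_min) * x0
    # Conditioning context: the spectrogram with masked frames zeroed out.
    context = np.where(mask[:, None], 0.0, x1)
    return x_t, t, context, target


def masked_mse(pred, target, mask):
    """Mean squared error over masked frames only (an assumption)."""
    return float(((pred - target) ** 2)[mask].mean())


rng = np.random.default_rng(0)
mel = rng.standard_normal((20, 4))  # toy "mel spectrogram"
mask = np.zeros(20, dtype=bool)
mask[5:15] = True                   # frames 5..14 are to be infilled

x_t, t, context, target = flow_matching_example(mel, mask, rng)
loss = masked_mse(np.zeros_like(target), target, mask)  # dummy prediction
```

In a real system the prediction would come from a neural network that receives `x_t`, `t`, `context`, and the filler-padded character sequence.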
