MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

Ziyue Jiang, Yi Ren, Ruiqi Li, Shengpeng Ji, Boyang Zhang, Zhenhui Ye, Chen Zhang, Bai Jionghao, Xiaoda Yang, Jialong Zuo, Yu Zhang, Rui Liu, Xiang Yin, Zhou Zhao

2025-04-03


Summary

This paper introduces MegaTTS 3, an AI system that can create realistic-sounding speech from text, even in new voices it has never heard before.

What's the problem?

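Existing AI models that create speech from text tend to fall into two camps. Models that learn how words line up with sounds on their own are less reliable, especially on difficult sentences. Models that rely on pre-set alignments are more reliable but sound less natural, because the fixed alignment limits how the speech can flow.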

What's the solution?

MegaTTS 3 uses a new method that provides some guidance on how the words and sounds should align, but it still allows for flexibility, which results in more natural-sounding speech. It can also adjust the intensity of accents and generate speech quickly.

Why does it matter?

This matters because it can lead to more realistic and expressive AI voices, which could be used in applications like audiobooks, virtual assistants, and accessibility tools.

Abstract

While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) models without explicit speech-text alignment modeling exhibit less robustness, especially for hard sentences in practical applications; 2) predefined alignment-based models suffer from naturalness constraints of forced alignments. This paper introduces MegaTTS 3, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space, thereby achieving high naturalness. Moreover, we employ a multi-condition classifier-free guidance strategy for accent intensity adjustment and adopt the piecewise rectified flow technique to accelerate the generation process. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity. Notably, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available at https://sditdemo.github.io/sditdemo/.
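To make the multi-condition classifier-free guidance idea concrete, here is a minimal sketch of one common two-condition formulation, in which a text guidance scale and a speaker guidance scale are applied separately. The abstract does not give the exact formula or say which scale governs accent intensity, so the combination rule, the function names (`multi_condition_cfg`, `toy_model`), and the parameters (`text_scale`, `speaker_scale`) are all assumptions for illustration, not the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical denoiser signature: (noisy latent, text, speaker prompt) -> prediction.
# Passing None for a condition stands in for dropping that condition.
Denoiser = Callable[[Sequence[float], Optional[str], Optional[str]], List[float]]


def multi_condition_cfg(
    model: Denoiser,
    x: Sequence[float],
    text: str,
    speaker: str,
    text_scale: float,
    speaker_scale: float,
) -> List[float]:
    """Combine three model predictions with two independent guidance scales.

    Assumed formulation (not taken from the paper):
        out = uncond
              + text_scale    * (text_only - uncond)
              + speaker_scale * (full - text_only)
    """
    uncond = model(x, None, None)      # no conditions
    text_only = model(x, text, None)   # text condition only
    full = model(x, text, speaker)     # text and speaker prompt
    return [
        u + text_scale * (t - u) + speaker_scale * (f - t)
        for u, t, f in zip(uncond, text_only, full)
    ]


def toy_model(
    x: Sequence[float], text: Optional[str], speaker: Optional[str]
) -> List[float]:
    """Stand-in for a real network: each active condition shifts the output."""
    shift = (1.0 if text is not None else 0.0) + (0.5 if speaker is not None else 0.0)
    return [v + shift for v in x]


if __name__ == "__main__":
    latent = [0.0, 1.0]
    # With both scales at 1.0 the result equals the fully conditioned prediction.
    print(multi_condition_cfg(toy_model, latent, "hello", "prompt", 1.0, 1.0))
    # Raising one scale pushes the output further along that condition only.
    print(multi_condition_cfg(toy_model, latent, "hello", "prompt", 2.0, 3.0))
```

In this formulation the two scales can be tuned independently. Separate knobs of this kind are what would let a system trade off how closely it follows the text against how closely it follows the speaker prompt.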
