Continuous Speech Synthesis using per-token Latent Diffusion
Arnon Turetzky, Nimrod Shabtay, Slava Shechtman, Hagai Aronowitz, David Haws, Ron Hoory, Avihu Dekel
2024-10-28

Summary
This paper introduces SALAD, a new method for continuous speech synthesis that improves text-to-speech technology by using a per-token latent diffusion model.
What's the problem?
Many text-to-speech systems quantize audio into discrete tokens, which caps the quality of the generated speech: a finite codebook cannot faithfully represent a continuous audio signal, so reconstruction quality suffers. Additionally, some systems require precise text-audio alignment, which can be difficult to obtain.
What's the solution?
The authors propose SALAD, a model that operates on continuous representations rather than discrete tokens. It uses semantic tokens to provide contextual information and to determine when to stop generating audio. The paper presents three variants for generating speech: predicting acoustic features directly from text (Text2Acoustic), predicting acoustic features from semantic tokens, and a combination of the two. The results show that SALAD produces speech that is more intelligible than, and as natural-sounding as, that of comparable discrete-token methods.
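To make the decoding loop concrete, here is a minimal, hypothetical sketch of per-token latent diffusion generation. It is not the authors' code: the denoiser, the backbone, the latent dimension, and the stopping rule are all toy stand-ins. The point is the structure the summary describes: at each autoregressive step a backbone produces conditioning, a small diffusion head denoises a Gaussian sample into a continuous acoustic latent, and a parallel semantic-token prediction decides when to stop.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8            # assumed size of a continuous acoustic latent
NUM_DIFFUSION_STEPS = 10  # assumed number of reverse-diffusion steps
EOS = 0                   # hypothetical end-of-sequence semantic token

def diffusion_head(x_noisy, cond, step):
    """Toy denoiser: nudge the noisy latent toward the conditioning vector.
    A real diffusion head would be a learned network predicting noise."""
    return x_noisy + 0.5 * (cond - x_noisy)

def sample_latent(cond):
    """Reverse-diffusion sampling of one continuous token from Gaussian noise."""
    x = rng.standard_normal(LATENT_DIM)
    for step in reversed(range(NUM_DIFFUSION_STEPS)):
        x = diffusion_head(x, cond, step)
    return x

def backbone(history):
    """Toy autoregressive backbone: returns per-step conditioning and a
    semantic token; here it simply emits EOS after five generated tokens."""
    cond = np.full(LATENT_DIM, float(len(history)))
    semantic_token = EOS if len(history) >= 5 else 1
    return cond, semantic_token

def generate():
    """Autoregressive loop: the semantic-token stream supplies the stop signal,
    while the diffusion head produces the continuous latents."""
    history = []
    while True:
        cond, semantic_token = backbone(history)
        if semantic_token == EOS:
            break
        history.append(sample_latent(cond))
    return np.stack(history)

latents = generate()
print(latents.shape)  # (5, 8)
```

The key contrast with discrete-token decoding is in `sample_latent`: instead of a softmax over a codebook, each token is a continuous vector drawn by running a (here trivialized) reverse-diffusion process conditioned on the backbone's output.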
Why it matters?
This research is significant because it advances the field of speech synthesis by demonstrating that continuous modeling can lead to higher-quality audio generation. By improving how machines generate speech, SALAD has the potential to enhance applications like virtual assistants, audiobooks, and other technologies that rely on natural-sounding voice generation.
Abstract
The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech, that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.