DMOSpeech 2: Reinforcement Learning for Duration Prediction in Metric-Optimized Speech Synthesis

Yinghao Aaron Li, Xilin Jiang, Fei Tao, Cheng Niu, Kaifeng Xu, Juntong Song, Nima Mesgarani

2025-07-25

Summary

This paper introduces DMOSpeech 2, a text-to-speech system that uses reinforcement learning to better predict how long each part of the speech should last.

What's the problem?

In text-to-speech systems, predicting how long each sound should last is difficult but important: duration errors make the speech sound unnatural or unclear, and previous methods did not optimize durations directly for perceptual quality.

What's the solution?

The researchers used reinforcement learning to train the duration predictor directly on speech-quality metrics such as speaker similarity and word clarity. They also introduced a teacher-guided sampling method that increases the variability of the generated speech while keeping inference efficient.
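To make the idea concrete, here is a minimal toy sketch of the kind of policy-gradient (REINFORCE) training the summary describes: a duration predictor treated as a policy over discrete duration bins, updated so that sampled durations with higher "quality" reward become more likely. Everything here is a simplified assumption for illustration (the bin count, the reward function standing in for metrics like speaker similarity, and the single-token setup are all hypothetical, not the paper's actual method):

```python
import math
import random

random.seed(0)

K = 8        # hypothetical number of duration bins
TARGET = 5   # bin that maximizes the toy "quality" reward
LR = 0.2

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(bin_idx):
    # Toy stand-in for perceptual metrics (e.g. speaker similarity,
    # intelligibility); highest when the sampled duration hits TARGET.
    return -abs(bin_idx - TARGET)

logits = [0.0] * K
baseline = 0.0  # running mean reward, reduces gradient variance

for step in range(5000):
    probs = softmax(logits)
    # Sample a duration bin from the current policy.
    a = random.choices(range(K), weights=probs)[0]
    r = reward(a)
    baseline += 0.01 * (r - baseline)
    adv = r - baseline
    # REINFORCE update: grad of log pi(a) w.r.t. logits is
    # one_hot(a) - probs for a softmax policy.
    for i in range(K):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += LR * adv * grad

probs = softmax(logits)
best = max(range(K), key=lambda i: probs[i])
```

After training, the policy concentrates on the high-reward duration bin; in the real system the reward would come from the quality metrics rather than a hand-coded target.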

Why it matters?

This matters because DMOSpeech 2 makes AI voices sound more natural and clear, while being faster and more efficient, which is useful for applications like voice assistants, audiobooks, and accessibility tools.

Abstract

DMOSpeech 2 optimizes duration prediction in diffusion-based text-to-speech using reinforcement learning and introduces teacher-guided sampling to enhance diversity and efficiency.