
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

Bowen Zhang, Congchao Guo, Geng Yang, Hang Yu, Haozhe Zhang, Heidi Lei, Jialong Mai, Junjie Yan, Kaiyue Yang, Mingqi Yang, Peikai Huang, Ruiyang Jin, Sitan Jiang, Weihua Cheng, Yawei Li, Yichen Xiao, Yiying Zhou, Yongmao Zhang, Yuan Lu, Yucen He

2025-05-14


Summary

This paper talks about MiniMax-Speech, a new AI system that can create realistic speech in different voices, even ones it has never heard before, by learning a speaker's characteristics directly from audio samples.

What's the problem?

Most text-to-speech systems need a lot of information about a person's voice, such as written transcripts or large amounts of training data, to accurately clone it. This makes them hard to use for new speakers or in situations where little data is available.

What's the solution?

The researchers built MiniMax-Speech around a learnable speaker encoder, a component that picks up the unique features of a person's voice just from listening to audio, without needing any written text. This lets the system generate high-quality speech that sounds like the reference speaker, even when it is hearing that voice for the first time, as sketched below.
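To make the idea concrete, here is a minimal, illustrative sketch in PyTorch-style Python: a learnable speaker encoder pools a reference mel spectrogram into a fixed-size embedding, which then conditions an autoregressive Transformer decoder. All module names, layer sizes, and the mean-pooling and prepend-conditioning choices here are assumptions for illustration, not the paper's actual architecture.

```python
# A minimal sketch of the general idea, NOT MiniMax-Speech's actual design.
# Assumptions for illustration: mel-spectrogram inputs, mean pooling over
# time, and conditioning the decoder by prepending the speaker embedding.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Maps an untranscribed reference clip (as a mel spectrogram) to a
    fixed-size speaker embedding. No text is involved at any point."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, time, n_mels) from any reference audio clip
        hidden = self.encoder(self.proj(ref_mel))
        # Mean-pool over time: one embedding per reference speaker
        return hidden.mean(dim=1)  # (batch, d_model)


class ARDecoder(nn.Module):
    """Toy autoregressive decoder over discrete speech tokens, conditioned
    on the speaker embedding by prepending it to the input sequence."""

    def __init__(self, vocab: int = 1024, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)                       # (batch, seq, d_model)
        x = torch.cat([spk.unsqueeze(1), x], dim=1)  # prepend speaker condition
        seq = x.size(1)
        # Causal mask so each position attends only to earlier positions
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        out = self.layers(x, mask=mask)
        return self.head(out[:, 1:])                 # next-token logits


if __name__ == "__main__":
    spk_enc, decoder = SpeakerEncoder(), ARDecoder()
    ref = torch.randn(1, 200, 80)              # short clip of an unseen voice
    speech_tokens = torch.randint(0, 1024, (1, 50))
    logits = decoder(speech_tokens, spk_enc(ref))
    print(logits.shape)                        # torch.Size([1, 50, 1024])
```

The key design point this sketch tries to capture is that the encoder only ever sees audio, so any clip, transcribed or not, can serve as the voice reference; in a real system the decoder would generate audio-codec tokens from input text in that target voice.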

Why does it matter?

This matters because it makes voice cloning much easier and more flexible, opening the door to applications like audiobook narration, restoring speech for people who have lost their voice, and more natural-sounding virtual assistants.

Abstract

MiniMax-Speech, an autoregressive Transformer-based TTS model, generates high-quality speech with a learnable speaker encoder that extracts reference speaker features without transcription, achieving SOTA results in voice cloning and supporting various extensions.