Text to Speech

Discover and compare the best AI models for text-to-speech generation. Note: This is my personal, non-scientific leaderboard. Models are ranked by their completion rate on a series of diverse prompts designed to thoroughly assess performance.

Rank	Company	Model	Score
1	Microsoft	VibeVoice-Large	88.68
2	OpenBMB	VoxCPM	88.3
3	Bilibili Index	IndexTTS 2	87.5
4	MiniMax	Speech-02-HD	87
5	Fish Audio	OpenAudio S1	85
6	SWivid	F5 TTS	83.95
7	RedNote	FireRedTTS2	82.7
8	ElevenLabs	ElevenLabs v3	82.65
9	Resemble AI	Chatterbox	79.4
10	Boson AI	Higgs Audio V2	79.33
11	Zyphra	Zonos	74
12	Kokoro	Kokoro 82M	71
13	Coqui	XTTS-v2	69.42

Full tutorial & review videos

Watch the videos below for comprehensive comparisons and detailed installation guides for select text-to-speech models.

VibeVoice installation & review

ElevenLabs v3

F5 TTS installation & review

Zyphra Zonos installation & review

Sesame AI review

RVC voice cloning tutorial

RVC song cover tutorial

RVC full tutorial

Methodology

Models are ranked using a series of prompts covering a diverse range of challenging tasks. The evaluation considers:

Naturalness and human-like quality
Pronunciation accuracy
Emotions and expressions
Open-source vs closed-source
Different accents and languages
Voice cloning consistency

To prevent manipulation, the prompts are kept confidential and are regularly updated to increase difficulty as models improve. Here is a subset of prompts for your reference:

The record producer refused to record the band’s new single.

The wind was too strong to wind the kite string around the spool.

Are you serious? No, I’m joking—seriously, I’m not serious!

She sells seashells by the seashore, but the shells she sells aren’t cheap.

The Dr. who lives at 1234 St. Dr. prescribed 2 tsp. of medicine for Feb. 2, 2023.

The fiesta was très magnifique, with 你好 greetings and English pop music.
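
The exact scoring procedure isn't spelled out on this page, so the following is only a rough sketch of how a completion-rate ranking could be computed. It assumes each prompt gets a score between 0 and 1 (with partial credit allowed, which would explain the fractional scores in the table), and that a model's score is the average across prompts expressed as a percentage. The names `Result`, `completion_rate`, `rank_models`, and the models `model_a` and `model_b` are hypothetical and used purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Per-prompt scores for one model (hypothetical structure)."""
    model: str
    prompt_scores: list[float]  # assumption: each score is in [0, 1]


def completion_rate(prompt_scores: list[float]) -> float:
    """Mean per-prompt score as a percentage, rounded to two decimals."""
    if not prompt_scores:
        raise ValueError("at least one prompt score is required")
    return round(100 * sum(prompt_scores) / len(prompt_scores), 2)


def rank_models(results: list[Result]) -> list[tuple[int, str, float]]:
    """Return (rank, model, score) tuples, highest score first."""
    scored = [(r.model, completion_rate(r.prompt_scores)) for r in results]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(i + 1, model, score) for i, (model, score) in enumerate(scored)]


# Made-up scores for two hypothetical models across four prompts.
demo = [
    Result("model_a", [1.0, 0.5, 0.0, 1.0]),
    Result("model_b", [1.0, 1.0, 1.0, 0.5]),
]

for position, model, score in rank_models(demo):
    print(f"{position}\t{model}\t{score}")
```

A plain average treats every prompt as equally important. A weighted average would be a reasonable alternative if some task categories, such as voice cloning consistency, were meant to count for more.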