Text to Speech
Discover and compare the best AI models for text-to-speech generation. Note: this is my personal, non-scientific leaderboard. Models are ranked by their completion rate on a series of diverse prompts designed to thoroughly assess performance.
| Rank | Company | Model | Score |
|---|---|---|---|
| 1 | Microsoft | | 88.68 |
| 2 | OpenBMB | | 88.3 |
| 3 | Bilibili Index | | 87.5 |
| 4 | MiniMax | Speech-02-HD | 87 |
| 5 | Fish Audio | | 85 |
| 6 | SWivid | | 83.95 |
| 7 | RedNote | | 82.7 |
| 8 | ElevenLabs | | 82.65 |
| 9 | Resemble AI | Chatterbox | 79.4 |
| 10 | Boson AI | | 79.33 |
| 11 | Zyphra | | 74 |
| 12 | Kokoro | Kokoro 82M | 71 |
| 13 | Coqui | XTTS-v2 | 69.42 |
Methodology
Models are ranked using a series of prompts that cover a diverse range of challenging tasks. The evaluation covers:
- Naturalness and human-like quality
- Pronunciation accuracy
- Emotions and expressions
- Open-source vs closed-source
- Different accents and languages
- Voice cloning consistency
To prevent manipulation, the prompts are kept confidential and are regularly updated to increase difficulty as models improve. Here is a subset of prompts for your reference:
- The record producer refused to record the band’s new single.
- The wind was too strong to wind the kite string around the spool.
- Are you serious? No, I’m joking—seriously, I’m not serious!
- She sells seashells by the seashore, but the shells she sells aren’t cheap.
- The Dr. who lives at 1234 St. Dr. prescribed 2 tsp. of medicine for Feb. 2, 2023.
- The fiesta was très magnifique, with 你好 greetings and English pop music.
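
For readers who want to reproduce a similar ranking with their own prompts, here is a minimal sketch of a completion-rate score. The exact scoring scheme behind this leaderboard is not published, so everything here is an assumption: each prompt is given a credit between 0 (failed) and 1 (fully handled), a model's score is the average credit as a percentage, and `completion_rate`, `rank_models`, `model_a`, and `model_b` are hypothetical names with made-up credits, not real results.

```python
from typing import Dict, List, Tuple


def completion_rate(credits: List[float]) -> float:
    """Average per-prompt credit as a percentage (0-100).

    Assumption: each credit is a manual judgement in [0, 1],
    where 1 means the model handled the prompt correctly.
    """
    if not credits:
        return 0.0
    return 100.0 * sum(credits) / len(credits)


def rank_models(credits_by_model: Dict[str, List[float]]) -> List[Tuple[str, float]]:
    """Return (model, score) pairs sorted from highest to lowest score."""
    scores = {name: completion_rate(c) for name, c in credits_by_model.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical credits for two placeholder models on two prompts.
example = {
    "model_a": [1.0, 0.0],  # handled the first prompt, failed the second
    "model_b": [1.0, 1.0],  # handled both prompts
}

for rank, (name, score) in enumerate(rank_models(example), start=1):
    print(rank, name, score)
```

Fractional credit is used here rather than strict pass/fail because it allows partial success on a prompt, such as correct pronunciation with flat emotion; a binary judgement is just the special case where every credit is 0 or 1.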







