Key Features

Open-source expressive TTS release with public inference code and model weights.
Uses the S2 Pro model for realistic multilingual speech synthesis.
Supports 80+ languages, with highest-quality tiers for Japanese, English, and Chinese.
Provides fine-grained inline control through natural-language bracket tags.
Supports more than 15,000 expressive tags, including pauses, whispers, laughter, singing, and emphasis.
Supports multi-speaker generation with speaker control tokens.
Uses a Dual-AR (dual-autoregressive) architecture with a 4B-parameter Slow AR component and a 400M-parameter Fast AR component.
Provides SGLang-based streaming inference with a reported time to first audio (TTFA) of about 100 ms when served on H200-class GPUs.
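
As a rough illustration of the inline bracket tags and speaker control tokens listed above, the following Python sketch assembles a two-speaker script as plain text. It does not call any Fish Audio API. The helper names (`bracket_tag`, `speaker_token`, `build_script`), the specific tag words, and especially the `<|speaker:N|>` token format are assumptions made for this example, so check the release documentation for the exact syntax before relying on them.

```python
from typing import Iterable, Tuple


def bracket_tag(instruction: str) -> str:
    """Wrap a natural-language instruction in square brackets, e.g. [whisper]."""
    cleaned = " ".join(instruction.split())  # collapse stray whitespace
    if not cleaned or any(ch in cleaned for ch in "[]"):
        raise ValueError("tag text must be non-empty and contain no brackets")
    return f"[{cleaned}]"


def speaker_token(index: int) -> str:
    """Return a speaker control token. The format here is hypothetical."""
    if index < 0:
        raise ValueError("speaker index must be non-negative")
    return f"<|speaker:{index}|>"


def build_script(turns: Iterable[Tuple[int, str]]) -> str:
    """Join (speaker index, tagged text) pairs into one script, one turn per line."""
    return "\n".join(f"{speaker_token(i)}{text.strip()}" for i, text in turns)


script = build_script([
    (0, f"{bracket_tag('whisper')} Did you hear that? {bracket_tag('pause')}"),
    (1, f"{bracket_tag('laughing')} It was only the wind."),
])
print(script)
```

Keeping the tags as ordinary text inside the script means the same string can be logged, diffed, and reviewed before it is sent to whichever inference entry point the release provides.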

S2 Pro uses a Dual-Autoregressive architecture with a 4B-parameter Slow AR component for semantic prediction and a 400M-parameter Fast AR component for acoustic detail. Fish Audio reports training on more than 10 million hours of audio, support for 80+ languages, over 15,000 natural-language control tags, and an SGLang-based streaming inference engine.
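
For self-hosting, the stated parameter counts give a quick lower bound on GPU memory for the weights alone. The sketch below is back-of-the-envelope arithmetic, not a published requirement: it assumes the rounded 4B and 400M figures, assumes common storage precisions that the source does not specify, and ignores the audio codec, KV cache, activations, and serving overhead.

```python
# Rounded parameter counts taken from the description above.
SLOW_AR_PARAMS = 4_000_000_000
FAST_AR_PARAMS = 400_000_000

# Bytes per parameter for common precisions (assumed, not stated by the release).
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}


def weight_memory_gb(params: int, dtype: str) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


total_params = SLOW_AR_PARAMS + FAST_AR_PARAMS
estimates = {dtype: weight_memory_gb(total_params, dtype) for dtype in BYTES_PER_PARAM}

for dtype, gb in estimates.items():
    print(f"{dtype}: ~{gb:.1f} GB for weights only")
```

Actual memory use during streaming inference will be higher than these figures, so treat them as a floor when sizing hardware.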


Fish Audio S2 is useful for researchers, developers, and creative voice teams that want more control than a fixed voice preset library offers. The release includes inference code, model weights, fine-tuning support, and self-hosting paths for teams that can operate GPU infrastructure. Commercial use requires a separate Fish Audio license.
