Fish Audio S2 Technical Report
Shijia Liao, Yuxuan Wang, Songting Liu, Yifan Cheng, Ruoyi Zhang, Tianyu Li, Shidong Li, Yisheng Zheng, Xingwei Liu, Qingzheng Wang, Zhizhuo Zhou, Jiahua Liu, Xin Chen, Dawei Han
2026-03-11
Summary
This paper introduces Fish Audio S2, a new text-to-speech (TTS) system that's available for anyone to use and modify. It's special because it can create speech with different voices, continue conversations naturally, and most importantly, follow specific instructions given in plain English about *how* the speech should sound.
What's the problem?
Creating realistic and controllable text-to-speech systems is really hard. Existing systems often struggle to sound natural over longer conversations, or to let users easily specify exactly what they want the voice to sound like – things like emotion, accent, or speaking style. Also, training these systems requires a lot of data and computing power, making it difficult for researchers and developers to build their own.
What's the solution?
The researchers tackled this by building a TTS system called Fish Audio S2 and developing a smart way to train it. They used a multi-stage training process and a staged pipeline to gather and prepare the data, including automatically generated captions for video and speech, and even a system to judge voice quality. They also created a tool that lets you easily use the system to generate speech, and it's designed to work quickly, producing audio with very little delay.
Why it matters?
This work is important because it makes advanced text-to-speech technology more accessible to everyone. By releasing the code, the trained model, and a user-friendly interface, they're allowing others to build upon their work and create even better TTS systems. The speed and control offered by Fish Audio S2 could be useful in many applications, like virtual assistants, audiobooks, and accessibility tools.
Abstract
We introduce Fish Audio S2, an open-source text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language descriptions. To scale training, we develop a multi-stage training recipe together with a staged data pipeline covering video captioning, speech captioning, voice-quality assessment, and reward modeling. To push the frontier of open-source TTS, we release our model weights, fine-tuning code, and an SGLang-based inference engine. The inference engine is production-ready for streaming, achieving an RTF of 0.195 and a time-to-first-audio below 100 ms. Our code and weights are available on GitHub (https://github.com/fishaudio/fish-speech) and Hugging Face (https://huggingface.co/fishaudio/s2-pro). We highly encourage readers to visit https://fish.audio to try custom voices.
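To make the reported streaming figures concrete, the real-time factor (RTF) is the ratio of wall-clock synthesis time to the duration of the audio produced; an RTF below 1 means the engine generates audio faster than it plays back. The sketch below illustrates the metric itself (the function and numbers here are illustrative, not part of the released codebase):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the generated audio.

    RTF < 1 means faster than real time; the reported 0.195 implies the
    engine produces audio roughly 5x faster than playback speed.
    """
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


# Illustrative example: 10 s of audio synthesized in 1.95 s of wall time.
rtf = real_time_factor(1.95, 10.0)
print(f"RTF = {rtf:.3f}")  # RTF = 0.195

# For streaming, time-to-first-audio (the sub-100 ms figure) is measured
# separately: it is the latency from request submission until the first
# audio chunk is available, independent of the overall RTF.
```

Together the two numbers characterize a streaming TTS deployment: time-to-first-audio governs perceived responsiveness, while RTF determines whether playback can proceed without buffering once audio starts.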