S2 Pro uses a Dual-Autoregressive architecture with a 4B-parameter Slow AR component for semantic prediction and a 400M-parameter Fast AR component for acoustic detail. Fish Audio reports training on more than 10 million hours of audio, support for 80+ languages, over 15,000 natural-language control tags, and an SGLang-based streaming inference engine.
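
To make the slow/fast split concrete, here is a toy sketch of how a dual-autoregressive decode loop can be arranged. It is an illustration only, not Fish Audio's implementation. The source does not describe the actual token layout, so the per-frame structure, the `codebooks` parameter, and the functions `dual_ar_decode`, `toy_slow_step`, and `toy_fast_step` are all hypothetical. The stand-in "models" are simple arithmetic so the loop runs without any weights.

```python
from typing import Callable, List

Frame = List[int]  # one semantic token followed by acoustic tokens (assumed layout)


def dual_ar_decode(
    slow_step: Callable[[List[Frame]], int],
    fast_step: Callable[[int, List[int]], int],
    num_frames: int,
    codebooks: int,
) -> List[Frame]:
    """Generate frames with an outer (slow) and an inner (fast) autoregressive loop."""
    frames: List[Frame] = []
    for _ in range(num_frames):
        # Slow AR: one semantic token per frame, conditioned on all earlier frames.
        semantic = slow_step(frames)
        codes = [semantic]
        # Fast AR: fill in the remaining tokens for this frame, one at a time,
        # conditioned on the semantic token and the tokens produced so far.
        for _ in range(codebooks - 1):
            codes.append(fast_step(semantic, codes))
        frames.append(codes)
    return frames


# Hypothetical stand-ins for the two networks (plain arithmetic, no learning).
def toy_slow_step(history: List[Frame]) -> int:
    return len(history) % 7


def toy_fast_step(semantic: int, codes: List[int]) -> int:
    return (semantic + len(codes)) % 5


frames = dual_ar_decode(toy_slow_step, toy_fast_step, num_frames=3, codebooks=4)
print(frames)  # [[0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 0]]
```

The point of the arrangement is that the large model runs once per frame while the small model handles the several cheaper steps inside each frame.
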
Fish Audio S2 is aimed at researchers, developers, and creative voice teams that want more control than a fixed library of voice presets offers. The release includes inference code, model weights, fine-tuning support, and self-hosting paths for teams that can operate GPU infrastructure. Commercial use requires a separate Fish Audio license.
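
For teams sizing GPU infrastructure, a rough back-of-the-envelope estimate of weight memory follows from the parameter counts quoted above. This is a sketch under stated assumptions: the 4B and 400M figures are rounded, the numeric precisions listed are common choices rather than formats the release is confirmed to ship in, and the result covers weights only (no KV cache, activations, or audio codec). The helper `weight_memory_gb` is hypothetical.

```python
# Rounded parameter counts as quoted for S2 Pro (assumed exact for this estimate).
SLOW_AR_PARAMS = 4_000_000_000
FAST_AR_PARAMS = 400_000_000


def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


total_params = SLOW_AR_PARAMS + FAST_AR_PARAMS

# Common storage widths; which ones the released weights use is not stated in the source.
for name, width in (("fp32", 4), ("bf16/fp16", 2), ("int8", 1)):
    print(f"{name}: {weight_memory_gb(total_params, width):.1f} GB")
```

Treat these numbers as a lower bound: real serving memory is higher once caches and runtime overhead are included.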


