One of the standout features of OpenAudio S1 is its emotion and tone control: it supports more than 50 emotion and tone markers, such as angry, happy, sad, whisper, and sympathy. Users modulate the output with simple text commands that adjust speech rate, volume, and pauses, or that add expressive effects such as laughter or whispering. Because the model follows these instructions closely, developers can control emphasis and pacing in real time through an API, which makes it suitable for a wide range of voice generation tasks.
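To make the idea of text-level markers concrete, here is a minimal sketch of how a script could be annotated before it is sent for synthesis. It assumes markers are written in parentheses in front of the text they affect; that syntax, and the helper names `with_marker` and `build_script`, are illustrative assumptions rather than a confirmed part of the OpenAudio S1 interface. The sketch only builds strings and makes no API call.

```python
# Markers named in the text above; the full set of 50+ is not reproduced here.
KNOWN_MARKERS = {"angry", "happy", "sad", "whisper", "sympathy"}


def with_marker(text: str, marker: str) -> str:
    """Prefix text with a parenthesized marker (assumed syntax)."""
    if marker not in KNOWN_MARKERS:
        raise ValueError(f"unknown marker: {marker!r}")
    return f"({marker}) {text}"


def build_script(segments: list[tuple[str, str]]) -> str:
    """Join (marker, text) pairs into a single prompt string."""
    return " ".join(with_marker(text, marker) for marker, text in segments)


if __name__ == "__main__":
    script = build_script([
        ("happy", "We won!"),
        ("whisper", "Don't tell anyone."),
    ])
    print(script)
```

Validating markers against a known set before synthesis catches typos early, since a misspelled marker would otherwise likely be read aloud as ordinary text.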
OpenAudio S1 supports zero-shot and few-shot voice cloning from a 10 to 30 second audio sample and produces a high-fidelity clone within a minute, which suits personalized audio experiences or celebrity voice simulations. The architecture uses a dual autoregressive design that combines a fast and a slow Transformer module for stable, efficient voice generation. It supports 13 languages, including English, Chinese, Japanese, French, and German, with high accuracy and latency low enough for either cloud deployment or local use. The model comes in two versions: the full 4-billion-parameter S1, available through cloud services, and a lightweight open-source S1-mini intended for research and educational use.
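Since cloning depends on a 10 to 30 second sample, it can help to check a clip's length before uploading it. The sketch below does this for WAV files with Python's standard `wave` module. The function names and the choice to treat the 10 and 30 second bounds as hard limits are assumptions made for illustration; the model itself may accept clips outside this window.

```python
import io
import wave

# Bounds taken from the text: 10 to 30 seconds of reference audio.
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def wav_duration_seconds(wav_file) -> float:
    """Return the duration of a WAV file given a path or file-like object."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_usable_reference(wav_file) -> bool:
    """Hypothetical pre-check: is the clip inside the 10 to 30 second window?"""
    return MIN_SECONDS <= wav_duration_seconds(wav_file) <= MAX_SECONDS


def make_silent_wav(seconds: float, rate: int = 16000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for seconds in (5, 15, 45):
        clip = make_silent_wav(seconds)
        print(seconds, is_usable_reference(clip))
```

In practice the argument would be a path to a recorded voice sample; the silent clips here exist only so the example runs without external files.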