A defining feature of Dia is its sophisticated handling of non-verbal audio cues. Users can embed instructions like (laughs), (coughs), or (clears throat) directly into the text, and Dia will generate these sounds naturally within the speech output. This capability adds a layer of expressiveness and realism that is often lacking in other TTS systems. Additionally, Dia supports audio conditioning, allowing users to guide the model’s tone, emotion, or delivery style by providing a short audio sample. This feature enables a degree of voice style mimicry and emotional control, though the model does not clone specific voices by default. Each generation can produce a different voice unless a fixed seed or audio prompt is provided, offering flexibility for creative use cases.
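To make the cue syntax concrete, here is a small, hypothetical helper for composing transcripts with embedded non-verbal cues. The function and the list of cue names are illustrative assumptions for this sketch, not part of Dia's actual API; only the parenthesized tag format, e.g. `(laughs)`, comes from the project's documented usage.

```python
# Hypothetical helper for composing Dia-style transcripts with non-verbal cues.
# The cue names below are illustrative, not an exhaustive list of what Dia supports.
SUPPORTED_CUES = {"laughs", "coughs", "clears throat"}

def add_cue(text: str, cue: str) -> str:
    """Append a parenthesized non-verbal cue tag, validating it first."""
    if cue not in SUPPORTED_CUES:
        raise ValueError(f"Unsupported cue: {cue!r}")
    return f"{text} ({cue})"

line = add_cue("That's the funniest thing I've heard all week.", "laughs")
print(line)
# -> That's the funniest thing I've heard all week. (laughs)
```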
Dia is fully open source, with both the model weights and inference code available under the permissive Apache 2.0 license. This means anyone can download, modify, and deploy the model for research or commercial purposes, subject only to the license's lightweight attribution requirements. The model requires a modern GPU with at least 10GB of VRAM for optimal performance and runs on PyTorch 2.0+ and CUDA 12.6. Nari Labs also provides a Gradio-based demo and sample code, making it accessible for experimentation and rapid prototyping. The development roadmap includes quantized and CPU-friendly versions, which would further lower the barrier to entry. By providing a transparent and customizable alternative to proprietary TTS platforms, Dia gives users greater control over their voice synthesis workflows and data privacy.
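A rough local-setup sketch follows. The repository URL matches the public Nari Labs project, but the install and launch commands are assumptions; check the repository's README for the current instructions.

```shell
# Assumed quickstart; exact commands (installer, entry point) may differ
# from the repository's current README.
git clone https://github.com/nari-labs/dia.git
cd dia
pip install -e .     # or use the installer the README recommends
python app.py        # intended to launch the local Gradio demo
```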
Key features include:
- 1.6 billion parameter architecture for nuanced, lifelike speech synthesis
- Multi-speaker dialogue generation using simple speaker tags
- Support for non-verbal cues such as laughter, coughing, and throat clearing
- Audio conditioning for emotion, tone, and delivery style control
- Open-source Apache 2.0 license with accessible model weights and code
- Gradio demo and sample code for easy experimentation
- English language support with variable voice output per generation
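The speaker-tag convention in the list above can be sketched with a small helper. The `[S1]`/`[S2]` tag format follows the project's published dialogue examples; the `build_script` function itself is a hypothetical convenience, not part of Dia's API.

```python
# Hypothetical helper that assembles a Dia-style multi-speaker script.
# The [S1]/[S2] speaker-tag convention follows Nari Labs' published examples;
# the function itself is an illustrative sketch.
def build_script(turns: list[tuple[int, str]]) -> str:
    """Join (speaker_number, line) pairs into a single tagged transcript."""
    return " ".join(f"[S{speaker}] {line}" for speaker, line in turns)

script = build_script([
    (1, "Did you catch the launch today?"),
    (2, "I did! (laughs) It went better than expected."),
])
print(script)
# -> [S1] Did you catch the launch today? [S2] I did! (laughs) It went better than expected.
```

The resulting string would then be passed to the model as the input text, with an optional audio prompt or fixed seed to keep the voices consistent across generations.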