Key Features

High-quality 5B-parameter audio model with twin-backbone architecture
Precise lip synchronization without explicit face bounding boxes
Supports realistic multi-speaker and multi-turn video conversations
Generates synchronized background music and sound effects
Open-source release of pretrained models and inference code

Ovi natively supports multiple speakers and multi-turn conversations, making it possible to generate complex, realistic dialogue scenes in video. Beyond lip-syncing, it produces background music and sound effects that are synchronized with the on-screen action. The project is aimed at the research and open-source communities: the full pretrained model weights and inference code are released for further development and application.


Ovi’s demonstration clips, resized to 480p to save storage, use reference images that are either drawn from public-domain sources or AI-generated. The developers invite anyone with concerns about the imagery to contact them. As a research project, Ovi explores the joint generation of audio and video to support new multimedia generation workflows.
