MOVA: Towards Scalable and Synchronized Video-Audio Generation
SII-OpenMOSS Team, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang
2026-02-10
Summary
This paper introduces MOVA, a new open-source model that can create videos *with* realistic and synchronized audio, all at once.
What's the problem?
Currently, making videos with good audio is tricky. Most systems generate the video first and add the audio in a separate step, which is expensive, lets errors accumulate across stages, and often degrades the final result. Some newer systems can generate both at once, but they are closed-source, making it hard for researchers and creators to build on that work and improve it. Generating audio and video together also raises technical challenges in how the model is built, the data it is trained on, and the training process itself.
What's the solution?
The researchers developed MOVA, a Mixture-of-Experts (MoE) model with 32 billion parameters in total. The MoE design routes each input to a subset of specialized experts suited to different kinds of audio-visual content, so only 18 billion parameters are active at a time, keeping inference cheaper than the full parameter count suggests. MOVA takes an image and a text prompt as input and generates a video with matching audio, including realistic lip-synced speech, sound effects that fit the environment shown in the video, and music that complements the scene. Importantly, the model weights and code are being released so others can use and improve them. A rough sketch of what this image-plus-text-to-video-plus-audio (IT2VA) interface looks like follows below.
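For intuition, here is a minimal, hypothetical sketch of the IT2VA task's inputs and outputs. The type names, fields, and default values below (`IT2VARequest`, `IT2VAResult`, `num_frames=121`, and so on) are illustrative assumptions for exposition, not the actual interface of the released MOVA codebase.

```python
# Hypothetical sketch of the IT2VA (Image-Text to Video-Audio) task: one request maps
# an image plus a text prompt to a jointly generated video track and audio track.
# Names and defaults are illustrative assumptions, not the released MOVA API.

from dataclasses import dataclass
import numpy as np


@dataclass
class IT2VARequest:
    reference_image: np.ndarray   # conditioning frame, H x W x 3
    prompt: str                   # text describing both the visuals and the sound
    num_frames: int = 121         # e.g. ~5 s at 24 fps (assumed setting)
    fps: int = 24
    audio_sample_rate: int = 44100


@dataclass
class IT2VAResult:
    video: np.ndarray             # num_frames x H x W x 3
    audio: np.ndarray             # mono waveform aligned to the video timeline


def expected_audio_samples(req: IT2VARequest) -> int:
    # Because video and audio are generated jointly, the audio length is tied to the
    # video duration rather than being produced by a separate dubbing stage.
    return int(req.num_frames / req.fps * req.audio_sample_rate)
```

The point of the sketch is the single joint call: there is no separate "generate video, then dub audio" cascade, which is exactly the error-accumulating pipeline the paper argues against.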
Why it matters?
MOVA is important because it's an open-source system that allows anyone to generate high-quality audio-visual content. By making the model publicly available, the researchers hope to encourage further research and innovation in this field, and empower a wider range of creators to produce compelling videos with synchronized audio.
Abstract
Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
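The LoRA fine-tuning support mentioned above refers to the standard low-rank adaptation technique. As a generic illustration only, and not the fine-tuning entry point of the released MOVA codebase, the sketch below shows the frozen-weight-plus-low-rank-update form that LoRA applies to a linear projection; the class name `LoRALinear` and the rank/alpha values are assumptions chosen for exposition.

```python
# Minimal, generic sketch of the LoRA update on a frozen linear layer:
#   y = W x + (alpha / r) * B A x
# This illustrates the general technique, not MOVA's actual fine-tuning code.

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # pretrained weight stays frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)            # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base projection plus the trainable low-rank correction.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


# Usage: wrap a projection layer, then optimize only the adapter parameters.
layer = LoRALinear(nn.Linear(1024, 1024))
trainable = [p for p in layer.parameters() if p.requires_grad]
```

Because only the small adapter matrices are trained, this kind of fine-tuning can specialize a large model like MOVA without updating, or storing gradients for, its full set of weights.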