JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation

Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua

2026-02-26

Summary

This paper introduces JavisDiT++, a new system for creating videos with synchronized sound from just a text description, a process called joint audio-video generation.

What's the problem?

Currently, creating high-quality videos with matching audio from text is difficult. Existing open-source methods fall short of commercial systems like Veo3: they often struggle to produce realistic visuals, keep the audio and video precisely synchronized, and generate results that people actually find appealing.

What's the solution?

The researchers developed JavisDiT++, which rests on three key ideas. First, it uses a modality-specific mixture-of-experts design, meaning different expert modules in the network specialize in either audio or video, improving single-modal quality while still allowing the two modalities to interact. Second, it employs a technique called temporal-aligned RoPE, which gives audio and video tokens that occur at the same moment matching positional encodings, so each video frame is explicitly synchronized with its corresponding audio. Finally, the researchers used direct preference optimization to train the system toward outputs that people prefer along three dimensions: quality, consistency, and synchronization.
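To make the temporal-alignment idea concrete, here is a minimal sketch of how audio and video tokens could be mapped onto a shared temporal axis before applying rotary position embeddings. This is an illustrative assumption, not the paper's actual implementation: the function name, token rates, and the floor-based rescaling are all hypothetical.

```python
# Hypothetical sketch of the alignment idea behind temporal-aligned RoPE:
# audio and video tokens occurring at the same wall-clock time are given
# the same temporal position index, so the rotary embedding encodes
# audio-video synchrony explicitly rather than leaving the two token
# streams on unrelated position axes.

def temporal_position_ids(num_video_frames, video_fps,
                          num_audio_tokens, audio_tokens_per_sec):
    """Map each video frame and audio token to a shared temporal index,
    measured in video-frame units."""
    # Video frames simply take their own frame index.
    video_ids = list(range(num_video_frames))
    # Rescale each audio token's timestamp onto the video frame axis,
    # flooring so tokens within one frame's duration share its index.
    audio_ids = [
        int(t / audio_tokens_per_sec * video_fps)
        for t in range(num_audio_tokens)
    ]
    return video_ids, audio_ids
```

For example, with 8 video frames at 4 fps and 16 audio tokens at 8 tokens/sec, each video frame shares its temporal index with exactly two audio tokens, so attention layers using these positions can match sound events to the frames in which they occur.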

Why it matters?

This work is important because it significantly improves the quality of open-source tools for generating audio and video from text. JavisDiT++ achieves better results than previous open-source methods with a relatively small amount of training data, making it more accessible for researchers and developers to build upon and create their own AI-powered video generation applications.

Abstract

AIGC has rapidly expanded from text-to-image generation toward high-quality multimodal synthesis across video and audio. Within this context, joint audio-video generation (JAVG) has emerged as a fundamental task that produces synchronized and semantically aligned sound and vision from textual descriptions. However, compared with advanced commercial models such as Veo3, existing open-source methods still suffer from limitations in generation quality, temporal synchrony, and alignment with human preferences. To bridge the gap, this paper presents JavisDiT++, a concise yet powerful framework for unified modeling and optimization of JAVG. First, we introduce a modality-specific mixture-of-experts (MS-MoE) design that enables cross-modal interaction efficacy while enhancing single-modal generation quality. Then, we propose a temporal-aligned RoPE (TA-RoPE) strategy to achieve explicit, frame-level synchronization between audio and video tokens. Besides, we develop an audio-video direct preference optimization (AV-DPO) method to align model outputs with human preference across quality, consistency, and synchrony dimensions. Built upon Wan2.1-1.3B-T2V, our model achieves state-of-the-art performance merely with around 1M public training entries, significantly outperforming prior approaches in both qualitative and quantitative evaluations. Comprehensive ablation studies have been conducted to validate the effectiveness of our proposed modules. All the code, model, and dataset are released at https://JavisVerse.github.io/JavisDiT2-page.