
JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Rongxin Jiang, Jiebo Luo, Hao Fei, Tat-Seng Chua

2025-04-04


Summary

This paper introduces an AI model that generates synchronized audio and video together from a text prompt, producing videos that both look and sound realistic.

What's the problem?

It is hard to make AI generate audio and video simultaneously while keeping the two perfectly synchronized, especially in complex real-world scenes with multiple events and sound sources.

What's the solution?

The researchers built a model called JavisDiT that uses a hierarchical spatio-temporal prior to capture how audio and video relate to each other in space and time. This prior guides the model to generate synchronized audio and video from a simple text description.

Why does it matter?

This work matters because it can lead to better AI tools for creating videos with realistic sound, which could be used in entertainment, education, and other fields.

Abstract

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer designed for synchronized audio-video generation (JAVG). Built upon the powerful Diffusion Transformer (DiT) architecture, JavisDiT is able to generate high-quality audio and video content simultaneously from open-ended user prompts. To ensure optimal synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts both global and fine-grained spatio-temporal priors, guiding the synchronization between the visual and auditory components. Furthermore, we propose a new benchmark, JavisBench, consisting of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios. We also devise a robust metric for evaluating the synchronization between generated audio-video pairs on complex real-world content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods by ensuring both high-quality generation and precise synchronization, setting a new standard for JAVG tasks. Our code, model, and dataset will be made publicly available at https://javisdit.github.io/.
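The abstract describes the core mechanism only at a high level: a shared hierarchical prior (global plus fine-grained) conditions both the video and audio diffusion branches so their denoising trajectories stay aligned. As a rough, hypothetical NumPy sketch (the function names, shapes, and update rule below are illustrative assumptions, not the paper's actual implementation), the idea can be shown with toy arrays:

```python
import numpy as np

def hist_sypo_prior(text_emb, num_steps):
    """Toy stand-in for the HiST-Sypo estimator (hypothetical): derive a
    global prior (one vector for the whole clip) and a fine-grained
    per-timestep prior from a text embedding."""
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((text_emb.shape[-1], text_emb.shape[-1]))
    global_prior = np.tanh(text_emb @ proj)                      # shape (d,)
    # Fine-grained prior: one time-modulated copy per timestep, shape (T, d).
    phases = np.linspace(0.0, 1.0, num_steps)[:, None]
    fine_prior = global_prior[None, :] * np.cos(np.pi * phases)
    return global_prior, fine_prior

def joint_denoise_step(video_latent, audio_latent, fine_prior):
    """One toy 'synchronized' update: both modalities are nudged by the
    same per-timestep prior, so their trajectories stay aligned in time."""
    video_latent = video_latent + 0.1 * fine_prior
    audio_latent = audio_latent + 0.1 * fine_prior
    return video_latent, audio_latent

# Toy usage: 8 timesteps, 16-dim latents per modality.
text_emb = np.ones(16)
_, fine = hist_sypo_prior(text_emb, num_steps=8)
video = np.zeros((8, 16))
audio = np.zeros((8, 16))
video, audio = joint_denoise_step(video, audio, fine)
print(video.shape, audio.shape)  # (8, 16) (8, 16)
```

The point of the sketch is only that a single prior, indexed by time, conditions both branches; in the real model this conditioning would happen inside DiT attention blocks rather than by simple addition.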