VABench: A Comprehensive Benchmark for Audio-Video Generation
Daili Hua, Xizhi Wang, Bohan Zeng, Xinyi Huang, Hao Liang, Junbo Niu, Xinlong Chen, Quanqing Xu, Wentao Zhang
2025-12-18
Summary
This paper introduces a benchmark for thoroughly testing how well AI models generate videos *with* matching sound, not just visually impressive videos.
What's the problem?
We have good ways to judge how realistic a generated video *looks*, but few reliable tests of whether the accompanying sound matches what happens on screen or what the video is supposed to depict. Existing benchmarks rarely check whether the audio and video are properly synchronized or make sense together.
What's the solution?
The researchers created a benchmark called VABench. It tests AI models on three tasks: creating a video and sound from text, creating a video and sound from an image, and generating video with stereo (two-channel) audio. VABench evaluates 15 different dimensions, such as how well the sound matches the video, whether lip movements match the speech, and whether the generated content holds up under curated audio and video question-answering (QA) checks. These evaluations span seven categories of content, such as animals, music, and complex scenes.
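To give a concrete sense of the pairwise-similarity dimensions, here is a minimal sketch of how text-video, text-audio, and video-audio alignment could be scored. It assumes each modality has already been embedded into a shared space by pretrained encoders (e.g., CLIP-style for text and video, CLAP-style for text and audio); the random vectors stand in for real embeddings, and the averaging scheme is illustrative, not necessarily the paper's exact metric.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pairwise_alignment_score(text_emb: np.ndarray,
                             video_emb: np.ndarray,
                             audio_emb: np.ndarray) -> float:
    """Average the three pairwise similarities VABench names:
    text-video, text-audio, and video-audio."""
    return float(np.mean([
        cosine_similarity(text_emb, video_emb),
        cosine_similarity(text_emb, audio_emb),
        cosine_similarity(video_emb, audio_emb),
    ]))

# Toy usage: random vectors stand in for real encoder outputs.
rng = np.random.default_rng(0)
text_emb, video_emb, audio_emb = (rng.normal(size=512) for _ in range(3))
print(f"alignment score: {pairwise_alignment_score(text_emb, video_emb, audio_emb):.3f}")
```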
Why it matters?
This benchmark matters because it provides a standard way to measure how well AI models create realistic, synchronized audio-video content, giving researchers a common yardstick for improving these models and pushing the field toward more believable and useful AI-generated videos with sound.
Abstract
Recent advances in video generation have been remarkable, enabling models to produce visually compelling videos with synchronized audio. While existing video generation benchmarks provide comprehensive metrics for visual quality, they lack convincing evaluations for audio-video generation, especially for models aiming to generate synchronized audio-video outputs. To address this gap, we introduce VABench, a comprehensive and multi-dimensional benchmark framework designed to systematically evaluate the capabilities of synchronous audio-video generation. VABench encompasses three primary task types: text-to-audio-video (T2AV), image-to-audio-video (I2AV), and stereo audio-video generation. It further establishes two major evaluation modules covering 15 dimensions. These dimensions specifically assess pairwise similarities (text-video, text-audio, video-audio), audio-video synchronization, lip-speech consistency, and carefully curated audio and video question-answering (QA) pairs, among others. Furthermore, VABench covers seven major content categories: animals, human sounds, music, environmental sounds, synchronous physical sounds, complex scenes, and virtual worlds. We provide a systematic analysis and visualization of the evaluation results, aiming to establish a new standard for assessing video generation models with synchronous audio capabilities and to promote the comprehensive advancement of the field.
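The abstract names audio-video synchronization as one of the 15 dimensions but does not describe the metric (lip-speech consistency, for instance, is often scored with dedicated models such as SyncNet). As a simplified illustration only, the sketch below estimates a global audio-video offset by cross-correlating an audio onset envelope with a visual motion envelope, both assumed to be sampled at the video frame rate, and scores the pair as synchronized if the offset falls inside a tolerance window.

```python
import numpy as np

def estimate_av_offset(audio_env: np.ndarray, motion_env: np.ndarray) -> int:
    """Lag (in frames) that maximizes cross-correlation between an audio
    onset envelope and a visual motion envelope; positive means the audio
    lags behind the video."""
    a = audio_env - audio_env.mean()
    m = motion_env - motion_env.mean()
    corr = np.correlate(a, m, mode="full")
    lags = np.arange(-len(m) + 1, len(a))  # lags covered by 'full' mode
    return int(lags[np.argmax(corr)])

def sync_score(offset_frames: int, fps: float = 24.0,
               tolerance_s: float = 0.1) -> float:
    """1.0 if the estimated offset is within the tolerance window, else 0.0."""
    return float(abs(offset_frames) / fps <= tolerance_s)

# Toy usage: a visual "impact" at frame 40 whose sound arrives 2 frames late.
motion = np.zeros(120); motion[40] = 1.0
audio = np.zeros(120);  audio[42] = 1.0
offset = estimate_av_offset(audio, motion)
print(offset, sync_score(offset))  # -> 2 and 1.0 (2/24 s is within 0.1 s)
```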