
Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction

Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, Meng Cao

2025-10-10


Summary

This research tackles the problem of creating videos with synchronized sound directly from text descriptions, a process called Text-to-Sounding-Video (T2SV) generation.

What's the problem?

Generating videos from text and ensuring the audio perfectly matches both the text *and* the video is really hard. Current methods often struggle because they use the same text description for both the video and audio, which can confuse the system. It's like trying to describe a scene and the soundtrack with the same few words – things get mixed up! Also, it's not clear how the video and audio information should best interact with each other during the creation process to stay synchronized.

What's the solution?

The researchers developed a two-part system. First, they built a captioning framework (called Hierarchical Visual-Grounded Captioning, or HVGC) that generates *two* separate text descriptions: one specifically for the video content and another for the audio. This avoids the confusion caused by a single shared description. Then, they built a new model called BridgeDiT, a dual-tower diffusion transformer that uses a mechanism called Dual CrossAttention. This lets the video and audio sides 'talk' to each other in both directions, keeping them in sync both in terms of what's happening and when it's happening, so the result is a seamless video with matching sound (see the code sketch below).
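To make the 'back and forth' idea more concrete, here is a minimal PyTorch sketch of what a symmetric, bidirectional cross-attention block could look like. The class name, dimensions, and layer arrangement are illustrative assumptions for this explainer, not the paper's released implementation.

```python
# A minimal sketch of a "Dual CrossAttention"-style block, assuming a
# dual-tower setup where video and audio token sequences attend to each
# other symmetrically. Names and sizes are illustrative, not from the paper.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Video tokens query audio tokens, and vice versa (bidirectional).
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (batch, num_video_tokens, dim)
        # audio_tokens: (batch, num_audio_tokens, dim)
        v_norm, a_norm = self.norm_v(video_tokens), self.norm_a(audio_tokens)
        # Each stream queries the other, then adds the result residually,
        # so information flows symmetrically in both directions.
        v_update, _ = self.video_to_audio(query=v_norm, key=a_norm, value=a_norm)
        a_update, _ = self.audio_to_video(query=a_norm, key=v_norm, value=v_norm)
        return video_tokens + v_update, audio_tokens + a_update

# Usage: fuse features from the two towers at a given transformer block.
dca = DualCrossAttention(dim=512, num_heads=8)
video = torch.randn(2, 64, 512)   # e.g. 64 spatio-temporal video tokens
audio = torch.randn(2, 32, 512)   # e.g. 32 audio latent tokens
video, audio = dca(video, audio)
```

The key design point is the symmetry: video queries audio and audio queries video inside the same block, so neither modality dominates the exchange and both semantic and temporal cues can flow either way.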

Why it matters?

This work is important because it significantly improves the quality of videos generated from text. By addressing text interference and cross-modal synchronization, the researchers have built a system that produces more realistic and coherent videos with well-synchronized audio, bringing us closer to automatically creating sounding videos from just a text description.

Abstract

This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions, meanwhile ensuring both modalities are aligned with text. Despite progress in joint audio-video training, two critical challenges still remain unaddressed: (1) a single, shared text caption where the text for video is equal to the text for audio often creates modal interference, confusing the pretrained backbones, and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework that generates pairs of disentangled captions, a video caption, and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer, which employs a Dual CrossAttention (DCA) mechanism that acts as a robust "bridge" to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for the future T2SV task. All code and checkpoints will be publicly released.
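The disentangled-conditioning idea can also be sketched at a high level. The snippet below only illustrates splitting the conditioning into a separate video caption and audio caption; `caption_model` and `split_prompt` are hypothetical stand-ins for illustration, not the actual HVGC pipeline, which grounds its captions hierarchically on visual content.

```python
# A minimal sketch of the conditioning-stage idea: give each tower its own
# modality-specific caption instead of one shared text prompt.
# `caption_model` is a hypothetical callable (e.g. an LLM or captioner);
# this is NOT the paper's HVGC implementation.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DisentangledCaptions:
    video_caption: str  # describes only what is seen
    audio_caption: str  # describes only what is heard

def split_prompt(prompt: str, caption_model: Callable[[str], str]) -> DisentangledCaptions:
    """Produce separate video and audio captions from one prompt (illustrative)."""
    video_caption = caption_model(f"Describe only the visual content: {prompt}")
    audio_caption = caption_model(f"Describe only the sounds: {prompt}")
    return DisentangledCaptions(video_caption, audio_caption)

# Each diffusion tower would then be conditioned on its own caption,
# which is what removes the modal interference described in the abstract.
```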