Key Features

Chain-of-Thought (CoT) reasoning for audio generation and editing
Stepwise, interactive audio generation and editing for videos (see the sketch after this list)
Foundational foley generation for semantically coherent soundscapes
Interactive object-centric refinement through precise user interactions
Targeted editing guided by natural language instructions
Multimodal large language model for contextually aligned CoT reasoning
Unified audio foundation model for audio generation and editing
AudioCoT dataset with structured reasoning annotations
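
The first several features above describe a stepwise workflow: a foundational foley pass over the whole clip, object-centric refinement driven by user interactions, and targeted edits from natural-language instructions, each guided by Chain-of-Thought reasoning from a multimodal LLM. The sketch below illustrates that flow; every function name in it is a hypothetical stand-in, not ThinkSound's actual API.

```python
# A minimal, hypothetical sketch of the three-stage interactive workflow.
# reason_about, generate_audio, and edit_audio are illustrative stand-ins,
# not functions from the ThinkSound codebase.

def reason_about(context: str) -> str:
    """Stand-in for the multimodal LLM that produces a Chain-of-Thought plan."""
    return f"CoT plan: {context}"

def generate_audio(video: str, cot: str) -> str:
    """Stand-in for the unified audio foundation model in generation mode."""
    return f"soundtrack for {video} guided by [{cot}]"

def edit_audio(audio: str, cot: str) -> str:
    """Stand-in for the unified audio foundation model in editing mode."""
    return f"{audio}, revised by [{cot}]"

# Stage 1: foundational foley for a semantically coherent base soundscape.
audio = generate_audio("clip.mp4", reason_about("overall soundscape for clip.mp4"))

# Stage 2: object-centric refinement after the user selects a region of interest.
audio = edit_audio(audio, reason_about("refine the sound of the clicked object"))

# Stage 3: targeted editing from a natural-language instruction.
audio = edit_audio(audio, reason_about("make the rain softer and add distant thunder"))
print(audio)
```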

ThinkSound achieves state-of-the-art performance in video-to-audio generation on both audio metrics and CoT metrics, and excels on the out-of-distribution Movie Gen Audio benchmark. It outperforms all baselines on most objective metrics and on all subjective metrics, with substantial gains in audio quality and semantic alignment over the strongest baseline. ThinkSound also introduces AudioCoT, a comprehensive dataset of structured reasoning annotations that links visual content, textual descriptions, and sound synthesis.
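
To make "structured reasoning annotations" concrete, a record in such a dataset might pair a clip and caption with a reasoning chain and the audio events it justifies. The example below is purely illustrative; the field names and contents are assumptions, not AudioCoT's actual schema.

```python
# Hypothetical AudioCoT-style annotation; schema and values are illustrative only.
annotation = {
    "video": "clip_00123.mp4",
    "caption": "A dog splashes through a shallow stream.",
    "chain_of_thought": [
        "Each paw strike hits water, so every step should produce a splash.",
        "The stream is shallow and fast, implying continuous light babbling.",
        "No people or vehicles are visible, so the background should stay natural and quiet.",
    ],
    "audio_events": [
        {"label": "water splash", "start": 1.2, "end": 1.6},
        {"label": "stream babbling", "start": 0.0, "end": 10.0},
    ],
}
```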


Comprehensive ablation studies examine the contribution of each component in the framework and validate its design choices. Focusing on text encoding strategies and multi-modal integration mechanisms, the studies show that CoT reasoning substantially improves audio fidelity, and that combining contrastive features from CLIP with contextual reasoning features from T5 improves performance further. A comparison of three model sizes shows that the Large model performs best across all metrics.
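
One plausible way to combine the two text representations the ablations point to is to project CLIP's global contrastive embedding and T5's per-token contextual embeddings into a shared conditioning space and concatenate them. The PyTorch sketch below does exactly that with assumed dimensions; it is an illustration of the idea, not ThinkSound's published architecture.

```python
import torch
import torch.nn as nn

class TextConditioner(nn.Module):
    """Fuses a global contrastive embedding with per-token contextual embeddings.

    Dimensions (clip_dim, t5_dim, cond_dim) are assumptions for illustration.
    """

    def __init__(self, clip_dim: int = 512, t5_dim: int = 768, cond_dim: int = 1024):
        super().__init__()
        self.proj_clip = nn.Linear(clip_dim, cond_dim)  # CLIP-style global embedding
        self.proj_t5 = nn.Linear(t5_dim, cond_dim)      # T5-style per-token embeddings

    def forward(self, clip_emb: torch.Tensor, t5_tokens: torch.Tensor) -> torch.Tensor:
        # clip_emb: (batch, clip_dim); t5_tokens: (batch, seq_len, t5_dim)
        global_feat = self.proj_clip(clip_emb).unsqueeze(1)   # (batch, 1, cond_dim)
        context_feat = self.proj_t5(t5_tokens)                # (batch, seq_len, cond_dim)
        return torch.cat([global_feat, context_feat], dim=1)  # conditioning sequence

# Toy usage with random tensors standing in for real encoder outputs.
cond = TextConditioner()(torch.randn(2, 512), torch.randn(2, 16, 768))
print(cond.shape)  # torch.Size([2, 17, 1024])
```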
