ThinkSound achieves state-of-the-art performance in video-to-audio generation on both audio metrics and CoT metrics, and generalizes well to the out-of-distribution Movie Gen Audio benchmark. The framework outperforms all baselines on most objective metrics and on all subjective metrics, with substantial gains in audio quality and semantic alignment over the strongest baseline. ThinkSound also introduces AudioCoT, a comprehensive dataset of structured reasoning annotations that links visual content, textual descriptions, and sound synthesis.
ThinkSound's design choices are validated through comprehensive ablation studies that isolate the contribution of each component in the framework. Focusing on text encoding strategies and multi-modal integration mechanisms, the studies show that chain-of-thought (CoT) reasoning substantially improves audio fidelity, and that integrating contrastive features from CLIP with contextual reasoning features from T5 yields further gains. The studies also compare three model sizes and find that the Large variant achieves the best performance across all metrics.
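One way to picture the CLIP/T5 integration described above is to project both feature streams into a shared width and let the audio generator attend over the combined sequence. The sketch below is a minimal, hypothetical illustration in NumPy: the dimensions (512 for CLIP, 1024 for T5, 768 shared) and the prepend-style fusion are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature dimensions (not specified by the summary above):
# CLIP gives a pooled contrastive embedding per caption; T5 gives
# per-token contextual embeddings of the CoT reasoning text.
d_clip, d_t5, d_model = 512, 1024, 768
n_tokens = 16

clip_feat = rng.standard_normal(d_clip)            # (d_clip,) pooled CLIP embedding
t5_feat = rng.standard_normal((n_tokens, d_t5))    # (n_tokens, d_t5) T5 sequence

# Learnable projections (random placeholders here) map both
# streams to a shared model width.
W_clip = rng.standard_normal((d_clip, d_model)) / np.sqrt(d_clip)
W_t5 = rng.standard_normal((d_t5, d_model)) / np.sqrt(d_t5)

clip_tok = (clip_feat @ W_clip)[None, :]           # (1, d_model): one global token
t5_tok = t5_feat @ W_t5                            # (n_tokens, d_model)

# Simple integration: prepend the CLIP token to the T5 sequence so a
# downstream generator can attend to both global semantics and CoT context.
fused = np.concatenate([clip_tok, t5_tok], axis=0)
print(fused.shape)  # (17, 768)
```

The design intuition matching the ablation result: the contrastive CLIP token anchors global audio-visual semantics, while the T5 tokens carry the step-by-step reasoning, and exposing both to the generator improves alignment over either stream alone.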