ThinkSound achieves state-of-the-art performance in video-to-audio generation on both audio metrics and CoT metrics, and generalizes well to the out-of-distribution Movie Gen Audio benchmark. The framework outperforms all baselines on most objective metrics and on all subjective metrics, with substantial gains in audio quality and semantic alignment over the strongest baseline. ThinkSound also introduces AudioCoT, a comprehensive dataset of structured reasoning annotations that links visual content, textual descriptions, and sound synthesis.
ThinkSound's design choices are validated through comprehensive ablation studies that isolate the contribution of each component in the framework. Focusing on text encoding strategies and multi-modal integration mechanisms, the studies show that chain-of-thought (CoT) reasoning substantially improves audio fidelity, and that integrating contrastive features from CLIP with contextual reasoning features from T5 yields further gains. The studies also compare three model sizes and find that the Large variant achieves the best performance across all metrics.
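One way to picture the CLIP/T5 integration described above is to project both feature streams into a shared width and let the audio generator attend over the combined sequence. The sketch below is a minimal, hypothetical illustration in NumPy: the dimensions (512 for CLIP, 1024 for T5, 768 shared) and the prepend-style fusion are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature dimensions (not specified by the summary above):
# CLIP gives a pooled contrastive embedding per caption; T5 gives
# per-token contextual embeddings of the CoT reasoning text.
d_clip, d_t5, d_model = 512, 1024, 768
n_tokens = 16

clip_feat = rng.standard_normal(d_clip)            # (d_clip,) pooled CLIP embedding
t5_feat = rng.standard_normal((n_tokens, d_t5))    # (n_tokens, d_t5) T5 sequence

# Learnable projections (random placeholders here) map both
# streams to a shared model width.
W_clip = rng.standard_normal((d_clip, d_model)) / np.sqrt(d_clip)
W_t5 = rng.standard_normal((d_t5, d_model)) / np.sqrt(d_t5)

clip_tok = (clip_feat @ W_clip)[None, :]           # (1, d_model): one global token
t5_tok = t5_feat @ W_t5                            # (n_tokens, d_model)

# Simple integration: prepend the CLIP token to the T5 sequence so a
# downstream generator can attend to both global semantics and CoT context.
fused = np.concatenate([clip_tok, t5_tok], axis=0)
print(fused.shape)  # (17, 768)
```

The design intuition matching the ablation result: the contrastive CLIP token anchors global audio-visual semantics, while the T5 tokens carry the step-by-step reasoning, and exposing both to the generator improves alignment over either stream alone.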