
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing

Huadai Liu, Jialei Wang, Kaicheng Luo, Wen Wang, Qian Chen, Zhou Zhao, Wei Xue

2025-07-01

Summary

This paper introduces ThinkSound, a method that uses Chain-of-Thought (CoT) reasoning to improve how models generate and edit sound for videos. It combines information from multiple sources, including video, text, and audio, to create detailed, realistic sounds that match what is happening on screen.

What's the problem?

The problem is that generating high-quality audio that matches both the content and the timing of a video scene is very hard. Current methods struggle because realistic sound generation requires understanding how sounds relate to the objects in the video and how they change over time.

What's the solution?

The paper presents ThinkSound, which breaks audio generation into three steps: first, it creates a basic soundscape for the whole video; then, it lets users click on specific objects in the video to refine their sounds; finally, users can give natural-language instructions to edit the audio further. At each step, a multimodal large language model uses Chain-of-Thought reasoning to explain what it is doing and to guide the sound creation, making the process more accurate and interactive.
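The three-step workflow above can be sketched as a simple pipeline. This is a minimal illustration, not the paper's actual implementation: every class and function name here is a hypothetical stand-in, and the CoT strings stand in for reasoning traces that a multimodal LLM would produce in the real system.

```python
from dataclasses import dataclass, field

# Hedged sketch of ThinkSound's three-stage interactive pipeline.
# All names below are illustrative inventions, not the paper's API.

@dataclass
class AudioTrack:
    description: str               # what the overall soundscape contains
    edits: list = field(default_factory=list)  # history of refinements

def generate_foley(video: str, cot: str) -> AudioTrack:
    """Stage 1: create a basic soundscape for the whole video,
    guided by a chain-of-thought description of the scene."""
    return AudioTrack(description=f"soundscape for {video} guided by: {cot}")

def refine_object(track: AudioTrack, obj: str, cot: str) -> AudioTrack:
    """Stage 2: refine the sound of an object the user clicked on."""
    track.edits.append(f"refined {obj} ({cot})")
    return track

def edit_with_instruction(track: AudioTrack, instruction: str, cot: str) -> AudioTrack:
    """Stage 3: apply a free-form natural-language edit."""
    track.edits.append(f"edit: {instruction} ({cot})")
    return track

# Example walk-through of the three stages (inputs are made up).
track = generate_foley("owl_video.mp4", cot="owl hoots, wind in trees")
track = refine_object(track, "owl", cot="emphasize hoot timing at 0:03")
track = edit_with_instruction(track, "remove wind noise", cot="suppress broadband noise")
print(len(track.edits))  # → 2 interactive refinements recorded
```

The key design point the sketch captures is that each stage consumes a reasoning trace rather than a raw label, so the audio model is conditioned on an explanation of *why* a sound belongs there, not just *what* sound to make.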

Why it matters?

This matters because ThinkSound produces audio that matches video much more closely, which is useful for film, games, and other multimedia. Its step-by-step reasoning makes sound generation both smarter and easier for users to control.

Abstract

ThinkSound uses Chain-of-Thought reasoning to enhance video-to-audio generation through a multimodal large language model and a unified audio foundation model, achieving top performance in audio metrics and CoT metrics.