CutClaw: Agentic Hours-Long Video Editing via Music Synchronization
Shifang Zhao, Yihan Hu, Ying Shan, Yunchao Wei, Xiaodong Cun
2026-04-01
Summary
This paper introduces CutClaw, a new system that automatically edits long videos into shorter, more engaging clips with music synchronization, essentially acting as an AI video editor.
What's the problem?
Editing videos, especially long ones, is a time-consuming and repetitive task for video creators and filmmakers. Finding the best parts, adding music that fits, and making everything look good takes a lot of effort, and the process isn't easily sped up.
What's the solution?
CutClaw uses a team of AI 'agents' powered by Multimodal Language Models, which are good at understanding both images and text. First, it breaks down the video and audio to understand the content. Then, a 'Playwriter' agent plans the overall story and how the music should fit. Finally, 'Editor' and 'Reviewer' agents work together to pick the best visual clips and refine the final video based on what looks and feels right, ensuring the video matches the music and tells a coherent story.
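The agent workflow described above (plan sections from the music, pick clips to fill them, review the picks) can be sketched roughly as follows. This is a minimal illustration, not CutClaw's actual API: the class and function names, the per-clip aesthetic score, and the review threshold are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    """A candidate segment of the raw footage (times in seconds)."""
    start: float
    end: float
    description: str
    aesthetic_score: float  # 0..1, e.g. assigned by an MLLM judge (assumed)

def plan_sections(beat_times):
    """Playwriter step (simplified): one narrative section per musical span,
    delimited by consecutive beat/shift times."""
    return list(zip(beat_times, beat_times[1:]))

def edit_and_review(sections, clips, min_score=0.5):
    """Editor picks the best-scoring clip long enough to fill each section;
    Reviewer rejects picks below an aesthetic threshold."""
    final_cut = []
    for sec_start, sec_end in sections:
        target = sec_end - sec_start
        candidates = [c for c in clips if (c.end - c.start) >= target]
        if not candidates:
            continue  # no footage fits this section; leave it out
        best = max(candidates, key=lambda c: c.aesthetic_score)
        if best.aesthetic_score >= min_score:  # Reviewer's acceptance check
            final_cut.append((best, target))
    return final_cut

# Two musical sections of 2 s each; the high-scoring clip fills both,
# the short low-scoring clip is never eligible.
clips = [Clip(0.0, 5.0, "sunrise", 0.9), Clip(10.0, 11.0, "crowd", 0.3)]
cut = edit_and_review(plan_sections([0.0, 2.0, 4.0]), clips)
```

In the real system each of these steps is an MLLM-driven agent rather than a scoring rule, but the control flow (plan, select, review) is the same shape.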
Why it matters?
This is important because it could significantly speed up the video creation process, allowing filmmakers and content creators to produce more videos in less time. It also means people without professional editing skills could easily create high-quality videos, opening up creative possibilities for more people.
Abstract
Editing video content in alignment with audio has become a distinctive form of human-made art on today's social media. However, the time-consuming and repetitive nature of manual video editing has long been a challenge for filmmakers and professional content creators alike. In this paper, we introduce CutClaw, an autonomous multi-agent framework that leverages multiple Multimodal Language Models (MLLMs) to edit hours-long raw footage into meaningful short videos. It produces videos with synchronized music that follow user instructions and have a visually appealing appearance. In detail, our approach begins with a hierarchical multimodal decomposition that captures both fine-grained details and global structures across the visual and audio footage. Then, to ensure narrative consistency, a Playwriter Agent orchestrates the overall storytelling flow and structures the long-term narrative, anchoring visual scenes to musical shifts. Finally, to construct the short edited video, Editor and Reviewer Agents collaboratively optimize the final cut by selecting fine-grained visual content based on rigorous aesthetic and semantic criteria. Detailed experiments demonstrate that CutClaw significantly outperforms state-of-the-art baselines in generating high-quality, rhythm-aligned videos. The code is available at: https://github.com/GVCLab/CutClaw.
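One concrete piece of the "anchoring visual scenes to musical shifts" idea is snapping scene-change times onto nearby beats. The tiny routine below is an illustrative sketch only; the function name, the `max_shift` tolerance, and the snapping rule are assumptions, not the paper's implementation.

```python
def anchor_to_beats(scene_cuts, beat_times, max_shift=0.5):
    """Snap each detected scene-change time (seconds) to the nearest
    musical beat, but only if the move is small enough to go unnoticed."""
    anchored = []
    for t in scene_cuts:
        nearest = min(beat_times, key=lambda b: abs(b - t))
        anchored.append(nearest if abs(nearest - t) <= max_shift else t)
    return anchored

# Cuts at 1.1 s and 3.9 s snap to beats at 1.0 s and 4.0 s;
# the cut at 7.0 s is too far from any beat and stays put.
print(anchor_to_beats([1.1, 3.9, 7.0], [1.0, 4.0, 6.0]))  # [1.0, 4.0, 7.0]
```

A full system would obtain `beat_times` from music analysis (e.g. onset/beat tracking) rather than hard-coding them, and would likely adjust clip boundaries jointly rather than one cut at a time.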