VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing

Xiangpeng Yang, Linchao Zhu, Hehe Fan, Yi Yang

2025-02-25

Summary

This paper introduces VideoGrain, a new AI method that makes it easier to edit videos at very specific levels of detail, from swapping entire objects to tweaking small parts, all without retraining the AI for each new task.

What's the problem?

Current AI video editing tools struggle to make precise changes at different levels of detail, whether that means an entire class of objects, a single object instance, or just one part of an object. They often fail to match text instructions to the right regions of the video, and they tend to blur together features that should stay separate when making changes.

What's the solution?

The researchers created VideoGrain, which adjusts how the AI distributes its attention across space and time in the video. It strengthens the AI's focus on the regions that match each editing instruction and weakens its attention to unrelated areas. This helps the AI make more accurate changes and keeps different parts of the video from interfering with each other during editing, all without any additional training.
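The core idea of steering attention toward the right regions can be illustrated with a small sketch. This is not the paper's actual formulation; the mask construction and the `boost`/`suppress` values here are hypothetical, chosen only to show how shifting attention logits with a region mask redirects attention:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def modulated_cross_attention(scores, region_mask, boost=2.0, suppress=-2.0):
    """Bias cross-attention so each pixel attends to the text token
    responsible for its region.

    scores:      (num_pixels, num_tokens) raw attention logits
    region_mask: (num_pixels, num_tokens) binary map; 1 where the pixel
                 belongs to the region that a text token should edit
    Returns the modulated attention weights (rows sum to 1).
    """
    # Raise logits inside each token's target region, lower them elsewhere.
    shifted = scores + np.where(region_mask == 1, boost, suppress)
    return softmax(shifted, axis=-1)

# Toy example: 4 pixels, 2 text tokens ("cat" edits pixels 0-1,
# "dog" edits pixels 2-3). Uniform raw scores become region-focused.
scores = np.zeros((4, 2))
mask = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
attn = modulated_cross_attention(scores, mask)
```

With uniform raw scores, each pixel's attention shifts heavily toward the token assigned to its region, which is the kind of text-to-region alignment the method aims for.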

Why it matters?

This matters because it could make video editing much easier and more powerful. With VideoGrain, people could make complex changes to videos just by describing what they want, without needing special skills or training. This could be huge for filmmakers, social media creators, and anyone who works with video, making it faster and easier to create exactly the content they imagine.

Abstract

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios. Our code, data, and demos are available at https://knightyxp.github.io/VideoGrain_project_page/