
VACE: All-in-One Video Creation and Editing

Zeyinzi Jiang, Zhen Han, Chaojie Mao, Jingfeng Zhang, Yulin Pan, Yu Liu

2025-03-11


Summary

This paper introduces VACE, an AI tool that lets users create and edit videos all in one place, such as turning text into videos, changing styles, or editing specific parts, without needing separate apps.

What's the problem?

Making consistent videos with AI is hard because editing different parts (like characters or backgrounds) often breaks the flow or looks choppy, and existing tools can’t handle both creation and complex edits together.

What's the solution?

VACE uses a unified interface called the Video Condition Unit (VCU) to combine text, images, masks, and editing instructions into a single input, and a 'Context Adapter' to keep videos smooth and consistent when making changes, even for tricky tasks like replacing objects or expanding scenes.
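To make the idea concrete, here is a minimal sketch of what a VCU-style input bundle might look like. This is an illustrative assumption, not the paper's actual API: the class name, field names, and the `task_kind` helper are hypothetical, but they mirror the paper's description of organizing editing frames, masks, and references into one unified input.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class VideoConditionUnit:
    """Hypothetical sketch of a unified video-task input (VCU-like)."""
    prompt: str                                                # text instruction
    frames: List[str] = field(default_factory=list)            # source video frames to edit (paths/IDs)
    masks: List[Optional[str]] = field(default_factory=list)   # per-frame edit masks; None = keep frame as-is
    references: List[str] = field(default_factory=list)        # reference images for identity or style

    def task_kind(self) -> str:
        """Infer which unified subtask this input bundle represents."""
        if self.frames and any(m is not None for m in self.masks):
            return "masked-video-to-video"
        if self.frames:
            return "video-to-video"
        if self.references:
            return "reference-to-video"
        return "text-to-video"

# Example: a masked editing request touching only the first frame
vcu = VideoConditionUnit(
    prompt="replace the red car with a bicycle",
    frames=["f000.png", "f001.png"],
    masks=["m000.png", None],
)
print(vcu.task_kind())  # masked-video-to-video
```

The point of such a structure is that one model can dispatch on the same input format for every subtask, rather than requiring a separate pipeline per task.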

Why it matters?

This helps creators make professional videos faster for social media, ads, or movies without juggling multiple tools, saving time and improving quality.

Abstract

Diffusion Transformer has demonstrated powerful capability and scalability in generating high-quality images and videos. Further pursuing the unification of generation and editing tasks has yielded significant progress in the domain of image content creation. However, due to the intrinsic demands for consistency across both temporal and spatial dynamics, achieving a unified approach for video synthesis remains challenging. We introduce VACE, which enables users to perform Video tasks within an All-in-one framework for Creation and Editing. These tasks include reference-to-video generation, video-to-video editing, and masked video-to-video editing. Specifically, we effectively integrate the requirements of various tasks by organizing video task inputs, such as editing, reference, and masking, into a unified interface referred to as the Video Condition Unit (VCU). Furthermore, by utilizing a Context Adapter structure, we inject different task concepts into the model using formalized representations of temporal and spatial dimensions, allowing it to handle arbitrary video synthesis tasks flexibly. Extensive experiments demonstrate that the unified model of VACE achieves performance on par with task-specific models across various subtasks. Simultaneously, it enables diverse applications through versatile task combinations. Project page: https://ali-vilab.github.io/VACE-Page/.