UniVideo: Unified Understanding, Generation, and Editing for Videos
Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen
2025-10-10
Summary
This paper introduces UniVideo, a new AI model that can create and edit videos based on text and image instructions, similar to how existing models work with images but now extended to video.
What's the problem?
Current AI models that generate content from multiple types of input, such as text and images, work well for images, but they haven't been effectively extended to videos. Video is harder: the model has to keep the content consistent across many frames, and it's difficult to get the AI to follow complex instructions for video editing.
What's the solution?
The researchers built UniVideo using two main parts: a 'brain' that understands instructions (called a Multimodal Large Language Model) and a 'creator' that actually generates the video (called a Multimodal DiT). The 'brain' figures out what the instructions mean and then guides the 'creator' as it makes the video, so the result follows the request and stays visually consistent. They trained this system on many different video tasks at once, so it can handle a wide range of requests. It can even combine tasks, like editing a video *and* changing its style.
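To make the dual-stream idea concrete, here is a minimal, hypothetical sketch of how such a pipeline could be wired: an "understanding" stream (a stand-in for the MLLM) turns a multimodal instruction into conditioning tokens, and a "generation" stream (a stand-in for the MMDiT) denoises video latents under that conditioning. The module names, sizes, and shapes below are illustrative assumptions, not the actual UniVideo implementation.

```python
import torch
import torch.nn as nn


class InstructionEncoder(nn.Module):
    """Stand-in for the MLLM: fuses text and image tokens into condition tokens."""
    def __init__(self, dim=512, num_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, text_tokens, image_tokens):
        # Concatenate both modalities into one sequence and contextualize it.
        return self.encoder(torch.cat([text_tokens, image_tokens], dim=1))


class VideoDenoiser(nn.Module):
    """Stand-in for the MMDiT: predicts noise on video latents while attending to the conditions."""
    def __init__(self, dim=512, num_layers=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, noisy_latents, condition_tokens):
        # noisy_latents: (batch, frames * patches, dim) flattened video latents.
        return self.decoder(noisy_latents, condition_tokens)


# Toy shapes: 2 clips, 16 text tokens, 64 image tokens, 8 frames x 32 patches each.
text_tokens = torch.randn(2, 16, 512)
image_tokens = torch.randn(2, 64, 512)
noisy_video_latents = torch.randn(2, 8 * 32, 512)

mllm = InstructionEncoder()
mmdit = VideoDenoiser()

condition = mllm(text_tokens, image_tokens)         # understanding stream
noise_pred = mmdit(noisy_video_latents, condition)  # generation stream
print(noise_pred.shape)  # torch.Size([2, 256, 512])
```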
Why it matters?
UniVideo is important because it's a big step towards AI that can easily create and edit videos based on what people want. It's not just good at specific tasks; it can generalize to new, unseen editing requests, even ones it wasn't specifically trained for, like adding a character to a video or changing the material of an object. This opens up possibilities for easier video creation and editing for everyone, and the researchers plan to release their model and code to help others build on this work.
Abstract
Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.
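As a hypothetical illustration of the "single multimodal instruction" paradigm the abstract describes, a composed request might bundle an edit, a style transfer, and the reference visuals into one input. The field names and structure below are assumptions for illustration only, not UniVideo's actual input schema.

```python
# One composed instruction: edit the video AND restyle it, with references attached.
composed_instruction = {
    "text": "Replace the red car with the bicycle from the reference image, "
            "then render the whole video in a watercolor style.",
    "source_video": "inputs/street_scene.mp4",         # video to edit
    "reference_images": ["inputs/bicycle.png"],        # identity/content reference
    "style_reference": "inputs/watercolor_sample.png"  # style reference
}

# In a dual-stream system like the one described above, the MLLM would read all
# of these fields jointly and guide the MMDiT, which generates the edited,
# restyled video in one pass.
print(composed_instruction["text"])
```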