Video-As-Prompt: Unified Semantic Control for Video Generation
Yuxuan Bian, Xin Chen, Zenan Li, Tiancheng Zhi, Shen Sang, Linjie Luo, Qiang Xu
2025-10-27
Summary
This paper introduces a new way to create videos from prompts, focusing on making the videos follow specific instructions without losing quality or needing to be retrained for each new instruction.
What's the problem?
Controllable video generation is still hard. Existing methods either introduce visible artifacts by forcing pixel-level structural controls onto tasks that don't suit them, or they only work for the specific conditions they were fine-tuned on and can't easily adapt to new ones. As a result, there is no single video generation system that can reliably follow *any* semantic instruction you give it.
What's the solution?
The researchers developed a system called Video-As-Prompt (VAP). Instead of trying to directly control every pixel, VAP uses a reference video as an example – essentially showing the system *what* you want, rather than *telling* it. It keeps a powerful existing video generation model (a Diffusion Transformer, or DiT) frozen and adds a plug-and-play component (a Mixture-of-Transformers, or MoT, expert) that learns from the example video without the base model forgetting what it already knows. A temporally biased position embedding keeps the reference frames from being mapped pixel-for-pixel onto the output. They also built a large dataset, VAP-Data, with over 100,000 paired videos spanning 100 semantic conditions, to train and test this system.
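The plug-and-play setup described above can be sketched in a few lines: a frozen backbone handles the target-video tokens, a separate trainable expert handles the reference-video tokens, the two streams attend jointly over the concatenated sequence, and the reference frames sit at strictly earlier time positions than the target frames. The following is a minimal, illustrative NumPy sketch under those assumptions, not the paper's implementation; all class names, weight shapes, and the single-head attention are invented for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # hidden dimension (hypothetical)

def attention(q, k, v):
    # standard scaled dot-product attention (single head)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

class MoTBlock:
    """One Mixture-of-Transformers-style block: the backbone and the
    expert keep separate projection weights, but both streams attend
    over the concatenated (reference + target) token sequence."""
    def __init__(self):
        # random stand-in weights; in VAP the backbone would be frozen
        # and only the expert weights would be trained
        self.w_backbone = rng.standard_normal((D, 3 * D)) / np.sqrt(D)
        self.w_expert = rng.standard_normal((D, 3 * D)) / np.sqrt(D)

    def __call__(self, ref_tokens, tgt_tokens):
        # each stream uses its own q/k/v projection
        q_r, k_r, v_r = np.split(ref_tokens @ self.w_expert, 3, axis=-1)
        q_t, k_t, v_t = np.split(tgt_tokens @ self.w_backbone, 3, axis=-1)
        # joint attention: keys/values from both streams are shared
        k = np.concatenate([k_r, k_t])
        v = np.concatenate([v_r, v_t])
        return attention(q_r, k, v), attention(q_t, k, v)

def temporal_positions(n_ref, n_tgt):
    """Temporally biased positions: reference frames are placed strictly
    before the target frames, so no reference frame shares a time index
    with a target frame and no pixel-wise mapping prior is implied."""
    return np.arange(-n_ref, 0), np.arange(n_tgt)
```

The key design point the sketch illustrates is that conditioning happens purely through shared attention context, not through pixel-aligned injection, which is what lets a frozen backbone follow a reference video it was never trained on.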
Why it matters?
This work is important because it moves us closer to a general-purpose video generator. As a single unified model, VAP performs as well as, or better than, specialized systems, yet it is far more flexible: it handles a much wider range of instructions, including unseen ones, without needing to be retrained. That flexibility is a real step toward easily creating videos of anything we can describe or demonstrate.
Abstract
Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.