In-Video Instructions: Visual Signals as Generative Control
Gongfan Fang, Xinyin Ma, Xinchao Wang
2025-11-25
Summary
This paper explores a new way to tell video-generating AI what to do, moving beyond just typing instructions. It focuses on embedding instructions *directly into* the video itself, like drawing arrows or writing text on the screen to guide the AI.
What's the problem?
Today, controlling what an AI generates in a video relies mainly on text prompts. These prompts can be vague and apply to the whole video at once, which makes it hard to give specific, localized instructions to individual objects or regions of the scene.
What's the solution?
The researchers found that powerful video models can effectively 'read' and follow instructions that appear visually in the frames themselves. They tested this by drawing cues such as motion arrows or text labels describing actions directly onto the input frame; the models then generated future frames that carried out those instructions. The approach works with state-of-the-art generators including Veo 3.1, Kling 2.5, and Wan 2.2.
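To make the mechanism concrete, here is a minimal sketch (not the authors' code) of how such a visual instruction might be rendered onto the input frame using Pillow. The file names, coordinates, colors, and label text are illustrative assumptions; the annotated frame would then be supplied to an image-to-video model through its normal image-conditioning input.

```python
# Minimal sketch of the In-Video Instruction idea: overlay an arrow and a text
# label onto the conditioning frame before passing it to an image-to-video model.
from PIL import Image, ImageDraw, ImageFont

def add_in_video_instruction(frame_path, out_path,
                             arrow_start=(220, 320), arrow_end=(420, 320),
                             label="move right"):
    """Draw a rightward motion arrow plus a short text instruction on a frame.

    All coordinates, colors, and the label are illustrative placeholders.
    """
    frame = Image.open(frame_path).convert("RGB")
    draw = ImageDraw.Draw(frame)

    # Arrow shaft: a thick horizontal line from the object toward its target.
    draw.line([arrow_start, arrow_end], fill=(255, 255, 0), width=6)

    # Simple triangular arrowhead pointing in the direction of motion (rightward here).
    x, y = arrow_end
    draw.polygon([(x, y), (x - 18, y - 10), (x - 18, y + 10)], fill=(255, 255, 0))

    # Overlaid text instruction just below the arrow; the default bitmap font
    # keeps the sketch dependency-free (a .ttf could be loaded via ImageFont.truetype).
    font = ImageFont.load_default()
    draw.text((arrow_start[0], arrow_start[1] + 12), label, fill=(255, 255, 0), font=font)

    frame.save(out_path)
    return out_path

# The annotated frame would then be fed to a video generator's usual
# image-conditioning input (e.g., Veo, Kling, or Wan) together with a prompt.
add_in_video_instruction("first_frame.png", "first_frame_annotated.png")
```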
Why it matters?
This is a big step because it allows much more precise and intuitive control over video generation. Instead of struggling to describe what you want in words, you can simply *show* the AI what should happen. That makes it easier to create complex, customized videos, especially when multiple objects each need to do something different.
Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities, enabling the prediction of future frames that adhere to the logical and physical cues in the current observation. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction. In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatial-aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects. Extensive experiments on three state-of-the-art generators, including Veo 3.1, Kling 2.5, and Wan 2.2, show that video models can reliably interpret and execute such visually embedded instructions, particularly in complex multi-object scenarios.