
The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

Hanlin Wang, Hao Ouyang, Qiuyu Wang, Yue Yu, Yihao Meng, Wen Wang, Ka Leong Cheng, Shuailei Ma, Qingyan Bai, Yixuan Li, Cheng Chen, Yanhong Zeng, Xing Zhu, Yujun Shen, Qifeng Chen

2025-12-19


Summary

This paper introduces WorldCanvas, a new system for generating realistic and controllable videos of events unfolding in a scene, letting users direct what happens through a combination of text descriptions, motion trajectories, and reference images.

What's the problem?

Existing methods for generating videos from text often struggle to create videos that are both consistent over time and truly responsive to user control. Simply using text can lead to illogical events, while controlling videos with just movement data doesn't allow for much creative direction. Current systems also have trouble keeping track of objects and their identities throughout the video, especially if they temporarily leave the scene.

What's the solution?

WorldCanvas solves this by combining three types of information: text to define *what* should happen, trajectories to define *how* things move and when, and reference images to define *what* objects should look like. This multimodal approach allows the system to generate videos with coherent action, consistent object appearances, and even surprising events that still make sense within the scene. It essentially creates a 'world' that responds to user instructions.
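To make the three input types concrete, here is a minimal sketch of how such a multimodal event prompt might be structured, following the abstract's description of trajectories that encode motion, timing, and visibility. This is not the authors' code; all names (such as EventPrompt and TrajectoryPoint) are hypothetical illustrations.

```python
# A minimal sketch (not the authors' implementation) of a multimodal event prompt:
# text for semantic intent, a trajectory with timing and visibility for motion,
# and an optional reference image fixing the object's appearance.
# All class and field names are hypothetical.

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TrajectoryPoint:
    """One waypoint of an object's planned motion."""
    frame: int                 # when the object should reach this position
    xy: Tuple[float, float]    # normalized image coordinates in [0, 1]
    visible: bool = True       # False lets the object temporarily leave the scene


@dataclass
class EventPrompt:
    """A single user-directed event combining the three modalities."""
    text: str                                # what should happen
    trajectory: List[TrajectoryPoint]        # how and when it moves
    reference_image: Optional[str] = None    # path to an image defining appearance


# Example: a dog (appearance taken from a reference photo) runs across the
# scene, briefly disappears behind an obstacle, then reappears.
dog_event = EventPrompt(
    text="a dog runs from left to right, passing behind the tree",
    trajectory=[
        TrajectoryPoint(frame=0,  xy=(0.1, 0.7)),
        TrajectoryPoint(frame=24, xy=(0.5, 0.7), visible=False),  # occluded
        TrajectoryPoint(frame=48, xy=(0.9, 0.7)),
    ],
    reference_image="dog.jpg",
)
```

In this sketch, the visibility flag is what would let the system handle objects that exit and re-enter the scene while keeping their identity consistent, one of the failure modes of prior methods noted above.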

Why it matters?

This work is important because it moves beyond simply predicting what might happen in a video to creating interactive simulations. Instead of a passive world model, WorldCanvas allows users to actively shape and explore virtual environments, which has potential applications in areas like game development, robotics, and creating training simulations.

Abstract

We present WorldCanvas, a framework for promptable world events that enables rich, user-directed simulation by combining text, trajectories, and reference images. Unlike text-only approaches and existing trajectory-controlled image-to-video methods, our multimodal approach combines trajectories -- encoding motion, timing, and visibility -- with natural language for semantic intent and reference images for visual grounding of object identity, enabling the generation of coherent, controllable events that include multi-agent interactions, object entry/exit, reference-guided appearance and counterintuitive events. The resulting videos demonstrate not only temporal coherence but also emergent consistency, preserving object identity and scene despite temporary disappearance. By supporting expressive world events generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators. Our project page is available at: https://worldcanvas.github.io/.