Plan-X: Instruct Video Generation via Semantic Planning
Lun Huang, You Xie, Hongyi Xu, Tianpei Gu, Chenxu Zhang, Guoxian Song, Zenan Li, Xiaochen Zhao, Linjie Luo, Guillermo Sapiro
2025-11-25
Summary
This paper introduces Plan-X, a new method for generating videos from text descriptions (and optional visual context), with the goal of making the resulting videos more accurate and better aligned with the user's intent.
What's the problem?
Current video generation models, while good at producing realistic visuals, often struggle to understand the bigger picture and plan out a sequence of events. This leads to videos that don't quite make sense, contain things that shouldn't be there (visual hallucinations), or fail to follow the given instructions, especially when those instructions are complex or involve interactions between people and objects.
What's the solution?
Plan-X addresses this by adding a 'Semantic Planner' to the pipeline. The planner is a multimodal language model that takes the text description and any existing images as input and autoregressively produces a step-by-step plan, represented as a sequence of 'semantic tokens'. Think of these tokens as a detailed storyboard. This storyboard then guides the video diffusion model, helping it create a video that is more logical and consistent with the original instructions. In short, Plan-X combines the planning ability of language models with the high-fidelity image synthesis ability of diffusion models.
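The two-stage design described above can be sketched in a few lines of toy Python. Note that this is a minimal illustration of the control flow only, not the paper's implementation: the class and function names (`SemanticPlanner`, `generate_video`), the use of integer IDs for semantic tokens, and the placeholder "diffusion" step are all assumptions made for clarity.

```python
# Hypothetical sketch of a Plan-X-style pipeline: an autoregressive planner
# emits semantic tokens, which then condition the frame-synthesis stage.
# All names and internals here are illustrative, not from the paper.
from dataclasses import dataclass


@dataclass
class SemanticPlanner:
    """Toy autoregressive planner: maps a prompt to a sequence of
    'semantic tokens' (here, plain integer IDs), one step at a time."""
    vocab_size: int = 1024
    max_steps: int = 8

    def plan(self, prompt: str) -> list[int]:
        tokens: list[int] = []
        state = sum(map(ord, prompt)) % self.vocab_size
        for _ in range(self.max_steps):
            # Stand-in for a language model's next-token prediction,
            # conditioned on the prompt and previously emitted tokens.
            state = (state * 31 + 7) % self.vocab_size
            tokens.append(state)
        return tokens


def generate_video(prompt: str, planner: SemanticPlanner) -> list[str]:
    """Each semantic token acts as a 'semantic sketch' conditioning one
    chunk of frames; the real diffusion model is replaced here by a
    placeholder that just records which token guided each chunk."""
    semantic_plan = planner.plan(prompt)
    return [f"frames<-token:{t}" for t in semantic_plan]


clip = generate_video("pour water into a glass", SemanticPlanner())
print(len(clip))  # one frame chunk per semantic token
```

The point of the structure is the separation of concerns: the planner handles long-horizon reasoning as discrete token prediction, while the second stage only has to render each locally coherent step.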
Why it matters?
This work is important because it improves the quality and reliability of AI-generated videos. By reducing hallucinations and ensuring the videos accurately reflect the user's intent, it opens up possibilities for more useful and creative applications, like creating instructional videos, animations, or even personalized stories.
Abstract
Diffusion Transformers have demonstrated remarkable capabilities in visual synthesis, yet they often struggle with high-level semantic reasoning and long-horizon planning. This limitation frequently leads to visual hallucinations and misalignments with user instructions, especially in scenarios involving complex scene understanding, human-object interactions, multi-stage actions, and in-context motion reasoning. To address these challenges, we propose Plan-X, a framework that explicitly enforces high-level semantic planning to instruct the video generation process. At its core lies a Semantic Planner, a learnable multimodal language model that reasons over the user's intent from both text prompts and visual context, and autoregressively generates a sequence of text-grounded spatio-temporal semantic tokens. These semantic tokens, complementary to high-level text prompt guidance, serve as structured "semantic sketches" over time for the video diffusion model, which excels at synthesizing high-fidelity visual details. Plan-X effectively combines the strength of language models in multimodal in-context reasoning and planning with the strength of diffusion models in photorealistic video synthesis. Extensive experiments demonstrate that our framework substantially reduces visual hallucinations and enables fine-grained, instruction-aligned video generation consistent with multimodal context.