Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen
2026-05-01
Summary
This paper examines the limitations of today's image-generating AI models, which have gotten very good at making things *look* realistic. It argues that these models need to move beyond making pretty pictures and start actually *understanding* the world they depict.
What's the problem?
Current image-generating AI excels at things like making photos look real, adding text to images, and following simple instructions. However, it struggles with understanding how things relate to each other in space, remembering details over time, keeping scenes consistent, and grasping cause and effect. For example, a model might draw a shadow pointing the wrong way or add an object that makes no sense in the scene around it. These systems focus too much on how things *appear* and not enough on how things *work*.
What's the solution?
The authors propose a way to think about the evolution of these models, breaking it down into five levels, from simple image creation to complex 'world-modeling', where the AI understands and interacts with a virtual environment. They also highlight key areas where technical improvements are needed, like better ways to guide the image creation process (sketched below), combining understanding and generation in a single model, improving how images are represented internally, and refining the training process. Finally, they point out that current methods for judging these models often focus too much on how good an image *looks* and not enough on whether it is logically consistent.
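As a concrete illustration of what 'guiding the image creation process' means in practice, here is a minimal sketch of classifier-free guidance, the standard steering technique in today's diffusion and flow models. The `model` interface and the `guidance_scale` default are illustrative assumptions, not code from the paper:

```python
def cfg_predict(model, x_t, t, cond, guidance_scale=5.0):
    """Classifier-free guidance: steer one denoising step toward a condition.

    model:          predicts noise (or velocity) from (x_t, t, cond);
                    cond=None yields the unconditional prediction.
                    This interface is an illustrative assumption.
    cond:           the conditioning signal, e.g. a text embedding.
    guidance_scale: values > 1 push the output toward the condition.
    """
    uncond = model(x_t, t, cond=None)  # prediction that ignores the prompt
    guided = model(x_t, t, cond=cond)  # prediction that uses the prompt
    # Extrapolate past the conditional prediction, away from the unconditional one.
    return uncond + guidance_scale * (guided - uncond)
```

Scales above 1 trade sample diversity for stronger adherence to the condition.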
Why it matters?
This work matters because it lays out a roadmap for the future of image generation. It pushes the field to build AI that doesn't just produce visually appealing images but actually understands the world, so its outputs are both realistic and logically sound. That is crucial for applications like virtual reality, robotics, and everyday creation of helpful, reliable visual content.
Abstract
Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.
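For readers unfamiliar with flow matching, the first technical driver named above, the following is a minimal sketch of the standard conditional flow matching training objective with a linear noise-to-data path; `velocity_model` and the tensor handling are illustrative placeholders, not details from this paper:

```python
import torch

def flow_matching_loss(velocity_model, x1):
    """One training step of conditional flow matching (linear path).

    velocity_model: any network v(x_t, t) that predicts a velocity field
                    (a placeholder here, not the paper's architecture).
    x1:             a batch of data samples such as image latents, shape (B, ...).
    """
    x0 = torch.randn_like(x1)                     # noise endpoint of the path
    t = torch.rand(x1.size(0), device=x1.device)  # one random time per sample
    t_b = t.view(-1, *[1] * (x1.dim() - 1))       # reshape for broadcasting
    xt = (1 - t_b) * x0 + t_b * x1                # point on the straight path
    target = x1 - x0                              # that path's constant velocity
    return ((velocity_model(xt, t) - target) ** 2).mean()
```

The network is regressed onto the constant velocity of the straight line from noise to data; generation then integrates the learned velocity field from t = 0 to t = 1.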