Generative World Renderer
Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang
2026-04-03
Summary
This research focuses on improving how computers understand and create realistic images and videos, specifically by addressing the gap between computer-generated content and the real world.
What's the problem?
Training computers to render images accurately, whether creating images from scene descriptions (forward rendering) or recovering the 3D scene properties behind an image (inverse rendering), is difficult because existing training datasets are not realistic enough and do not capture how scenes change over time. They lack the visual complexity and dynamic qualities of real-world footage, which hinders the development of truly believable computer graphics.
What's the solution?
The researchers built a massive new dataset by recording over 4 million continuous frames of high-quality video (720p at 30 FPS) from visually rich AAA video games. Using a dual-screen stitched capture technique, they recorded not just each pixel's color but also G-buffer channels describing the scene's geometry, materials, and lighting. This data lets models learn how light interacts with different surfaces and how scenes evolve over time, improving both forward and inverse rendering. They also developed an automatic evaluation protocol that uses a vision-language model to judge the semantic, spatial, and temporal consistency of generated images, so inverse rendering can be assessed on real-world footage that has no ground truth.
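To make the data layout concrete, here is a minimal sketch of how one captured frame (an RGB image paired with five G-buffer maps) might be organized. The channel names below are assumptions based on common deferred-rendering G-buffers; the summary says five channels are captured but does not enumerate them.

```python
from dataclasses import dataclass, field
import numpy as np

# Hypothetical channel names -- the source states five G-buffer channels
# are captured alongside RGB but does not list them.
GBUFFER_CHANNELS = ["albedo", "normal", "depth", "roughness", "irradiance"]

@dataclass
class Frame:
    """One 720p frame: an RGB image plus per-pixel G-buffer maps."""
    rgb: np.ndarray                               # (720, 1280, 3), uint8
    gbuffers: dict = field(default_factory=dict)  # channel name -> float32 array

def make_frame(height=720, width=1280):
    """Build a zero-filled dummy frame mirroring the capture layout."""
    gbufs = {
        "albedo":     np.zeros((height, width, 3), dtype=np.float32),
        "normal":     np.zeros((height, width, 3), dtype=np.float32),
        "depth":      np.zeros((height, width, 1), dtype=np.float32),
        "roughness":  np.zeros((height, width, 1), dtype=np.float32),
        "irradiance": np.zeros((height, width, 3), dtype=np.float32),
    }
    return Frame(rgb=np.zeros((height, width, 3), dtype=np.uint8), gbuffers=gbufs)

frame = make_frame()
assert set(frame.gbuffers) == set(GBUFFER_CHANNELS)  # five channels per frame
```

In this layout, a forward renderer would consume `frame.gbuffers` and predict `frame.rgb`, while an inverse renderer would do the reverse.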
Why it matters?
This work is important because it allows for more realistic and controllable image and video generation. It means computers can better understand the real world from images, and it opens up possibilities for editing the visual style of games and other content using simple text commands. Ultimately, this research brings us closer to creating truly immersive and interactive virtual experiences.
Abstract
Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.