FrameDiffuser: G-Buffer-Conditioned Diffusion for Neural Forward Frame Rendering

Ole Beisswenger, Jan-Niklas Dihlmann, Hendrik P. A. Lensch

2025-12-19

Summary

This paper introduces a new method called FrameDiffuser for creating realistic images in real-time, like in video games, based on information about the shapes and materials in a scene.

What's the problem?

Creating realistic images for interactive applications is currently difficult: existing methods either render each frame independently, so the video flickers instead of looking smooth and consistent, or they are too slow for typical consumer computers and need the entire video sequence up front before they can start, which is useless when a user's actions determine what happens next.

What's the solution?

FrameDiffuser solves this by building each new frame on the previous one, guided by information about the scene's geometry and materials. It's as if the model 'remembers' what the scene looked like before and uses that memory to make the next frame realistic and consistent. It combines two techniques: ControlNet keeps the image structure correct, while ControlLoRA keeps frames consistent over time, and a three-stage training strategy keeps this feedback loop stable. Importantly, it focuses on making one specific environment look really good rather than trying to be okay at everything.
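The autoregressive loop described above can be sketched in a few lines. This is a toy illustration, not FrameDiffuser's actual code: the function names (render_g_buffer, denoise), the channel counts, and the blending rule are all hypothetical stand-ins for the real engine pass and diffusion model.

```python
import numpy as np

H, W = 4, 4  # tiny frame size, for illustration only

def render_g_buffer(t):
    """Hypothetical engine pass: per-pixel geometry/material channels
    (e.g. normals, albedo, roughness) for frame t."""
    rng = np.random.default_rng(t)
    return rng.random((H, W, 8))  # 8 made-up G-buffer channels

def denoise(g_buffer, prev_frame):
    """Stand-in for the diffusion model: conditions on the current
    G-buffer (structural guidance) and the model's own previous RGB
    output (temporal guidance), and returns the next RGB frame."""
    structural = g_buffer[..., :3]          # toy "structure" signal
    return 0.7 * structural + 0.3 * prev_frame  # toy temporal carry-over

def run(num_frames):
    frames = []
    # Initial frame: no real previous output yet, so start from zeros.
    prev = denoise(render_g_buffer(0), np.zeros((H, W, 3)))
    frames.append(prev)
    for t in range(1, num_frames):
        # After the first frame, only incoming G-buffer data plus the
        # model's own previous output are needed -- no future frames,
        # so the loop can react to user input frame by frame.
        prev = denoise(render_g_buffer(t), prev)
        frames.append(prev)
    return frames

frames = run(5)
```

The key property this sketch captures is that each iteration depends only on data available at that moment, which is what makes the approach usable interactively.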

Why it matters?

This work is important because it allows for much more realistic graphics in interactive applications, like video games, without requiring extremely powerful computers. By focusing on a single environment, it achieves higher quality and faster performance than methods that try to generalize to all possible scenes, paving the way for more immersive and visually appealing experiences.

Abstract

Neural rendering for interactive applications requires translating geometric and material properties (G-buffer) to photorealistic images with realistic lighting on a frame-by-frame basis. While recent diffusion-based approaches show promise for G-buffer-conditioned image synthesis, they face critical limitations: single-image models like RGBX generate frames independently without temporal consistency, while video models like DiffusionRenderer are too computationally expensive for most consumer gaming setups and require complete sequences upfront, making them unsuitable for interactive applications where future frames depend on user input. We introduce FrameDiffuser, an autoregressive neural rendering framework that generates temporally consistent, photorealistic frames by conditioning on G-buffer data and the model's own previous output. After an initial frame, FrameDiffuser operates purely on incoming G-buffer data, comprising geometry, materials, and surface properties, while using its previously generated frame for temporal guidance, maintaining stable, temporally consistent generation over hundreds to thousands of frames. Our dual-conditioning architecture combines ControlNet for structural guidance with ControlLoRA for temporal coherence. A three-stage training strategy enables stable autoregressive generation. We specialize our model to individual environments, prioritizing consistency and inference speed over broad generalization, demonstrating that environment-specific training achieves superior photorealistic quality with accurate lighting, shadows, and reflections compared to generalized approaches.