Plenoptic Video Generation

Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin

2026-01-09

Summary

This paper introduces PlenopticDreamer, a new system for creating videos from a single image and a specified camera path. It builds on existing 'camera-controlled generative video re-rendering' techniques, which let you change the viewpoint of a scene in a video, and addresses their main weakness: keeping the scene consistent when it is viewed from multiple angles.

What's the problem?

Current methods for generating videos from a single image struggle to create consistent scenes when the camera moves around to different viewpoints. Because these systems rely on random processes to 'fill in' details, the generated video can appear disjointed or change unexpectedly as the camera angle shifts, lacking a coherent sense of space and time. Essentially, things don't quite line up when you look at the scene from different perspectives.

What's the solution?

PlenopticDreamer solves this by training a model to predict future frames of a video based on previous frames and the desired camera movement. It cleverly uses a 'video retrieval' system to find and use the most relevant parts of previously generated video as a guide, ensuring consistency. The training process also includes techniques to help the model learn faster, avoid errors building up over long videos, and handle extended video generation.
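The autoregressive loop with camera-guided retrieval described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: all names (`retrieve_salient_clips`, `generate_clip`, `camera_distance`, `plenoptic_rollout`) and the simple position-distance ranking are hypothetical stand-ins for the paper's retrieval strategy and video diffusion model.

```python
# Hypothetical sketch of an autoregressive rollout with camera-guided
# video retrieval, in the spirit of PlenopticDreamer. All function and
# field names are illustrative assumptions, not the paper's API.
import math

def camera_distance(cam_a, cam_b):
    """Toy pose distance: Euclidean gap between camera positions."""
    return math.dist(cam_a["position"], cam_b["position"])

def retrieve_salient_clips(history, target_cam, k=2):
    """Pick the k previously generated clips whose cameras are closest
    to the target viewpoint (the 'camera-guided retrieval' idea)."""
    ranked = sorted(history, key=lambda c: camera_distance(c["camera"], target_cam))
    return ranked[:k]

def generate_clip(conditions, target_cam):
    """Stand-in for the multi-in-single-out video-conditioned model:
    it would take the retrieved clips plus the target camera and
    synthesize the next clip."""
    return {"camera": target_cam, "frames": f"clip@{target_cam['position']}"}

def plenoptic_rollout(first_clip, camera_path, k=2):
    """Autoregressively re-render the scene along a camera path,
    conditioning each step on retrieved past clips so hallucinated
    regions stay consistent across views."""
    history = [first_clip]
    for target_cam in camera_path:
        conds = retrieve_salient_clips(history, target_cam, k)
        history.append(generate_clip(conds, target_cam))
    return history
```

The key design point is that each new clip is conditioned on the most view-relevant past generations rather than just the immediately preceding frames, which is what gives the method its spatio-temporal memory.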

Why it matters?

This research is important because it significantly improves the quality and consistency of videos generated from a single image. This has implications for creating realistic virtual environments, generating training data for robots, and potentially even creating new forms of visual content. The ability to accurately control the camera and maintain visual coherence across different viewpoints is a major step forward in this field.

Abstract

Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view settings, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address this, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/