AdaViewPlanner: Adapting Video Diffusion Models for Viewpoint Planning in 4D Scenes
Yu Li, Menghan Xia, Gongye Liu, Jianhong Bai, Xintao Wang, Conglang Zhang, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Yujiu Yang
2025-10-14
Summary
This paper explores using AI models that generate videos from text descriptions to help plan the best viewpoints for observing and interacting with dynamic 3D (i.e., 4D) scenes, such as navigating a virtual environment.
What's the problem?
Currently, it's difficult for computers to figure out the best places to 'look' in a dynamic 3D world in order to understand and interact with it effectively, and existing methods struggle to produce natural, useful viewpoints. The researchers observed that video generation models already capture how scenes naturally appear from different perspectives, so they wanted to see whether that implicit knowledge could be repurposed for viewpoint planning.
What's the solution?
The researchers developed a two-stage process. First, they fed a viewpoint-agnostic representation of the 4D scene into a pre-trained text-to-video model through an added learning branch, essentially teaching the model what the scene looks like; the video it then generates implicitly encodes a camera trajectory. Second, they added a component that recovers explicit camera poses from that video: it treats viewpoint extraction as a denoising process over the camera's position and orientation (its extrinsics), guided jointly by the generated video and the original 4D scene data. In effect, the video model proposes viewpoints, and the extra branch refines them into precise camera parameters.
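The second stage, denoising camera extrinsics, can be pictured as a standard diffusion sampling loop. Below is a minimal, self-contained sketch of such a loop in NumPy. It is not the paper's implementation: the network `eps_net`, the 6-DoF pose parameterization, the number of steps, and the cosine schedule are all illustrative assumptions.

```python
import numpy as np

def cosine_betas(T, s=0.008):
    # Cosine noise schedule (illustrative choice, not from the paper).
    t = np.linspace(0, T, T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return np.clip(1 - f[1:] / f[:-1], 0, 0.999)

def denoise_extrinsics(eps_net, video_feat, scene_feat, n_frames, T=50, seed=0):
    """DDPM-style ancestral sampling of per-frame camera extrinsics.

    eps_net(x_t, t, video_feat, scene_feat) -> predicted noise, same
    shape as x_t. Here each frame's extrinsic is a hypothetical 6-DoF
    vector (3 rotation params, 3 translation params). The generated
    video and 4D scene enter only as conditioning features.
    """
    rng = np.random.default_rng(seed)
    betas = cosine_betas(T)
    alphas = 1.0 - betas
    abar = np.cumprod(alphas)

    x = rng.standard_normal((n_frames, 6))  # start from pure noise
    for t in range(T - 1, -1, -1):
        eps = eps_net(x, t, video_feat, scene_feat)
        # Posterior mean of x_{t-1} given the predicted noise.
        x = (x - betas[t] / np.sqrt(1 - abar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:  # add noise on all but the final step
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

# Usage with a dummy noise predictor standing in for the learned branch:
poses = denoise_extrinsics(lambda x, t, v, s: np.zeros_like(x),
                           video_feat=None, scene_feat=None, n_frames=16)
```

The point of the sketch is the structure: the camera trajectory is not regressed in one shot but iteratively refined from noise, with the video and scene acting as hybrid conditions at every step.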
Why it matters?
This work shows that video generation models aren't just good at making videos; they can also be used to understand and navigate 3D environments. This is a step towards creating more realistic and interactive virtual experiences, and could eventually help robots understand and interact with the real world more effectively.
Abstract
Recent Text-to-Video (T2V) models have demonstrated powerful capability in visually simulating real-world geometry and physical laws, indicating their potential as implicit world models. Inspired by this, we explore the feasibility of leveraging the video generation prior for viewpoint planning in given 4D scenes, since videos inherently pair dynamic scenes with natural viewpoints. To this end, we propose a two-stage paradigm that adapts pre-trained T2V models for viewpoint prediction in a compatible manner. First, we inject the 4D scene representation into the pre-trained T2V model via an adaptive learning branch, where the 4D scene is viewpoint-agnostic and the conditionally generated video embeds the viewpoints visually. Then, we formulate viewpoint extraction as a hybrid-condition guided denoising process over camera extrinsics. Specifically, a camera extrinsic diffusion branch is introduced on top of the pre-trained T2V model, taking the generated video and the 4D scene as input. Experimental results show the superiority of our proposed method over existing competitors, and ablation studies validate the effectiveness of our key technical designs. To some extent, this work demonstrates the potential of video generation models for 4D interaction in the real world.