Infinite-Homography as Robust Conditioning for Camera-Controlled Video Generation

Min-Jung Kim, Jeongho Kim, Hoiyeong Jin, Junha Hyung, Jaegul Choo

2025-12-23

Summary

This paper introduces a new method, called InfCam, for generating videos of a scene from new camera viewpoints, giving creators control over the 'camera' in post-production. It focuses on producing realistic videos when the camera path is changed, even for scenes with moving objects.

What's the problem?

Currently, making videos from new viewpoints is tricky. Existing methods either rely on estimating the depth of objects in the scene (which can be inaccurate) or need a lot of example videos with different camera movements to learn from. If the depth estimation is wrong, the new view looks distorted, and if the training data only covers a narrow range of camera paths, the model struggles to follow new trajectories accurately. Essentially, it's hard to match the requested camera pose exactly while still producing a high-quality, believable video.

What's the solution?

InfCam solves this by avoiding depth estimation altogether. Instead, it uses a technique called 'infinite homography warping', which encodes the camera's rotation directly as a 2D warp inside the video diffusion model's latent space, so the model knows exactly how the camera is rotating. Because the rotation is handled exactly by this warp, the model only has to learn the remaining parallax: the depth-dependent shifts caused by the camera's translation. The authors also built a data augmentation pipeline that expands existing synthetic multiview datasets to cover a wider variety of camera trajectories and focal lengths, giving the model more to learn from. Together, these pieces produce videos that follow the requested camera pose more accurately while staying visually convincing.
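
To make 'infinite homography warping' concrete, here is a rough sketch of the underlying warp in plain image space, assuming a pinhole camera with known intrinsics and a known relative rotation (the variable names and numbers are illustrative, not from the paper). InfCam applies this kind of rotation-only warp inside the diffusion model's 2D latent space rather than on pixels; the key point is that the warp needs no depth.

```python
# A minimal sketch of infinite-homography warping, assuming known pinhole
# intrinsics (K_src, K_tgt) and a relative camera rotation R. NOT the authors'
# code: InfCam applies an analogous warp in the diffusion model's 2D latent
# space, whereas this example warps ordinary image pixels.
import cv2
import numpy as np

def infinite_homography(K_src, K_tgt, R):
    # H_inf = K_tgt @ R @ inv(K_src) maps source pixels to target pixels as if
    # the scene lay on the plane at infinity, i.e. the exact warp for a pure
    # camera rotation, requiring no depth at all.
    return K_tgt @ R @ np.linalg.inv(K_src)

def warp_frame(frame, K_src, K_tgt, R):
    H_inf = infinite_homography(K_src, K_tgt, R)
    h, w = frame.shape[:2]
    # Regions with no source correspondence stay black; a generative model is
    # left to fill them in and to add the translation-induced parallax.
    return cv2.warpPerspective(frame, H_inf, (w, h))

# Example: rotate the camera 10 degrees about its vertical (yaw) axis.
f, cx, cy = 800.0, 640.0, 360.0                     # illustrative intrinsics
K = np.array([[f, 0, cx], [0, f, cy], [0, 0, 1.0]])
theta = np.deg2rad(10.0)
R_yaw = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                  [0.0,           1.0, 0.0],
                  [-np.sin(theta), 0.0, np.cos(theta)]])
frame = np.zeros((720, 1280, 3), dtype=np.uint8)    # placeholder video frame
warped = warp_frame(frame, K, K, R_yaw)             # same intrinsics for both views
```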

Why it matters?

This research is important because it makes it easier for filmmakers and video editors to manipulate camera angles *after* a scene has been filmed. This opens up possibilities for creative control and fixing mistakes without needing to reshoot. By creating more realistic and accurate novel views, InfCam could be used in virtual reality, special effects, and other applications where generating videos from different perspectives is crucial.

Abstract

Recent progress in video diffusion models has spurred growing interest in camera-controlled novel-view video generation for dynamic scenes, aiming to provide creators with cinematic camera control capabilities in post-production. A key challenge in camera-controlled video generation is ensuring fidelity to the specified camera pose, while maintaining view consistency and reasoning about occluded geometry from limited observations. To address this, existing methods either train a trajectory-conditioned video generation model on a trajectory-video pair dataset, or estimate depth from the input video to reproject it along a target trajectory and generate the unprojected regions. Nevertheless, existing methods struggle to generate camera-pose-faithful, high-quality videos for two main reasons: (1) reprojection-based approaches are highly susceptible to errors caused by inaccurate depth estimation; and (2) the limited diversity of camera trajectories in existing datasets restricts the learned models. To address these limitations, we present InfCam, a depth-free, camera-controlled video-to-video generation framework with high pose fidelity. The framework integrates two key components: (1) infinite homography warping, which encodes 3D camera rotations directly within the 2D latent space of a video diffusion model; conditioning on this noise-free rotational information, the residual parallax term is predicted through end-to-end training to achieve high camera-pose fidelity; and (2) a data augmentation pipeline that transforms existing synthetic multiview datasets into sequences with diverse trajectories and focal lengths. Experimental results demonstrate that InfCam outperforms baseline methods in camera-pose accuracy and visual fidelity, generalizing well from synthetic to real-world data. Link to our project page: https://emjay73.github.io/InfCam/
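
A note on the 'residual parallax term' mentioned in the abstract: in standard multi-view geometry, the reprojection of a pixel splits into a depth-free rotational part given exactly by the infinite homography and a depth-dependent parallax part caused by camera translation. The sketch below (illustrative Python, not the paper's code; the intrinsics, rotation, translation, and depth values are made up) checks this decomposition numerically, which is the geometric reason the rotation can be supplied noise-free while only the parallax needs to be predicted.

```python
# Plane-at-infinity + parallax decomposition (standard multi-view geometry,
# not code from the paper). For a homogeneous source pixel x with depth Z:
#   x_tgt ~ H_inf @ x + (K @ t) / Z,   where H_inf = K @ R @ inv(K).
# The first term is depth-free (pure rotation); only the second needs depth.
import numpy as np

K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0,   0.0,   1.0]])
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],      # source-to-target rotation
              [0.0,           1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.2, 0.0, 0.0])                           # source-to-target translation
x_src = np.array([700.0, 300.0, 1.0])                   # homogeneous source pixel
Z = 5.0                                                 # its depth in the source view

# Ground-truth reprojection using the depth explicitly.
X_src = Z * np.linalg.inv(K) @ x_src                    # back-project to 3D
p_full = K @ (R @ X_src + t)
x_tgt_full = p_full[:2] / p_full[2]

# Decomposition: depth-free rotational warp + depth-dependent parallax.
H_inf = K @ R @ np.linalg.inv(K)
p_decomp = H_inf @ x_src + (K @ t) / Z
x_tgt_decomp = p_decomp[:2] / p_decomp[2]

assert np.allclose(x_tgt_full, x_tgt_decomp)            # identical up to scale
print(x_tgt_full)                                       # reprojected pixel in the target view
```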