High-Fidelity Novel View Synthesis via Splatting-Guided Diffusion

Xiang Zhang, Yang Zhang, Lukas Mehl, Markus Gross, Christopher Schroers

2025-02-20

Summary

This paper introduces SplatDiff, a new way to create high-quality images of objects or scenes from different angles, even when you only have one original picture to work with. It's like giving a computer the ability to imagine what something looks like from the side or back after only seeing it from the front.

What's the problem?

Current methods for creating new views of objects or scenes from limited information each have trade-offs. Splatting-based methods tend to distort the shapes of objects, while diffusion-based methods can hallucinate textures or details that weren't actually there in the original image. It's like trying to draw the back of a house when you've only seen the front - you might get the general shape right, but the details could be way off.

What's the solution?

The researchers created SplatDiff, which combines two techniques: pixel splatting, which is good at getting the overall shape and position right, and video diffusion, which is good at filling in realistic details. They also added a component called a 'texture bridge' that adaptively fuses features from both, helping the details in the new views match the original image. It's like giving the computer both a rough sketch of what the new view should look like and a detailed reference of the textures and colors to use.
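To make the idea concrete, here is a heavily simplified toy sketch of the two-stage concept, not the paper's actual code. The function names (`splat_to_target`, `texture_bridge`) and the fusion weights are hypothetical, the "splatting" is reduced to a one-dimensional pixel shift instead of a real 3D reprojection, and the diffusion model's output is replaced by a random placeholder image:

```python
import numpy as np

def splat_to_target(src, depth, shift):
    """Toy stand-in for pixel splatting: move each pixel horizontally by a
    disparity proportional to inverse depth. Real splatting projects pixels
    through a full 3D camera transform; this is only an illustration."""
    h, w, c = src.shape
    out = np.zeros_like(src)
    hit = np.zeros((h, w), dtype=bool)  # which target pixels received content
    disparity = (shift / depth).round().astype(int)
    for y in range(h):
        for x in range(w):
            nx = x + disparity[y, x]
            if 0 <= nx < w:
                out[y, nx] = src[y, x]
                hit[y, nx] = True
    return out, hit

def texture_bridge(splatted, generated, hit, alpha=0.8):
    """Hypothetical adaptive fusion: trust the splatted texture where pixels
    actually landed, and fall back to the generated (diffusion) content in
    the holes and disocclusions that splatting leaves behind."""
    weight = np.where(hit[..., None], alpha, 0.0)
    return weight * splatted + (1 - weight) * generated

# Toy demo: a 4x4 image at constant depth, shifted by one pixel.
rng = np.random.default_rng(0)
src = rng.random((4, 4, 3))
depth = np.ones((4, 4))
generated = rng.random((4, 4, 3))  # placeholder for a diffusion model's output
splatted, hit = splat_to_target(src, depth, shift=1.0)
fused = texture_bridge(splatted, generated, hit)
```

The key design point this mirrors is that geometry comes from the splatted image (where pixels land) while appearance in uncovered regions comes from the generative model, with the fusion deciding which source to trust per pixel.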

Why it matters?

This matters because it could make virtual reality, video games, and special effects in movies look much more realistic. With just one picture, we could create convincing views from any angle. This could also help in fields like architecture or product design, where you might want to see how something looks from different angles without having to physically move around it or build it first. It's a big step towards making computers better at understanding and recreating the 3D world around us.

Abstract

Despite recent advances in Novel View Synthesis (NVS), generating high-fidelity views from single or sparse observations remains a significant challenge. Existing splatting-based approaches often produce distorted geometry due to splatting errors. While diffusion-based methods leverage rich 3D priors to achieve improved geometry, they often suffer from texture hallucination. In this paper, we introduce SplatDiff, a pixel-splatting-guided video diffusion model designed to synthesize high-fidelity novel views from a single image. Specifically, we propose an aligned synthesis strategy for precise control of target viewpoints and geometry-consistent view synthesis. To mitigate texture hallucination, we design a texture bridge module that enables high-fidelity texture generation through adaptive feature fusion. In this manner, SplatDiff leverages the strengths of splatting and diffusion to generate novel views with consistent geometry and high-fidelity details. Extensive experiments verify the state-of-the-art performance of SplatDiff in single-view NVS. Additionally, without extra training, SplatDiff shows remarkable zero-shot performance across diverse tasks, including sparse-view NVS and stereo video conversion.