SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

Andreas Engelhardt, Mark Boss, Vikram Voleti, Chun-Han Yao, Hendrik P. A. Lensch, Varun Jampani

2025-10-10

Summary

This paper introduces a new system called SViM3D that can create realistic 3D models of objects, including how they reflect light, from just a single image.

What's the problem?

Existing methods for creating 3D models from single images often struggle with accurately representing how light interacts with the object's surface. They either use simplified models for reflectance, which don't look very realistic, or require extra steps to figure out the material properties needed for things like relighting and editing the object's appearance. Basically, making a 3D model that looks good under different lighting conditions is hard.
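To make the stakes concrete, here is a minimal sketch of why per-pixel material properties matter for relighting. Given albedo, a surface normal, and roughness at one point, you can re-shade that point under any new light; without those properties you are stuck with the baked-in appearance of the input image. The shading model below (Lambertian diffuse plus Blinn-Phong specular) is an illustrative stand-in, not the paper's actual rendering model, and all names are hypothetical.

```python
import math

def relight_point(albedo, normal, roughness, light_dir, view_dir,
                  light_color=(1.0, 1.0, 1.0)):
    """Shade one surface point under a new light, given per-pixel PBR
    parameters. Lambertian diffuse + Blinn-Phong specular -- a simple
    stand-in for illustration, not SViM3D's shading model."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def normalize(v):
        n = math.sqrt(dot(v, v)) or 1.0
        return tuple(x / n for x in v)

    n = normalize(normal)
    l = normalize(light_dir)
    v = normalize(view_dir)
    # Half vector between light and view directions (Blinn-Phong).
    h = normalize(tuple(a + b for a, b in zip(l, v)))

    diffuse = max(dot(n, l), 0.0)
    # Map roughness to a shininess exponent: rougher -> broader highlight.
    shininess = 2.0 / max(roughness ** 2, 1e-4) - 2.0
    specular = max(dot(n, h), 0.0) ** shininess if diffuse > 0 else 0.0

    return tuple(c * (a * diffuse + specular)
                 for c, a in zip(light_color, albedo))
```

Calling this for every pixel with a new `light_dir` relights the whole object, which is exactly what a plain RGB reconstruction cannot do.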

What's the solution?

The researchers built upon recent advances in video diffusion models – AI systems that can generate realistic videos. They extended such a model so that, alongside each generated view of the object, it also predicts per-pixel material properties (how light bounces off the surface) and surface normals (the local shape). Because the 'camera' within the AI is explicitly controllable, the generated views are consistent enough to be relit realistically and to serve as the basis for building a full 3D asset. They also added several techniques to improve the quality of the generated models, since recovering materials from a single image is inherently ambiguous.
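The setup above can be pictured as a per-view output tensor with extra channels beyond RGB, generated under explicit camera conditioning. The sketch below illustrates that layout; the channel names, ordering, and camera parameterization are assumptions for illustration, not taken from the paper's code.

```python
import math

# Hypothetical per-view channel layout: each generated frame carries RGB
# plus surface normals and spatially varying PBR parameters.
CHANNELS = {
    "rgb": slice(0, 3),         # shaded color for this view
    "normal": slice(3, 6),      # per-pixel surface normal
    "albedo": slice(6, 9),      # base color (PBR)
    "roughness": slice(9, 10),  # microfacet roughness (PBR)
    "metallic": slice(10, 11),  # metalness (PBR)
}
NUM_CHANNELS = 11

def split_view_channels(pixel):
    """Split one pixel's flat channel vector into named maps."""
    assert len(pixel) == NUM_CHANNELS
    return {name: pixel[s] for name, s in CHANNELS.items()}

def orbit_cameras(num_views, radius=2.0):
    """Explicit camera control: evenly spaced azimuths around the object,
    fed to the diffusion model as per-frame conditioning."""
    return [
        {"azimuth": 2 * math.pi * i / num_views,
         "elevation": 0.0,
         "radius": radius}
        for i in range(num_views)
    ]
```

In this picture, the diffusion model is run once per camera pose from `orbit_cameras`, and the resulting normal and material maps across views act as the neural prior from which a relightable 3D asset is optimized.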

Why it matters?

This work is important because it allows for the creation of high-quality, relightable 3D models from a single image, which is much easier than traditional methods. This has big implications for fields like augmented and virtual reality, movie making, and video game development, where realistic 3D assets are crucial.

Abstract

We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.