SAM 3D: 3Dfy Anything in Images
SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli
2025-11-21
Summary
This paper introduces SAM 3D, a new AI model that creates 3D models of objects from just a single picture. It focuses on making these models look realistic even when parts of the object are hidden or the scene is cluttered and busy.
What's the problem?
Creating accurate 3D models from images is hard, especially with real-world photos: objects are often partially blocked (occluded), the lighting is poor, or the scene is cluttered. Existing methods struggle in these situations because they need a huge amount of 3D training data, and collecting that data is expensive and time-consuming. This 'data barrier' has held back better 3D reconstruction.
What's the solution?
The researchers built SAM 3D and trained it with a staged approach. First, they created a large dataset of 3D objects annotated with shape, texture, and position (pose). They did this by combining computer-generated images with real-world images, using both human annotators and the model itself to help label the data. Then, they used a multi-stage training process: SAM 3D first learns to predict 3D structure from a single image on simpler, synthetic data, and is then refined on real-world examples; a rough sketch of this staged recipe follows.
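To make the staged recipe above concrete, here is a minimal, hypothetical sketch in PyTorch: a toy predictor is pretrained on a large synthetic set and then fine-tuned ("aligned") on a smaller real-world set. The Toy3DPredictor, the random stand-in datasets, the loss, and the hyperparameters are illustrative assumptions, not the SAM 3D architecture or training code.

```python
# Hypothetical sketch of staged training: synthetic pretraining, then real-world alignment.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class Toy3DPredictor(nn.Module):
    """Stand-in for an image -> (geometry, texture, pose) predictor."""
    def __init__(self, image_dim=512, latent_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(image_dim, 1024), nn.ReLU())
        self.geometry_head = nn.Linear(1024, latent_dim)  # e.g. latent shape code
        self.texture_head = nn.Linear(1024, latent_dim)   # e.g. latent texture code
        self.pose_head = nn.Linear(1024, 9)                # e.g. rotation + translation + scale

    def forward(self, image_features):
        h = self.backbone(image_features)
        return self.geometry_head(h), self.texture_head(h), self.pose_head(h)

def run_stage(model, loader, epochs, lr):
    """One training stage: plain supervised regression to the 3D annotations."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, shape_gt, texture_gt, pose_gt in loader:
            shape, texture, pose = model(images)
            loss = (nn.functional.mse_loss(shape, shape_gt)
                    + nn.functional.mse_loss(texture, texture_gt)
                    + nn.functional.mse_loss(pose, pose_gt))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def fake_dataset(n):
    """Random tensors standing in for (image, shape, texture, pose) annotations."""
    return TensorDataset(torch.randn(n, 512), torch.randn(n, 256),
                         torch.randn(n, 256), torch.randn(n, 9))

model = Toy3DPredictor()
# Stage 1: large synthetic corpus, higher learning rate.
run_stage(model, DataLoader(fake_dataset(1024), batch_size=64), epochs=1, lr=1e-3)
# Stage 2: smaller real-world, human-/model-annotated set, lower learning rate.
run_stage(model, DataLoader(fake_dataset(128), batch_size=32), epochs=1, lr=1e-4)
```

The point of the two stages is that plentiful synthetic data teaches the basic image-to-3D mapping, while the scarcer, visually grounded real-world annotations correct it for occlusion, clutter, and realistic appearance.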
Why it matters?
SAM 3D is a big step forward because it creates much better 3D models from single images than previous methods: in head-to-head tests, human evaluators preferred its results by at least 5 to 1. This has potential applications in areas like virtual reality, robotics, and creating 3D content for games or movies. The researchers are also releasing their code, model weights, and a new benchmark, which will help other researchers build on their work and improve 3D reconstruction technology.
Abstract
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
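The abstract's "at least a 5:1 win rate" refers to pairwise human preference judgments. The snippet below is a hypothetical illustration of how such a win rate is tallied; the judgments are made up, and the paper's actual evaluation protocol (raters, baselines, handling of ties) may differ.

```python
# Hypothetical tally of a pairwise human-preference study, illustrating a "5:1 win rate".
from collections import Counter

# Each entry: which reconstruction the rater preferred for one image (made-up data).
judgments = ["sam3d", "sam3d", "sam3d", "baseline", "sam3d", "sam3d",
             "sam3d", "sam3d", "baseline", "sam3d", "sam3d", "sam3d"]

counts = Counter(judgments)
win_rate = counts["sam3d"] / counts["baseline"]  # wins per loss against the baseline
print(f"SAM 3D preferred {counts['sam3d']} times, baseline {counts['baseline']} times "
      f"-> win rate {win_rate:.1f}:1")
```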