Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
Wenyue Chen, Wenjue Chen, Peng Li, Qinghe Wang, Xu Jia, Heliang Zheng, Rongfei Jia, Yuan Liu, Ronggang Wang
2026-03-30
Summary
This paper introduces a new method called Know3D for creating 3D models from images, focusing on improving the often-unpredictable back sides of those models.
What's the problem?
Currently, when a computer creates a 3D model from a single picture, the parts you *can't* see (like the back of an object) are essentially guessed at random. This can produce unrealistic or unwanted shapes, and the result is hard to control because the model lacks a strong prior for what *should* be there. Existing models struggle to make those unseen parts logically consistent with both the visible parts and the user's request.
What's the solution?
The researchers developed Know3D, which combines vision-language models (the multimodal models behind modern chatbots) with 3D generation technology. The VLM 'understands' what the back of the object should look like based on a text description, and that understanding then guides the 3D model's creation process. Concretely, the VLM supplies the semantic ideas, and a 'diffusion model' translates those ideas into actual 3D geometry, producing a more controlled and sensible back view.
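The pipeline described above can be sketched at a very high level: text goes into a VLM, the VLM's hidden states are projected into the diffusion model's conditioning space, and the diffusion process then pulls the noisy 3D latent toward that semantic guidance. The sketch below is a toy NumPy illustration of that data flow under stated assumptions; the function names, dimensions, and the simplistic "nudge toward the conditioning mean" update are all hypothetical stand-ins, not the paper's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
VLM_DIM, LATENT_DIM, SEQ_LEN = 64, 32, 8

def vlm_hidden_states(prompt: str) -> np.ndarray:
    """Stand-in for the VLM encoder: maps a back-view text prompt to a
    sequence of hidden states. A real system would run a multimodal LLM."""
    rng_p = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng_p.standard_normal((SEQ_LEN, VLM_DIM))

def project_to_latent(h: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Bridge step: a learned projection would map VLM hidden states into
    the diffusion model's conditioning space; here w is a random matrix."""
    return h @ w

def denoise_step(x: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """Toy denoising update: nudge the noisy 3D latent toward the mean of
    the injected conditioning tokens, scaled by the step size t."""
    guidance = cond.mean(axis=0)
    return x + t * (guidance - x)

w_proj = rng.standard_normal((VLM_DIM, LATENT_DIM))
cond = project_to_latent(
    vlm_hidden_states("a backpack with two side pockets"), w_proj)

x = rng.standard_normal(LATENT_DIM)   # noisy 3D latent for the unseen region
x0 = x.copy()
for t in np.linspace(1.0, 0.1, 10):   # simplified denoising schedule
    x = denoise_step(x, cond, t * 0.1)
```

Each denoising step moves the latent a little closer to the text-derived guidance, which is the intuition behind making the back view controllable by language rather than random.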
Why it matters?
This work is important because it moves 3D generation closer to being truly controllable. Instead of getting random results for the unseen parts of a 3D model, users can now use language to specify what they want, leading to more realistic and useful 3D creations. It suggests a new direction for building 3D models that better understand and respond to human instructions.
Abstract
Recent advances in 3D generation have improved the fidelity and geometric details of synthesized 3D assets. However, due to the inherent ambiguity of single-view observations and the lack of robust global structural priors caused by limited 3D training data, the unseen regions generated by existing models are often stochastic and difficult to control, which may sometimes fail to align with user intentions or produce implausible geometries. In this paper, we propose Know3D, a novel framework that incorporates rich knowledge from multimodal large language models into 3D generative processes via latent hidden-state injection, enabling language-controllable generation of the back-view for 3D assets. We utilize a VLM-diffusion-based model, where the VLM is responsible for semantic understanding and guidance. The diffusion model acts as a bridge that transfers semantic knowledge from the VLM to the 3D generation model. In this way, we successfully bridge the gap between abstract textual instructions and the geometric reconstruction of unobserved regions, transforming the traditionally stochastic back-view hallucination into a semantically controllable process, demonstrating a promising direction for future 3D generation models.
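The abstract's "latent hidden-state injection" is not detailed here, but a common mechanism for feeding one model's hidden states into a diffusion backbone is cross-attention, where the 3D latent tokens query the injected VLM states. The following is a minimal sketch of that general mechanism, assuming cross-attention as the injection pathway; the paper's actual injection scheme, token counts, and widths may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # shared model width (illustrative)

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x, h, wq, wk, wv):
    """Queries come from the diffusion latent tokens x; keys and values come
    from the injected VLM hidden states h, so the text semantics steer
    the update applied to the 3D latent."""
    q, k, v = x @ wq, h @ wk, h @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return x + attn @ v  # residual injection of semantic guidance

x = rng.standard_normal((4, D))  # diffusion latent tokens (e.g. back-view patches)
h = rng.standard_normal((6, D))  # VLM hidden states for the text instruction
wq, wk, wv = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))
out = cross_attention(x, h, wq, wk, wv)
```

The residual form means the latent keeps its geometric content while being biased toward regions consistent with the instruction, matching the abstract's framing of the diffusion model as a bridge between language and geometry.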