Repurposing Geometric Foundation Models for Multi-view Diffusion
Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu
2026-03-24
Summary
This paper explores a new way to create images of an object from different viewpoints, like rotating a 3D model on your screen. It focuses on improving the quality and consistency of these generated views.
What's the problem?
Currently, when computers generate images of an object from different angles, they often struggle to keep the object geometrically consistent across views. Existing methods operate in a 'latent space' (essentially a compressed representation of the image) that is view-independent and does not capture the object's 3D structure, so the generated views can contradict each other. Think of it like trying to sculpt something without fully understanding its shape.
What's the solution?
The researchers developed a new framework called Geometric Latent Diffusion (GLD). Instead of a standard image-based latent space, GLD builds its latent space from the features of 'geometric foundation models', networks that already understand the 3D geometry of objects and how views of the same object correspond. Because the system starts from a representation that encodes the object's form, the new views it generates are more accurate and consistent across viewpoints. A 'diffusion' process then creates the images, gradually removing noise to reveal a clear picture.
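To make the idea concrete, here is a minimal sketch (not the paper's implementation) of running a standard diffusion forward process in the feature space of a frozen encoder rather than on VAE image latents. The encoder, feature dimension, and noise schedule below are illustrative assumptions; in GLD the encoder would be a pretrained geometric foundation model and a denoiser plus an RGB decoder would be trained on top.

```python
import numpy as np

# Illustrative sketch only: diffusion in a frozen encoder's feature space.
# The encoder here is a random stand-in for a geometric foundation model.
rng = np.random.default_rng(0)

def frozen_geometric_encoder(images):
    """Stand-in for a pretrained, frozen geometric encoder.
    Maps (V, H, W, 3) multi-view images to (V, D) latent features."""
    V = images.shape[0]
    flat = images.reshape(V, -1)
    W = rng.standard_normal((flat.shape[1], 16)) / np.sqrt(flat.shape[1])
    return flat @ W  # (V, 16) per-view latents

def add_noise(z0, t, betas):
    """DDPM forward process: z_t = sqrt(abar_t)*z0 + sqrt(1-abar_t)*eps."""
    abar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * eps, eps

# Toy multi-view batch: 4 views of an 8x8 RGB object.
views = rng.standard_normal((4, 8, 8, 3))
z0 = frozen_geometric_encoder(views)       # diffusion operates here,
betas = np.linspace(1e-4, 0.02, 1000)      # not on VAE image latents
zt, eps = add_noise(z0, t=500, betas=betas)

# A denoiser would be trained to predict eps from (zt, t); the denoised
# latents would then be mapped back to RGB by a learned decoder.
print(z0.shape, zt.shape)
```

The key design point this sketch highlights is that only the latent space changes: the diffusion machinery itself is standard, but it now operates on features that already encode cross-view geometry.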
Why it matters?
This work matters because it improves both the quality and the efficiency of generating images from new viewpoints: GLD outperforms previous VAE- and RAE-based approaches and trains more than 4.4x faster than the VAE-based alternative. Surprisingly, it achieves competitive results even though its diffusion model is trained from scratch, without the massive text-to-image pretraining that state-of-the-art methods rely on. This could have big implications for applications like creating 3D models from photos, virtual reality, and robotics, where accurate view synthesis is crucial.
Abstract
While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approaches typically operate in a view-independent VAE latent space. In this paper, we propose Geometric Latent Diffusion (GLD), a framework that repurposes the geometrically consistent feature space of geometric foundation models as the latent space for multi-view diffusion. We show that these features not only support high-fidelity RGB reconstruction but also encode strong cross-view geometric correspondences, providing a well-suited latent space for NVS. Our experiments demonstrate that GLD outperforms both VAE and RAE on 2D image quality and 3D consistency metrics, while accelerating training by more than 4.4x compared to the VAE latent space. Notably, GLD remains competitive with state-of-the-art methods that leverage large-scale text-to-image pretraining, despite training its diffusion model from scratch without such generative pretraining.