
Aligned Novel View Image and Geometry Synthesis via Cross-modal Attention Instillation

Min-Seop Kwak, Junho Kim, Sangdoo Yun, Dongyoon Han, Taekyoung Kim, Seungryong Kim, Jin-Hwa Kim

2025-06-16

Summary

This paper introduces a diffusion-based method that creates both images and their 3D shapes (geometry) from new viewpoints in a way that keeps the two aligned with each other. It frames the task as warping-and-inpainting and uses a process called cross-modal attention distillation, which lets the image-generation branch guide the geometry branch so the two outputs match. The paper also introduces proximity-based mesh conditioning, which uses depth and surface information to improve the results.
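To make the warping-and-inpainting idea concrete, here is a minimal NumPy sketch of the warping step: pixels from a reference view are back-projected with their depths and reprojected into the target camera, and any pixels the warp cannot fill are left as holes for the inpainting model. This is an illustrative sketch assuming a simple pinhole camera and a grayscale image; the function name and interface are ours, not from the paper, whose warping operates inside the diffusion framework rather than on raw pixels.

```python
import numpy as np

def forward_warp(src_img, src_depth, K, rel_pose):
    """Warp a source view into a target camera.

    src_img   : (h, w) grayscale image
    src_depth : (h, w) per-pixel depth
    K         : (3, 3) camera intrinsics
    rel_pose  : (4, 4) source-to-target rigid transform

    Returns the warped image and a boolean mask; unfilled pixels
    (mask == False) are the holes a diffusion model would inpaint.
    """
    h, w = src_depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous pixels
    cam = np.linalg.inv(K) @ pix * src_depth.ravel()          # back-project to 3D
    cam_h = np.vstack([cam, np.ones(h * w)])
    tgt = K @ (rel_pose[:3] @ cam_h)                          # reproject into target view
    u = np.round(tgt[0] / tgt[2]).astype(int)
    v = np.round(tgt[1] / tgt[2]).astype(int)
    warped = np.zeros_like(src_img, dtype=float)              # zeros = holes to inpaint
    filled = np.zeros((h, w), dtype=bool)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (tgt[2] > 0)
    warped[v[ok], u[ok]] = src_img.ravel()[ok]
    filled[v[ok], u[ok]] = True
    return warped, filled
```

With the identity pose the warp reproduces the source view exactly; with a real viewpoint change, occluded and out-of-frame regions stay unfilled, which is precisely what the generative model is asked to complete.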

What's the problem?

The problem is that previous methods for generating new views of images and their corresponding 3D shapes either require many images with precise camera poses, or generate images and geometry separately and then struggle to align them. This leads to inconsistencies where the picture and the 3D shape don't match, making it hard to get high-quality results for both at once.

What's the solution?

The solution is a system that treats generating new views of images and their geometry as inpainting: existing views are warped into the target viewpoint, and the missing regions are filled in. The key idea is to share attention maps from the image-generation model with the geometry-generation model, distilling them during training and applying them during generation, so the two branches learn together and produce aligned, consistent results. In addition, proximity-based mesh conditioning blends depth and surface-normal signals, improving the geometry's accuracy and filtering out errors.
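The attention-sharing idea can be sketched in a few lines: compute attention weights once from the image branch's queries and keys, then apply those same weights to both the image values and the geometry values, so each geometry token aggregates context from exactly the locations the image branch attends to. This is a simplified single-head sketch under our own naming; the paper's actual mechanism operates inside multi-head diffusion transformer layers.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shared_attention(q_img, k_img, v_img, v_geo):
    """Single-head attention whose weights come from the image branch.

    q_img, k_img, v_img : (n, d) image-branch queries/keys/values
    v_geo               : (n, d_geo) geometry-branch values

    The geometry output reuses the image branch's attention map,
    which is the alignment signal cross-modal distillation exploits.
    """
    d = q_img.shape[-1]
    attn = softmax(q_img @ k_img.T / np.sqrt(d))  # (n, n) attention map
    out_img = attn @ v_img                        # normal image attention
    out_geo = attn @ v_geo                        # geometry borrows the same map
    return out_img, out_geo, attn
```

Because both outputs are mixtures under the identical attention map, structures generated in the image (edges, object boundaries) and in the geometry are driven by the same spatial correspondences, which is what keeps them aligned.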

Why it matters?

This matters because it lets AI create detailed new views of scenes with matching images and 3D shapes even when only a few reference images are available. That benefits applications such as 3D reconstruction, virtual reality, and gaming, and any technology that needs accurate 3D models paired with realistic images, making these tasks faster, more reliable, and more accessible.

Abstract

A diffusion-based framework generates aligned novel views of images and geometry using warping-and-inpainting with cross-modal attention distillation and proximity-based mesh conditioning, achieving high-fidelity synthesis and 3D completion.