
LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan

2025-07-04


Summary

This paper introduces LangScene-X, a new AI method that reconstructs 3D scenes with embedded language information from just a few 2D images. It combines appearance, geometry, and semantics to build detailed 3D models that can answer open-ended language queries.

What's the problem?

The problem is that existing 3D reconstruction methods need many images from different angles and complicated calibration, which is impractical and often produces errors or distorted results when only a few views are available.

What's the solution?

The researchers developed a TriMap video diffusion model that generates consistent RGB images, surface geometry, and semantic segmentation maps from sparse input views. They also created a Language Quantized Compressor that encodes language features compactly and efficiently, letting the system generalize across different scenes without per-scene retraining. Finally, they align the compressed language information with the reconstructed 3D surfaces so the scene supports flexible, open-ended queries.
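To make the "Language Quantized Compressor" idea concrete, here is a minimal sketch of how compressing language features with a learned codebook can work in principle: high-dimensional language embeddings (for example, 512-dimensional CLIP-style features) are squeezed through a small bottleneck and snapped to the nearest entry of a discrete codebook, so each 3D surface point only needs to store a compact code. All class names, dimensions, and the training objective below are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative vector-quantization compressor for language features.
# Assumed sizes and layers; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageQuantizedCompressor(nn.Module):
    def __init__(self, feat_dim=512, code_dim=8, num_codes=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))
        self.codebook = nn.Embedding(num_codes, code_dim)  # learned discrete codes

    def forward(self, lang_feat):
        z = self.encoder(lang_feat)                    # (N, code_dim) latent
        # Snap each latent to its nearest codebook entry (vector quantization).
        dists = torch.cdist(z, self.codebook.weight)   # (N, num_codes)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        z_q = z + (z_q - z).detach()
        recon = self.decoder(z_q)
        return recon, idx

# Toy usage: compress a batch of 512-d language features and reconstruct them.
lqc = LanguageQuantizedCompressor()
feats = torch.randn(16, 512)
recon, codes = lqc(feats)
loss = F.mse_loss(recon, feats)  # reconstruction loss; a commitment term is also typical
```

The point of such a bottleneck is that the scene only has to carry small discrete codes per surface point rather than full language embeddings, which keeps the 3D representation lightweight while still allowing the original features to be recovered for open-ended queries.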

Why it matters?

This matters because LangScene-X makes 3D reconstruction faster and more flexible, even with limited images, and allows natural language interaction with the 3D scenes. This can help in fields like robotics, augmented reality, and digital twins where understanding and interacting with 3D spaces is crucial.

Abstract

LangScene-X uses a TriMap video diffusion model and Language Quantized Compressor to generate 3D consistent multi-modality information from sparse 2D views, enabling open-ended language queries and superior generalization.