4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer

Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang

2025-12-05


Summary

This paper introduces a new way to help computers understand videos and respond to questions about what's happening in them, focusing on connecting what's visually seen with language.

What's the problem?

Currently, systems that understand videos rely on building a detailed 3D model of each scene individually, which takes a lot of computing power and doesn't work well when trying to understand new, different scenes. It's like having to rebuild a Lego castle from scratch every time you want to ask a question about a slightly different castle.

What's the solution?

The researchers created a system called 4DLangVGGT that uses a type of artificial intelligence called a Transformer to process both the visual information and the language at the same time. It learns to connect what things *are* with where and when they *are* in the video, without needing to rebuild a 3D model for every single scene. This allows it to understand videos more efficiently and apply what it learns to new videos more easily.

Why does it matter?

This work is important because it makes it more practical to build AI systems that can truly understand and interact with the real world. This has big implications for things like robots, virtual reality, and augmented reality, where AI needs to understand dynamic environments and respond to natural language commands.

Abstract

Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction primarily rely on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, thereby enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and directly applied during inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT
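To make the idea of open-vocabulary querying over a language-aligned feature space more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that per-pixel geometry-aware features are mapped by a single linear head into a semantic space and then compared with a text embedding by cosine similarity. The function names (`project`, `relevance_map`), the random weights, the tensor sizes, and the 0.5 threshold are all hypothetical and chosen only for illustration; the actual StreamVGGT backbone and SBD are Transformer-based and are not reproduced here.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length, guarding against a zero norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, 1e-8) for x in vec]

def project(feature, weight):
    """Hypothetical linear head: map a C-dim geometry feature to a unit-norm D-dim semantic feature."""
    # weight holds D rows of C values each (an assumption for this sketch).
    return normalize([sum(w * f for w, f in zip(row, feature)) for row in weight])

def relevance_map(frames, weight, text_embedding):
    """Cosine similarity between every per-pixel semantic feature and one text embedding."""
    text = normalize(text_embedding)
    return [
        [[sum(a * b for a, b in zip(project(px, weight), text)) for px in row]
         for row in frame]
        for frame in frames
    ]

# Toy stand-ins: T frames of H x W pixels, each carrying a C-dim feature.
random.seed(0)
T, H, W, C, D = 2, 3, 3, 6, 4
frames = [[[[random.gauss(0, 1) for _ in range(C)] for _ in range(W)]
           for _ in range(H)] for _ in range(T)]
weight = [[random.gauss(0, 1) for _ in range(C)] for _ in range(D)]
text_embedding = [random.gauss(0, 1) for _ in range(D)]  # placeholder for a text-encoder output

scores = relevance_map(frames, weight, text_embedding)
# Illustrative threshold: pixels scoring above it are treated as matching the query.
mask = [[[s > 0.5 for s in row] for row in frame] for frame in scores]
```

In this sketch the projection weights are shared across all frames and scenes, which mirrors the feed-forward property described in the abstract: a new scene needs only a forward pass rather than per-scene optimization.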
