4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
Chiao-An Yang, Ryo Hachiuma, Sifei Liu, Subhashree Radhakrishnan, Raymond A. Yeh, Yu-Chiang Frank Wang, Min-Hung Chen
2025-12-22
Summary
This paper focuses on improving how well multimodal large language models (AI systems that can understand both images and text) process and understand videos that change over time and have three-dimensional depth.
What's the problem?
Current AI models struggle with understanding videos because they have trouble grasping how things move and change in 3D space over time. They often treat videos as a series of still images and miss the dynamic relationships between objects. Also, existing benchmarks for these models don't challenge them to understand specific regions within a video scene.
What's the solution?
The researchers created a new AI model called 4D-RGPT, which is specifically designed to better understand videos by capturing both 3D structure and how things change over time. They trained it using a technique called Perceptual 4D Distillation (P4D), which transfers knowledge about 3D structure and time from a frozen expert model into the new one. Finally, they built a new, more challenging benchmark called R4D-Bench that requires the AI to understand specific regions within dynamic 3D scenes.
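The distillation idea described above can be sketched as a toy feature-matching step. Everything here is illustrative: this summary does not specify the paper's actual losses, feature spaces, or expert model, so the MSE objective, array shapes, and function names below are assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_expert_features(video, dim=64):
    """Stand-in for a pretrained 4D expert; its weights are never updated."""
    return rng.standard_normal((video.shape[0], dim))

def project(student_feats, W):
    """Trainable linear head mapping student features into the expert's space."""
    return student_feats @ W

def distillation_loss(student_proj, teacher_feats):
    """MSE between projected student features and frozen teacher features."""
    return np.mean((student_proj - teacher_feats) ** 2)

# Toy setup: T frames, student dim D_s, teacher dim D_t (all hypothetical).
T, D_s, D_t = 8, 32, 64
video = rng.standard_normal((T, 3, 16, 16))   # dummy video frames
teacher = frozen_expert_features(video, D_t)  # frozen distillation target
student = rng.standard_normal((T, D_s))       # student-side video features
W = rng.standard_normal((D_s, D_t)) * 0.1

loss_before = distillation_loss(project(student, W), teacher)

# Gradient descent on the projection W only (teacher stays frozen).
for _ in range(500):
    pred = project(student, W)
    grad = 2.0 / (T * D_t) * student.T @ (pred - teacher)  # dL/dW for the MSE
    W -= 0.1 * grad

final = distillation_loss(project(student, W), teacher)
```

In the real method the student would be the MLLM's video branch and the teacher a 4D perception model, but the structure is the same: only the student side receives gradients, and the loss pulls its representations toward the expert's.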
Why it matters?
This work is important because it pushes the boundaries of what AI can understand about the real world. Better video understanding has many potential applications, such as self-driving cars interpreting scenes, robots interacting with their environment, and more advanced video analysis across industries.
Abstract
Despite advances in Multimodal LLMs (MLLMs), their ability to reason over 3D structures and temporal dynamics remains limited, constrained by weak 4D perception and temporal understanding. Existing 3D and 4D Video Question Answering (VQA) benchmarks also emphasize static scenes and lack region-level prompting. We tackle these issues by introducing: (a) 4D-RGPT, a specialized MLLM designed to capture 4D representations from video inputs with enhanced temporal perception; (b) Perceptual 4D Distillation (P4D), a training framework that transfers 4D representations from a frozen expert model into 4D-RGPT for comprehensive 4D perception; and (c) R4D-Bench, a benchmark for depth-aware dynamic scenes with region-level prompting, built via a hybrid automated and human-verified pipeline. Our 4D-RGPT achieves notable improvements on both existing 4D VQA benchmarks and the proposed R4D-Bench benchmark.