
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction

Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, Rakesh Ranjan

2025-05-28


Summary

This paper introduces VLM-3R, a new system that helps AI models better understand both pictures and words by teaching them to reconstruct 3D structure from ordinary video while following instructions at the same time.

What's the problem?

The problem is that most AI models struggle to truly understand the 3D shape and motion of objects in videos because they only see flat, 2D images, which makes it hard for them to reason about space and time the way humans do.

What's the solution?

To solve this, the researchers created VLM-3R, which trains the model with 3D reconstructive instruction tuning: it derives 3D geometry from single-camera (monocular) video frames and learns to connect that geometry with language instructions. This helps the model see and reason about objects in a more realistic way, including how things move and interact over time.
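To make the general idea concrete, here is a minimal sketch (not the authors' code) of how geometry tokens derived from monocular video frames might be fused with ordinary 2D visual tokens and an instruction before being passed to a language model. All module names, dimensions, and the fusion scheme are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of "reconstructive instruction tuning"-style fusion:
# 2D appearance tokens + 3D geometry tokens + instruction tokens -> language model.

import torch
import torch.nn as nn


class GeometryAwareVLMSketch(nn.Module):
    def __init__(self, vis_dim=768, geo_dim=512, lm_dim=1024):
        super().__init__()
        # Stand-ins for a 2D visual encoder and a 3D reconstruction encoder
        # that estimates per-frame geometry/camera tokens from monocular video
        # (both are placeholder linear layers here).
        self.visual_encoder = nn.Linear(3 * 16 * 16, vis_dim)
        self.geometry_encoder = nn.Linear(3 * 16 * 16, geo_dim)

        # Projections that map both token streams into the LM embedding space.
        self.vis_proj = nn.Linear(vis_dim, lm_dim)
        self.geo_proj = nn.Linear(geo_dim, lm_dim)

        # A tiny transformer stands in for the language-model backbone.
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
        self.language_model = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, frame_patches, instruction_embeds):
        # frame_patches: (batch, num_patches, 3*16*16) flattened video patches
        # instruction_embeds: (batch, num_text_tokens, lm_dim) embedded instruction
        vis_tokens = self.vis_proj(self.visual_encoder(frame_patches))
        geo_tokens = self.geo_proj(self.geometry_encoder(frame_patches))

        # Concatenate appearance, geometry, and instruction tokens so the LM
        # can reason jointly over space, time, and language.
        tokens = torch.cat([vis_tokens, geo_tokens, instruction_embeds], dim=1)
        return self.language_model(tokens)


if __name__ == "__main__":
    model = GeometryAwareVLMSketch()
    frames = torch.randn(1, 64, 3 * 16 * 16)   # 64 patches from video frames
    instruction = torch.randn(1, 12, 1024)      # 12 embedded instruction tokens
    out = model(frames, instruction)
    print(out.shape)  # torch.Size([1, 140, 1024])
```

The key design point this sketch illustrates is that geometric cues are supplied as extra tokens alongside the usual visual tokens, so the language model can be instruction-tuned to use them rather than having to infer 3D structure from 2D features alone.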

Why it matters?

This is important because it lets AI better understand the real world, which can help in areas like robotics, virtual reality, and any technology that needs to see and think about objects in three dimensions.

Abstract

VLM-3R, a framework for Vision-Language Models, incorporates 3D reconstructive instruction tuning to process monocular video frames and perform embodied reasoning with robust visual-spatial and temporal contextual understanding.