WinT3R: Window-Based Streaming Reconstruction with Camera Token Pool

Zizun Li, Jianjun Zhou, Yifan Wang, Haoyu Guo, Wenzheng Chang, Yang Zhou, Haoyi Zhu, Junyi Chen, Chunhua Shen, Tong He

2025-09-08

Summary

This paper introduces WinT3R, a feed-forward computer vision model that estimates where a camera is and builds a detailed 3D point map of the surrounding environment online, as new frames arrive.

What's the problem?

Existing methods for creating 3D maps and tracking camera position face a challenge: they either produce high-quality maps but are too slow for real-time applications, or they are fast but sacrifice the detail and accuracy of the map. It's hard to get both speed and quality at the same time.

What's the solution?

The researchers address this with a 'sliding window' approach: the model processes a small group of recent images together, so the frames in the window can exchange information and improve the 3D reconstruction without a large increase in computation. They also represent each camera compactly and keep these representations in a global 'camera token pool,' which makes camera pose estimates more reliable without sacrificing efficiency. In short, information is shared within each window, and a compact record of the cameras seen so far is maintained across the whole stream.
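The two ideas above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the window size, the stride, the overlap between windows, and the helper names `sliding_windows`, `make_camera_token`, and `run_stream` are all assumptions made for illustration, and the "token" is a placeholder tuple rather than a learned feature. The sketch only shows the data flow: frames are grouped into windows, and each window can read a global pool holding one compact token per earlier frame.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def sliding_windows(frames: Iterable[int], window_size: int = 4,
                    stride: int = 2) -> Iterator[List[int]]:
    """Yield overlapping windows of the most recent frames from a stream.

    window_size and stride are illustrative values, not taken from the paper.
    """
    buf = deque(maxlen=window_size)  # keeps only the latest frames
    since_last = 0                   # frames received since the last window
    for frame in frames:
        buf.append(frame)
        since_last += 1
        if len(buf) == window_size and since_last >= stride:
            yield list(buf)
            since_last = 0


def make_camera_token(frame: int) -> Tuple[str, int]:
    """Hypothetical stand-in for a compact per-frame camera representation."""
    return ("cam", frame)


def run_stream(frames: Iterable[int], window_size: int = 4,
               stride: int = 2) -> Tuple[List[Tuple[str, int]], List[int]]:
    """Process a stream window by window while growing a global token pool."""
    pool: List[Tuple[str, int]] = []  # one token per frame seen so far
    context_sizes: List[int] = []     # pool size visible to each window
    for window in sliding_windows(frames, window_size, stride):
        context_sizes.append(len(pool))
        # Only frames that are new to the stream contribute fresh tokens;
        # overlapping frames were already added by the previous window.
        new_frames = window if not pool else window[-stride:]
        pool.extend(make_camera_token(f) for f in new_frames)
    return pool, context_sizes


if __name__ == "__main__":
    pool, context_sizes = run_stream(range(8))
    print(len(pool), context_sizes)
```

In this sketch the per-window work depends on the window size, while the pool grows by only one small entry per frame, which is one way to picture how global context could be kept without reprocessing earlier images.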

Why does it matter?

This work matters because applications such as robotics, self-driving cars, and augmented reality need to understand their surroundings quickly and accurately. The authors report that WinT3R achieves state-of-the-art online reconstruction quality, camera pose estimation, and reconstruction speed across diverse datasets, which makes real-time 3D reconstruction more practical for these applications.

Abstract

We present WinT3R, a feed-forward reconstruction model capable of online prediction of precise camera poses and high-quality point maps. Previous methods suffer from a trade-off between reconstruction quality and real-time performance. To address this, we first introduce a sliding window mechanism that ensures sufficient information exchange among frames within the window, thereby improving the quality of geometric predictions without large computation. In addition, we leverage a compact representation of cameras and maintain a global camera token pool, which enhances the reliability of camera pose estimation without sacrificing efficiency. These designs enable WinT3R to achieve state-of-the-art performance in terms of online reconstruction quality, camera pose estimation, and reconstruction speed, as validated by extensive experiments on diverse datasets. Code and model are publicly available at https://github.com/LiZizun/WinT3R.