InfiniteVGGT: Visual Geometry Grounded Transformer for Endless Streams

Shuai Yuan, Yantai Yang, Xiaotian Yang, Xupeng Zhang, Zhonghao Zhao, Lingming Zhang, Zhipeng Zhang

2026-01-06

Summary

This paper introduces InfiniteVGGT, a new system for estimating 3D geometry from continuous, potentially endless video streams, along with Long3D, a new benchmark for testing how well such systems hold up over very long sequences.

What's the problem?

Currently, there's a trade-off between how accurately a model can recover the 3D geometry of a scene and how long it can keep processing video. Offline methods produce strong geometry but must see the whole video at once, so they can't handle live, ongoing streams. Streaming methods can run live, but they drift: their accuracy degrades over long sequences as they 'forget' earlier parts of the scene. Building a system that reliably tracks 3D geometry from a video that keeps going and going has remained out of reach.

What's the solution?

The researchers created InfiniteVGGT, a causal transformer built around a 'rolling' memory: a KV cache with a fixed size limit that stores important information about the scene and intelligently discards older, less relevant entries to make room for new frames. Because the pruning rule is training-free and doesn't rely on attention weights, it stays compatible with fast kernels like FlashAttention, which never materialize those weights. The result is a system that can process video indefinitely without its memory growing or its accuracy collapsing; a minimal sketch of the idea follows.
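Below is a minimal sketch of what such a rolling KV cache could look like. The importance score used here (the L2 norm of each key) is a stand-in assumption, not the paper's actual pruning rule; the point is that any score computable without attention weights keeps the pruning 'attention-agnostic' and therefore FlashAttention-friendly. `RollingKVCache` and its methods are illustrative names, not the released code's API.

```python
# Sketch of a bounded, "rolling" KV cache with training-free pruning.
# The key-norm score is an assumed heuristic for illustration only.

import torch

class RollingKVCache:
    def __init__(self, capacity: int):
        self.capacity = capacity  # hard bound on cached tokens
        self.keys = None          # shape: (num_tokens, head_dim)
        self.values = None

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        """Add the new frame's KV entries, then prune back to capacity."""
        self.keys = k if self.keys is None else torch.cat([self.keys, k], dim=0)
        self.values = v if self.values is None else torch.cat([self.values, v], dim=0)
        if self.keys.shape[0] > self.capacity:
            self._prune()

    def _prune(self) -> None:
        # Attention-agnostic importance: needs no attention weights, so it
        # works with FlashAttention, which never exposes them. Keeping the
        # top-`capacity` entries and dropping the rest "rolls" the memory
        # forward without any retraining.
        scores = self.keys.norm(dim=-1)
        keep = scores.topk(self.capacity).indices.sort().values  # keep temporal order
        self.keys = self.keys[keep]
        self.values = self.values[keep]
```

In a real transformer the cache would hold separate keys and values per layer and per head; the single-tensor version above is just to keep the pruning logic visible.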

Why it matters?

This work is important because it enables systems that continuously understand the 3D world around them, like self-driving cars or robots navigating complex environments. To prove this, the authors also created Long3D, a benchmark of continuous video sequences roughly 10,000 frames long, designed specifically to test how well streaming systems perform over extended periods and giving the field a standard way to measure progress; a sketch of what such an evaluation loop might look like follows.
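As a rough illustration only: the snippet below sketches how a long-horizon streaming evaluation could be structured. `model.step`, the sequence format, and `pointmap_error` are all hypothetical placeholders (Long3D's real protocol and metrics may differ); the key idea is logging error per frame, so drift over thousands of frames stays visible rather than being averaged away in one final score.

```python
import torch

def pointmap_error(pred: torch.Tensor, gt: torch.Tensor) -> float:
    # Mean per-point distance; a stand-in metric, not Long3D's actual one.
    return (pred - gt).norm(dim=-1).mean().item()

def evaluate_stream(model, sequence, log_every=500):
    """Feed ~10,000 frames one at a time and track per-frame geometry error."""
    errors = []
    for t, (frame, gt_geometry) in enumerate(sequence):
        pred = model.step(frame)  # hypothetical causal, frame-by-frame API
        errors.append(pointmap_error(pred, gt_geometry))
        if (t + 1) % log_every == 0:
            # A rising trend across these windows would indicate long-term drift.
            recent = errors[-log_every:]
            print(f"frames up to {t + 1}: mean error = {sum(recent) / len(recent):.4f}")
    return errors
```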

Abstract

The grand vision of enabling persistent, large-scale 3D visual geometry understanding is shackled by the irreconcilable demands of scalability and long-term stability. While offline models like VGGT achieve inspiring geometry capability, their batch-based nature renders them irrelevant for live systems. Streaming architectures, though the intended solution for live operation, have proven inadequate. Existing methods either fail to support truly infinite-horizon inputs or suffer from catastrophic drift over long sequences. We shatter this long-standing dilemma with InfiniteVGGT, a causal visual geometry transformer that operationalizes the concept of a rolling memory through a bounded yet adaptive and perpetually expressive KV cache. Capitalizing on this, we devise a training-free, attention-agnostic pruning strategy that intelligently discards obsolete information, effectively "rolling" the memory forward with each new frame. Fully compatible with FlashAttention, InfiniteVGGT finally alleviates the compromise, enabling infinite-horizon streaming while outperforming existing streaming methods in long-term stability. The ultimate test for such a system is its performance over a truly infinite horizon, a capability that has been impossible to rigorously validate due to the lack of extremely long-term, continuous benchmarks. To address this critical gap, we introduce the Long3D benchmark, which, for the first time, enables a rigorous evaluation of continuous 3D geometry estimation on sequences of about 10,000 frames. This provides the definitive evaluation platform for future research in long-term 3D geometry understanding. Code is available at: https://github.com/AutoLab-SAI-SJTU/InfiniteVGGT