
VGG-T^3: Offline Feed-Forward 3D Reconstruction at Scale

Sven Elflein, Ruilong Li, Sérgio Agostinho, Zan Gojcic, Laura Leal-Taixé, Qunjie Zhou, Aljosa Osep

2026-02-27


Summary

This paper introduces a new way to quickly build 3D models from a large collection of 2D images, making the process much faster than existing methods.

What's the problem?

Currently, creating detailed 3D models from many images takes a really long time and a lot of computer power. The amount of processing needed increases dramatically as you add more images – specifically, it grows with the *square* of the number of images. This makes it hard to reconstruct large scenes efficiently because of memory and computational limitations.
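A toy back-of-envelope calculation makes the scaling concrete (the token count per image is an illustrative assumption, not a number from the paper): full attention compares every token with every other token, so doubling the images quadruples the work, while a per-image method only doubles it.

```python
# Illustrative cost model: full (softmax) attention vs. a linear-cost method.
# tokens_per_image=100 is an arbitrary assumed value for illustration.

def attention_pairs(num_images, tokens_per_image=100):
    """Token-pair comparisons for full attention over all images (quadratic)."""
    n_tokens = num_images * tokens_per_image
    return n_tokens * n_tokens

def linear_pairs(num_images, tokens_per_image=100):
    """Comparisons for a method doing a fixed amount of work per image (linear)."""
    return num_images * tokens_per_image * tokens_per_image

# Doubling the image count quadruples full-attention work but only doubles linear work.
assert attention_pairs(200) == 4 * attention_pairs(100)
assert linear_pairs(200) == 2 * linear_pairs(100)
```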

What's the solution?

The researchers traced the problem to how the model stores information about the scene's geometry: a Key-Value cache that keeps growing as images are added. They found a way to compress this information into a small, fixed-size neural network called a Multi-Layer Perceptron. This 'squeezing' happens while the model is running, not beforehand, through a process called 'test-time training'. As a result, the cost of reconstruction scales linearly with the number of images, like methods that process images one at a time, while still capturing the scene as a whole.
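The idea of replacing a growing Key-Value cache with a fixed-size network can be illustrated with a minimal sketch. This is not the paper's architecture: the sizes, learning rate, squared-error objective, and the choice to update only the output layer are all simplifying assumptions made here for illustration. The point is that key-to-value associations are written into a fixed set of MLP weights by gradient steps at inference time, so memory does not grow with the number of stored pairs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden = 8, 32  # illustrative sizes, not from the paper

W1 = rng.normal(scale=0.3, size=(hidden, d))  # fixed random feature layer
W2 = np.zeros((d, hidden))                    # "fast weights" updated at test time

def features(key):
    return np.tanh(W1 @ key)

def write(key, value, lr=0.05):
    """One test-time-training step: nudge W2 so the MLP maps `key` to `value`."""
    global W2
    h = features(key)
    err = W2 @ h - value            # prediction error for this association
    W2 -= lr * np.outer(err, h)     # gradient step on 0.5 * ||err||^2

def read(key):
    """Query the memory; cost is constant no matter how much was stored."""
    return W2 @ features(key)

# Store a few key/value pairs with interleaved gradient steps, then retrieve.
keys = [rng.normal(size=d) for _ in range(3)]
vals = [rng.normal(size=d) for _ in range(3)]
for _ in range(200):
    for k, v in zip(keys, vals):
        write(k, v)

# All associations are recovered from a parameter set of fixed size.
for k, v in zip(keys, vals):
    assert np.allclose(read(k), v, atol=1e-2)
```

Unlike a KV cache, whose size (and attention cost) grows with every new image, the weights `W1`/`W2` here stay the same size regardless of how many associations are written.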

Why it matters?

This new method is a big step forward because it significantly speeds up 3D reconstruction. It's over eleven times faster than previous approaches while still creating accurate models. This speed-up opens the door to applications like quickly building virtual environments or allowing robots to understand their surroundings in real-time, and it also allows the model to find where a new image fits within the reconstructed scene.

Abstract

We present a scalable 3D reconstruction model that addresses a critical limitation in offline feed-forward methods: their computational and memory requirements grow quadratically w.r.t. the number of input images. Our approach is built on the key insight that this bottleneck stems from the varying-length Key-Value (KV) space representation of scene geometry, which we distill into a fixed-size Multi-Layer Perceptron (MLP) via test-time training. VGG-T^3 (Visual Geometry Grounded Test Time Training) scales linearly w.r.t. the number of input views, similar to online models, and reconstructs a 1k image collection in just 54 seconds, achieving an 11.6x speed-up over baselines that rely on softmax attention. Since our method retains global scene aggregation capability, our point map reconstruction outperforms other linear-time methods by large margins. Finally, we demonstrate visual localization capabilities of our model by querying the scene representation with unseen images.