
Multi-view Pyramid Transformer: Look Coarser to See Broader

Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park

2025-12-09

Summary

This paper introduces a system called Multi-view Pyramid Transformer, or MVP, which quickly and accurately reconstructs 3D models of large scenes from tens to hundreds of images taken from different viewpoints, all in a single forward pass.

What's the problem?

Creating detailed 3D models from a large number of images is computationally expensive. Existing methods often struggle with scenes that contain lots of detail or a large number of viewpoints: they either take a long time or fail to produce high-quality results. The challenge is to process all of that visual information efficiently enough to build a complete and accurate 3D representation.

What's the solution?

MVP tackles this problem with a clever two-part design. First, it organizes attention across views in a local-to-global way: the model starts by looking at small groups of images and gradually widens its perspective to larger groups and eventually the entire scene. Second, within each image it works fine-to-coarse, starting from detailed spatial tokens and progressively aggregating them into fewer, information-dense ones. This combination lets MVP process many images quickly while still capturing important details, and it works especially well when combined with 3D Gaussian Splatting to actually create the 3D model (a rough sketch of the dual hierarchy follows below).
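To make the dual hierarchy concrete, here is a minimal PyTorch sketch, not the authors' code: the `MVPStage`/`MVPSketch` names, the group-size schedule `(2, 8, 32)`, and the stride-2 pooling are illustrative assumptions standing in for the paper's actual blocks and hyperparameters.

```python
import torch
import torch.nn as nn

class MVPStage(nn.Module):
    """One stage: attend jointly within a group of views, then coarsen tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.pool = nn.Conv1d(dim, dim, kernel_size=2, stride=2)  # fine -> coarse

    def forward(self, x: torch.Tensor, group_size: int) -> torch.Tensor:
        # x: (batch, views, tokens_per_view, dim); assumes views % group_size == 0
        B, V, N, D = x.shape
        G = V // group_size
        # Inter-view (local-to-global): attention over all tokens in a view group.
        x = x.reshape(B * G, group_size * N, D)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        # Intra-view (fine-to-coarse): halve the token count of every view.
        x = x.reshape(B * V, N, D).transpose(1, 2)   # (B*V, D, N)
        x = self.pool(x).transpose(1, 2)             # (B*V, N//2, D)
        return x.reshape(B, V, N // 2, D)

class MVPSketch(nn.Module):
    """Stacks stages so attention widens while tokens per view shrink."""
    def __init__(self, dim: int = 256, schedule=(2, 8, 32)):
        super().__init__()
        self.schedule = schedule  # views attended to jointly at each stage
        self.stages = nn.ModuleList([MVPStage(dim) for _ in schedule])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for stage, group_size in zip(self.stages, self.schedule):
            x = stage(x, group_size)
        return x  # compact scene tokens, e.g. for a 3D Gaussian prediction head

tokens = torch.randn(1, 32, 64, 256)    # 32 views, 64 patch tokens per view
print(MVPSketch()(tokens).shape)        # torch.Size([1, 32, 8, 256])
```

The property the sketch tries to capture is the trade-off at the heart of the design: each stage attends over a wider group of views, but every view carries half as many (and therefore denser) tokens, so even the final scene-wide stage stays affordable.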

Why it matters?

This research is important because it makes creating realistic 3D models much faster and more efficient. This has implications for many fields, like virtual reality, robotics, and creating digital twins of real-world environments. Being able to quickly reconstruct scenes from images opens up possibilities for more immersive experiences and better understanding of the world around us.

Abstract

We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
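On the 3D Gaussian Splatting coupling: a common pattern in generalizable 3DGS work is to decode each output token into the parameters of one Gaussian. The sketch below follows that pattern as an assumption, not the paper's actual decoder; the `GaussianHead` name and the 14-way parameter split are hypothetical.

```python
import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    # 3 (mean) + 3 (log-scale) + 4 (quaternion) + 1 (opacity) + 3 (RGB) = 14
    PARAMS = 14

    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, self.PARAMS)

    def forward(self, tokens: torch.Tensor) -> dict:
        p = self.proj(tokens)  # (..., 14): one Gaussian per scene token
        return {
            "mean": p[..., 0:3],
            "scale": p[..., 3:6].exp(),                               # positive
            "rotation": nn.functional.normalize(p[..., 6:10], dim=-1),  # unit quat
            "opacity": p[..., 10:11].sigmoid(),
            "rgb": p[..., 11:14].sigmoid(),
        }

head = GaussianHead()
gaussians = head(torch.randn(1, 32, 8, 256))  # tokens shaped as in the sketch above
print(gaussians["mean"].shape)                # torch.Size([1, 32, 8, 3])
```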