Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats
Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, Zexiang Xu
2024-10-18

Summary
This paper presents Long-LRM, a new model designed to reconstruct large 3D scenes from a long sequence of images quickly and efficiently.
What's the problem?
Reconstructing 3D scenes from images is important for many applications, but existing feed-forward models can only handle a few input images at a time (typically 1 to 4), while optimization-based methods that can cover a whole scene are slow to run. As a result, neither approach can quickly produce a detailed, complete reconstruction of a large scene.
What's the solution?
The authors developed Long-LRM, which processes up to 32 high-resolution (960x540) images at once and reconstructs an entire large scene in just 1.3 seconds on a single A100 80G GPU. It achieves this by mixing the recent Mamba2 blocks, which scale well to long sequences, with classical transformer blocks, allowing the model to process far more tokens than prior work. They also add token merging and Gaussian pruning steps to balance speed and quality, as sketched in the code below. The result is that Long-LRM creates detailed 3D models much faster than previous methods.
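To make the idea concrete, here is a minimal PyTorch sketch of the two pieces named above: a token-merging step that shortens the sequence before the expensive layers, and a block stack that interleaves linear-time Mamba2 layers with a few full-attention transformer layers. The interleaving ratio, dimensions, merging scheme, and the use of the `mamba_ssm` package's `Mamba2` layer are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch of a hybrid Mamba2/transformer token mixer with token
# merging, loosely modeled on the paper's description. All hyperparameters
# here are illustrative assumptions.
import torch
import torch.nn as nn
from mamba_ssm import Mamba2  # assumed dependency providing a Mamba2 layer


def merge_tokens(feat, factor=2):
    """Fold each factor x factor neighborhood of per-view tokens into one
    token, shortening the sequence (the paper's exact scheme may differ).
    feat: (B, H, W, D) token grid -> (B, H/factor, W/factor, factor*factor*D)
    """
    B, H, W, D = feat.shape
    feat = feat.reshape(B, H // factor, factor, W // factor, factor, D)
    feat = feat.permute(0, 1, 3, 2, 4, 5)
    return feat.reshape(B, H // factor, W // factor, factor * factor * D)


class HybridBlockStack(nn.Module):
    """Mostly Mamba2 blocks, with a full-attention transformer block
    inserted every `attn_every` layers."""

    def __init__(self, dim=512, n_heads=8, n_blocks=8, attn_every=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        for i in range(n_blocks):
            if (i + 1) % attn_every == 0:
                # Classical transformer block: O(N^2) all-pairs attention
                self.blocks.append(nn.TransformerEncoderLayer(
                    d_model=dim, nhead=n_heads, batch_first=True))
            else:
                # Mamba2 block: linear-time mixing, cheap on long sequences
                self.blocks.append(Mamba2(d_model=dim))

    def forward(self, tokens):
        # tokens: (batch, seq_len, dim), e.g. merged tokens from all 32 views
        for blk in self.blocks:
            tokens = blk(tokens)
        return tokens
```

The design intuition is that Mamba2 layers cost linear time in sequence length, so most of the mixing stays cheap even when 32 high-resolution views are flattened into one very long token sequence, while the occasional attention layer retains the global all-pairs interactions that pure state-space stacks can miss.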
Why it matters?
This research is significant because it allows for faster and more efficient creation of 3D models, which can be used in fields like virtual reality, gaming, and robotics. By enabling the reconstruction of large scenes quickly, Long-LRM opens up new possibilities for creating immersive experiences and improving technologies that rely on detailed visual representations.
Abstract
We propose Long-LRM, a generalizable 3D Gaussian reconstruction model that is capable of reconstructing a large scene from a long sequence of input images. Specifically, our model can process 32 source images at 960x540 resolution within only 1.3 seconds on a single A100 80G GPU. Our architecture features a mixture of the recent Mamba2 blocks and the classical transformer blocks, which allows many more tokens to be processed than prior work, enhanced by efficient token merging and Gaussian pruning steps that balance quality and efficiency. Unlike previous feed-forward models that are limited to processing 1-4 input images and can only reconstruct a small portion of a large scene, Long-LRM reconstructs the entire scene in a single feed-forward step. On large-scale scene datasets such as DL3DV-140 and Tanks and Temples, our method achieves performance comparable to optimization-based approaches while being two orders of magnitude more efficient. Project page: https://arthurhero.github.io/projects/llrm
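As one concrete illustration of the Gaussian pruning step mentioned in the abstract, the sketch below drops near-transparent Gaussians by thresholding their opacity. The threshold value and the flat per-attribute tensor layout are assumptions; the paper's exact pruning criterion may differ.

```python
# Sketch of opacity-based Gaussian pruning: one plausible reading of the
# paper's "Gaussian pruning" step. Threshold and layout are assumptions.
import torch


def prune_gaussians(means, scales, rotations, opacities, colors,
                    opacity_min=0.01):
    """Drop Gaussians whose opacity falls below opacity_min.

    means:     (N, 3) Gaussian centers
    scales:    (N, 3) per-axis scales
    rotations: (N, 4) quaternions
    opacities: (N,)   opacity in [0, 1]
    colors:    (N, C) color / SH coefficients
    """
    keep = opacities > opacity_min  # (N,) boolean mask of survivors
    return (means[keep], scales[keep], rotations[keep],
            opacities[keep], colors[keep])


# Example: prune a random batch of 1,000 Gaussians
N = 1000
pruned = prune_gaussians(torch.randn(N, 3), torch.rand(N, 3),
                         torch.randn(N, 4), torch.rand(N), torch.rand(N, 3))
```

Pruning matters at this scale: if the model emits roughly one Gaussian per input pixel (an assumption about the output layout), 32 views at 960x540 would yield about 16.6 million Gaussians before pruning, so discarding near-transparent ones is what keeps memory and rendering cost manageable.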