GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
Roni Itkin, Noam Issachar, Yehonatan Keypur, Anpei Chen, Sagie Benaim
2026-04-17
Summary
This paper introduces GlobalSplat, a method for reconstructing 3D scenes from multiple images that makes the process more efficient while producing higher-quality results.
What's the problem?
Existing methods for creating 3D models from images struggle to balance three goals: keeping the model small, reconstructing the scene quickly, and making the rendered images look realistic. Many current approaches work one viewpoint at a time, placing a 3D element (a Gaussian) for every pixel visible in each picture. This bakes a lot of duplication into the model and makes it hard to stay consistent as more viewpoints are added, leading to large file sizes and errors in the final reconstruction.
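To make the scaling problem concrete, here is a back-of-the-envelope sketch. The one-Gaussian-per-pixel assumption and the 256x256 resolution are illustrative choices of ours; the 16K fixed budget comes from the paper's abstract.

```python
def pixel_aligned_count(num_views, height, width):
    """Pixel-aligned feed-forward methods: one Gaussian per pixel per
    input view, so the primitive count grows linearly with views."""
    return num_views * height * width

def global_budget_count(budget=16_000):
    """A fixed, view-independent budget of primitives (the 16K figure
    quoted in the abstract)."""
    return budget

# At 256x256 resolution, two views already need ~131K Gaussians,
# and eight views push past half a million, while a global budget
# stays constant no matter how many views are added.
print(pixel_aligned_count(2, 256, 256))   # 131072
print(pixel_aligned_count(8, 256, 256))   # 524288
print(global_budget_count())              # 16000
```

The point of the comparison is not the exact numbers but the trend: view-aligned allocation couples model size to input count, while a global budget decouples them.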
What's the solution?
GlobalSplat takes a different approach: it first learns a compact, global understanding of the entire scene from all the input images *before* building any explicit 3D geometry, working out how the views relate to one another up front. This 'align first, decode later' strategy avoids the redundancy of per-view methods and allows a much more efficient and consistent 3D reconstruction. The authors also use a training curriculum that gradually increases the number of decoded Gaussians, which keeps the model from growing unnecessarily large.
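The two-stage flow and the curriculum can be sketched in miniature. Everything here is a placeholder of ours: the paper's actual stages are learned networks over global scene tokens, and the linear capacity schedule, token count, and start/end budgets are guesses, not the authors' settings.

```python
def encode_to_global_tokens(views, num_tokens=512):
    """Stage 1 (placeholder): fuse all input views into a fixed-size set
    of global scene tokens, resolving cross-view correspondences before
    any explicit 3D geometry exists."""
    return [f"token_{i}" for i in range(num_tokens)]  # stand-in for learned latents

def decode_gaussians(tokens, capacity):
    """Stage 2 (placeholder): decode `capacity` Gaussians from the global
    tokens -- the count is set by the budget, not by the number of views."""
    return [f"gaussian_{i}" for i in range(capacity)]

def decoded_capacity(step, total_steps, start=1_000, end=16_000):
    """Coarse-to-fine curriculum: grow the decoded Gaussian budget over
    training. Linear growth is our assumption; the actual schedule is
    not specified in this summary."""
    frac = min(step / total_steps, 1.0)
    return int(start + frac * (end - start))

# The decoded Gaussian count depends on the curriculum step,
# not on how many input views were encoded.
tokens = encode_to_global_tokens(["view_a", "view_b", "view_c"])
early = decode_gaussians(tokens, decoded_capacity(0, 100))    # 1000 Gaussians
final = decode_gaussians(tokens, decoded_capacity(100, 100))  # 16000 Gaussians
```

The design choice the sketch highlights is the ordering: cross-view alignment happens entirely in latent space, so the decoder can emit a small, globally consistent set of primitives instead of per-view duplicates.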
Why it matters?
This research is important because it allows for the creation of detailed 3D models with significantly fewer 3D elements (Gaussians) than previous methods, resulting in much smaller file sizes and faster rendering times. This makes it more practical to create and share complex 3D scenes, and opens up possibilities for applications where speed and efficiency are critical.
Abstract
The efficient spatial allocation of primitives serves as the foundation of 3D Gaussian Splatting, as it directly dictates the synergy between representation compactness, reconstruction speed, and rendering fidelity. Previous solutions, whether based on iterative optimization or feed-forward inference, suffer from significant trade-offs between these goals, mainly due to the reliance on local, heuristic-driven allocation strategies that lack global scene awareness. Specifically, current feed-forward methods are largely pixel-aligned or voxel-aligned. By unprojecting pixels into dense, view-aligned primitives, they bake redundancy into the 3D asset. As more input views are added, the representation size increases and global consistency becomes fragile. To this end, we introduce GlobalSplat, a framework built on the principle of align first, decode later. Our approach learns a compact, global, latent scene representation that encodes multi-view input and resolves cross-view correspondences before decoding any explicit 3D geometry. Crucially, this formulation enables compact, globally consistent reconstructions without relying on pretrained pixel-prediction backbones or reusing latent features from dense baselines. Utilizing a coarse-to-fine training curriculum that gradually increases decoded capacity, GlobalSplat natively prevents representation bloat. On RealEstate10K and ACID, our model achieves competitive novel-view synthesis performance while utilizing as few as 16K Gaussians, significantly fewer than dense pipelines require, yielding a light 4 MB footprint. Further, GlobalSplat enables significantly faster inference than the baselines, completing a single forward pass in under 78 milliseconds. The project page is available at https://r-itk.github.io/globalsplat/