
GaussianGPT: Towards Autoregressive 3D Gaussian Scene Generation

Nicolas von Lützow, Barbara Rössle, Katharina Schmid, Matthias Nießner

2026-04-02


Summary

This paper introduces GaussianGPT, an AI model that generates 3D scenes in a new way. Unlike current methods, which start from noise and gradually refine it, GaussianGPT builds a scene step-by-step, like writing a story one word at a time.

What's the problem?

Existing methods for generating 3D models often rely on complex processes like diffusion or flow-matching, which can be computationally expensive and sometimes lack precise control over the final result. These methods also struggle with tasks like completing a partially created scene or easily extending an existing one.

What's the solution?

The researchers developed GaussianGPT, which uses a transformer, the architecture behind large language models, to directly generate scenes represented as 3D Gaussians, the explicit primitives used in modern Gaussian-splatting rendering. They first compress the Gaussian primitives into a discrete grid of 'tokens' using a sparse 3D convolutional autoencoder with vector quantization. A causal transformer then predicts the next token in the sequence, building the 3D scene progressively. A 3D rotary positional embedding encodes each token's spatial location, helping the transformer understand spatial relationships.
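To give a flavor of the positional encoding, here is a minimal sketch of a 3D rotary embedding: split the feature dimension into three groups and rotate each group by one spatial coordinate. This is a generic illustration, not the paper's code; the function names and the even three-way split are assumptions.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Standard rotary embedding along one axis.

    x: array of shape (..., d) with d even; pos: position along that axis.
    Rotates feature pairs (x[i], x[i + d/2]) by angles pos * base**(-i/(d/2)).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def rope_3d(x, xyz, base=10000.0):
    """Hypothetical 3D rotary embedding: one feature group per axis."""
    d = x.shape[-1]
    assert d % 6 == 0, "need d divisible by 6 (3 axes, each with even dim)"
    g = d // 3
    parts = [rope_1d(x[..., i * g:(i + 1) * g], xyz[i], base)
             for i in range(3)]
    return np.concatenate(parts, axis=-1)
```

Like 1D RoPE, this rotation preserves vector norms and makes attention scores depend on relative, rather than absolute, token positions along each axis.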

Why it matters?

This new method offers a more controllable and flexible way to create 3D content. Because it builds scenes sequentially, it's easier to complete unfinished models, extend existing scenes, and adjust the level of detail. It also opens up possibilities for more creative control and could be a valuable tool for artists and designers working with 3D graphics.

Abstract

Most recent advances in 3D generative modeling rely on diffusion or flow-matching formulations. We instead explore a fully autoregressive alternative and introduce GaussianGPT, a transformer-based model that directly generates 3D Gaussians via next-token prediction, thus facilitating full 3D scene generation. We first compress Gaussian primitives into a discrete latent grid using a sparse 3D convolutional autoencoder with vector quantization. The resulting tokens are serialized and modeled using a causal transformer with 3D rotary positional embedding, enabling sequential generation of spatial structure and appearance. Unlike diffusion-based methods that refine scenes holistically, our formulation constructs scenes step-by-step, naturally supporting completion, outpainting, controllable sampling via temperature, and flexible generation horizons. This formulation leverages the compositional inductive biases and scalability of autoregressive modeling while operating on explicit representations compatible with modern neural rendering pipelines, positioning autoregressive transformers as a complementary paradigm for controllable and context-aware 3D generation.
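The abstract mentions controllable sampling via temperature. A minimal, generic sketch of how temperature controls next-token sampling (not the paper's implementation; the function name is illustrative):

```python
import numpy as np

def sample_token(logits, temperature=1.0, rng=None):
    """Sample a token index from next-token logits.

    Low temperature concentrates probability on the highest-scoring token
    (approaching argmax); high temperature flattens the distribution,
    increasing diversity in the generated scene.
    """
    rng = rng or np.random.default_rng()
    z = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)
    z -= z.max()                 # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return rng.choice(len(p), p=p)
```

In an autoregressive loop, this is called once per step, and the sampled token is appended to the context before predicting the next one, which is what makes completion, outpainting, and flexible generation horizons natural in this formulation.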