Heptapod: Language Modeling on Visual Signals
Yongxin Zhu, Jiawei Chen, Yuanzhe Chen, Zhuo Chen, Dongya Jia, Jian Cong, Xiaobin Zhuang, Yuping Wang, Yuxuan Wang
2025-10-09
Summary
This paper introduces Heptapod, a new type of image generation model that takes inspiration from how language models work, but applies it to creating pictures instead of text.
What's the problem?
Existing image generation models often rely on complicated machinery, such as classifier-free guidance at sampling time or semantic tokenizers that break images into abstract pieces. Many also fail to capture the overall meaning and structure of an image during training, leading to less realistic or coherent results. Previous attempts to build image models that follow the plain language-modeling recipe haven't performed very well.
What's the solution?
The researchers created Heptapod, which generates an image one token at a time, similar to how a language model predicts the next word in a sentence. The twist is that at each step, instead of predicting only the single next token, the model predicts a distribution over the entire 2D spatial grid of the image. A reconstruction-focused 'visual tokenizer' converts the image into tokens for processing. This objective combines the benefits of sequential prediction with learning the overall image structure, allowing the model to understand and generate images more effectively.
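The idea above can be sketched at the level of tensor shapes. The following is a minimal toy illustration, not the paper's actual architecture: a single-head causal self-attention layer feeds a prediction head that, at every timestep, emits logits over the entire H x W grid of positions rather than just the next one. All dimensions and weight initializations here are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(x, rng):
    """One toy causal self-attention layer. x: (T, d)."""
    T, d = x.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: each timestep may only attend to itself and the past.
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf
    return softmax(scores, axis=-1) @ v

def next_2d_distribution(x, H, W, vocab, rng):
    """At each of T timesteps, predict a distribution over all H*W grid
    positions and `vocab` codes: output shape (T, H*W, vocab)."""
    T, d = x.shape
    h = causal_attention(x, rng)
    W_head = rng.standard_normal((d, H * W * vocab)) / np.sqrt(d)
    logits = (h @ W_head).reshape(T, H * W, vocab)
    return softmax(logits, axis=-1)

rng = np.random.default_rng(0)
T, d, H, W, vocab = 4, 8, 2, 2, 16
x = rng.standard_normal((T, d))
probs = next_2d_distribution(x, H, W, vocab, rng)
print(probs.shape)  # (4, 4, 16): each step predicts the whole 2x2 grid
```

The point of the shape `(T, H*W, vocab)` is that the training signal at every step covers the full image grid, which is what lets a sequential (causal) model also learn holistic image structure.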
Why it matters?
Heptapod achieves an FID of 2.70 on the ImageNet generation benchmark, clearly outperforming previous causal autoregressive image models. More importantly, it suggests a more principled way to approach image generation by applying the techniques that made language modeling successful, potentially leading to further advances in the field and beyond just images.
Abstract
We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs causal attention, eliminates reliance on CFG, and eschews the trend of semantic tokenizers. Our key innovation is next 2D distribution prediction: a causal Transformer with a reconstruction-focused visual tokenizer learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of the autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of 2.70, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.