FARMER: Flow AutoRegressive Transformer over Pixels

Guangting Zheng, Qinyu Zhao, Tao Yang, Fei Xiao, Zhijie Lin, Jie Wu, Jiajun Deng, Yanyong Zhang, Rui Zhu

2025-10-28

Summary

This paper introduces FARMER, a new way to generate images directly from their pixel data using a combination of two types of machine learning models: Normalizing Flows and Autoregressive models. It aims to create high-quality images while also being able to accurately calculate how likely a given image is, something that's difficult with other methods.

What's the problem?

Creating realistic images with computers is hard, especially when working directly with the raw pixel data. Autoregressive models, which predict each element of a sequence from the ones before it, struggle with raw pixels for two reasons: an image unrolls into an extremely long sequence, and the continuous pixel values live in a high-dimensional space. Together these make training slow and generation inefficient.

What's the solution?

FARMER solves this by first transforming the image into a more manageable form using an invertible autoregressive flow. Because the flow is invertible, it maps the pixels to a latent sequence without losing information, and the latent sequence can be mapped back to an image. An autoregressive model then learns the distribution of this latent sequence. To reduce redundancy, the researchers propose a self-supervised dimension reduction scheme that partitions the flow's latent channels into informative and redundant groups, which makes the autoregressive modeling more effective and efficient. They also speed up image generation with a one-step distillation scheme and improve quality with a resampling-based classifier-free guidance algorithm.
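The idea of an invertible autoregressive flow with an exact likelihood can be illustrated with a toy sketch. This is not FARMER's implementation: the `conditioner` function is a hypothetical stand-in for a learned network, the input is a short list of numbers rather than image patches, and the latent distribution is assumed to be a fixed standard normal, whereas FARMER models it with an autoregressive model.

```python
import math

def conditioner(prefix):
    """Hypothetical stand-in for a learned network.

    Maps the already-seen values x[:i] to a (shift, log_scale) pair.
    """
    if not prefix:
        return 0.0, 0.0
    total = sum(prefix)
    shift = 0.5 * total / len(prefix)
    log_scale = 0.1 * math.tanh(total)
    return shift, log_scale

def forward(x):
    """Map data x to latents z; also return log|det dz/dx|."""
    z, log_det = [], 0.0
    for i, x_i in enumerate(x):
        shift, log_scale = conditioner(x[:i])
        z.append((x_i - shift) * math.exp(-log_scale))
        log_det -= log_scale  # the Jacobian is triangular, so log-det is a sum
    return z, log_det

def inverse(z):
    """Map latents z back to data x, one element at a time."""
    x = []
    for z_i in z:
        shift, log_scale = conditioner(x)  # depends only on recovered values
        x.append(z_i * math.exp(log_scale) + shift)
    return x

def log_likelihood(x):
    """Exact log p(x) by change of variables (standard normal prior assumed)."""
    z, log_det = forward(x)
    log_prior = sum(-0.5 * v * v - 0.5 * math.log(2 * math.pi) for v in z)
    return log_prior + log_det

x = [0.2, -1.0, 0.7, 1.5]
z, log_det = forward(x)
x_rec = inverse(z)
ll = log_likelihood(x)
```

Here `x_rec` matches `x` up to floating-point error, which is what invertibility means in practice. The forward pass can process all positions in parallel because every input is known, while the inverse must run sequentially; that asymmetry is one reason sampling from autoregressive flows tends to be slow.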

Why does it matter?

This research is important because it provides a new framework for generating images that is both high-quality and allows for precise calculation of image likelihoods. This is useful for a variety of applications, including creating realistic visual content, understanding how machine learning models 'see' images, and potentially improving other image-based AI systems. It also offers a more scalable approach to training these models, meaning they can handle larger and more complex images.

Abstract

Directly modeling the explicit likelihood of the raw data distribution is a key topic in machine learning, and autoregressive modeling of this kind underlies the scaling successes of Large Language Models. However, continuous AR modeling over visual pixel data suffers from extremely long sequences and high-dimensional spaces. In this paper, we present FARMER, a novel end-to-end generative framework that unifies Normalizing Flows (NF) and Autoregressive (AR) models for tractable likelihood estimation and high-quality image synthesis directly from raw pixels. FARMER employs an invertible autoregressive flow to transform images into latent sequences, whose distribution is modeled implicitly by an autoregressive model. To address the redundancy and complexity in pixel-level modeling, we propose a self-supervised dimension reduction scheme that partitions NF latent channels into informative and redundant groups, enabling more effective and efficient AR modeling. Furthermore, we design a one-step distillation scheme to significantly accelerate inference and introduce a resampling-based classifier-free guidance algorithm to boost image generation quality. Extensive experiments demonstrate that FARMER achieves competitive performance compared to existing pixel-based generative models while providing exact likelihoods and scalable training.
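
The tractable likelihood mentioned in the abstract follows from the standard change-of-variables identity for normalizing flows, combined with an autoregressive factorization of the latent distribution. The notation below is an assumption for illustration and is not taken from the paper: $f$ denotes the invertible flow, $z = f(x)$ the latent sequence of length $N$, and $p_{\mathrm{AR}}$ the autoregressive model over latents.

```latex
\log p(x) = \log p_{\mathrm{AR}}\bigl(f(x)\bigr)
          + \log \left| \det \frac{\partial f(x)}{\partial x} \right|,
\qquad
p_{\mathrm{AR}}(z) = \prod_{i=1}^{N} p\bigl(z_i \mid z_{<i}\bigr).
```

The first term scores the latent sequence and the second accounts for how the flow stretches or compresses volume. For an autoregressive flow the Jacobian is triangular, so its log-determinant reduces to a sum over diagonal entries.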
