VA-π: Variational Policy Alignment for Pixel-Aware Autoregressive Generation
Xinyao Liao, Qiyuan He, Kai Xu, Xiaoye Qu, Yicong Li, Wei Wei, Angela Yao
2025-12-26
Summary
This paper addresses a problem with how images are created by a class of artificial intelligence models called autoregressive models. These models break an image down into a sequence of discrete pieces called tokens and generate new images token by token, but there is a disconnect between how those tokens are initially learned and how the model actually uses them to create images.
What's the problem?
Autoregressive image generators rely on 'tokenizers' to convert images into a sequence of tokens and back again. The tokenizer is trained to reconstruct *clean* images from ground-truth tokens, but the generator is only trained to predict the most likely next token. This means the generator can produce a token sequence that is statistically plausible yet decodes into a poor-looking image, because the generator never directly considered how realistic the final image would be. Essentially, the generator gets feedback only on the token sequence, never on the actual image quality.
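To make the mismatch concrete, here is a minimal PyTorch sketch of the standard AR training loop, with toy stand-ins for the tokenizer output and the generator (all names and shapes are illustrative, not the paper's code). The loss is pure token cross-entropy, so no gradient ever flows from the decoded pixels:

```python
# Minimal sketch of standard AR training on tokens (illustrative only).
import torch
import torch.nn.functional as F

V, T = 512, 16                                   # toy vocab size and sequence length
tokens = torch.randint(0, V, (8, T))             # tokenizer output for a batch of images
embed = torch.nn.Embedding(V, 64)                # toy generator: embedding + linear head
head = torch.nn.Linear(64, V)

hidden = embed(tokens[:, :-1])                   # condition on the ground-truth prefix
logits = head(hidden)                            # predict the next token at each position
loss = F.cross_entropy(logits.reshape(-1, V), tokens[:, 1:].reshape(-1))
loss.backward()                                  # gradient carries no pixel-space signal
```

Nothing in this objective ever compares a decoded image against the original, which is exactly the gap VA-π targets.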
What's the solution?
The researchers developed a new framework called VA-π that adds a direct link between the generated tokens and how good the resulting image looks. They treat the image generator as a decision-maker (a 'policy') and give it a 'reward' based on how well its predicted tokens reconstruct the original image: the closer the decoded image is to the original, the higher the reward, which steers the generator toward better-looking images. They also derive a mathematical bound called an 'evidence lower bound' (ELBO), whose regularization term keeps the generated tokens consistent with what the tokenizer originally learned, so they don't drift. Importantly, this method requires neither retraining the tokenizer nor any external reward models.
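As a rough illustration, the sketch below mimics one VA-π-style update under the assumptions that a frozen toy 'decoder' maps tokens back to pixels and a toy head plays the generator; `decode`, `gen_head`, and the 0.1 regularizer weight are all hypothetical, and the autoregressive prefix conditioning is collapsed into a single forward pass for brevity:

```python
# Hypothetical sketch of a reward-guided alignment step (not the authors' code).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, T, D = 512, 16, 256                                # toy vocab, sequence length, pixel dim

codebook = torch.randn(V, D // T)                     # frozen stand-in for the tokenizer decoder
gen_head = torch.nn.Linear(D, T * V)                  # toy generator head (prefix conditioning omitted)

def decode(tokens):                                   # tokens (B, T) -> "pixels" (B, D)
    return codebook[tokens].reshape(tokens.shape[0], -1)

x = torch.randn(8, D)                                 # batch of "images"
gt_tokens = torch.randint(0, V, (8, T))               # ground-truth tokens from the tokenizer

logits = gen_head(x).reshape(8, T, V)                 # teacher-forced next-token logits
dist = torch.distributions.Categorical(logits=logits)
sampled = dist.sample()                               # predicted tokens

# Intrinsic reward: negative pixel reconstruction error of the sampled tokens.
reward = -((decode(sampled) - x) ** 2).mean(dim=1)    # (B,)

# REINFORCE-style policy-gradient term, plus the ELBO's regularizer pulling the
# policy toward the tokenizer's ground-truth token distribution.
logp = dist.log_prob(sampled).sum(dim=1)              # (B,)
pg_loss = -(reward.detach() * logp).mean()
reg_loss = F.cross_entropy(logits.reshape(-1, V), gt_tokens.reshape(-1))
loss = pg_loss + 0.1 * reg_loss                       # 0.1 is an assumed weight
loss.backward()
```

In the paper the reward is computed under teacher forcing, i.e. the generator predicts each token from the ground-truth prefix, which is what lets it get pixel-level guidance without expensive free-running sampling.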
Why it matters?
This work matters because it significantly improves the quality of images generated by autoregressive models at very low cost: with only 1% of ImageNet-1K and about 25 minutes of tuning, FID on LlamaGen-XXL drops from 14.36 to 7.65. This means we can get more realistic and detailed images from AI, with implications for applications like art, design, and even scientific visualization.
Abstract
Autoregressive (AR) visual generation relies on tokenizers to map images to and from discrete sequences. However, tokenizers are trained to reconstruct clean images from ground-truth tokens, while AR generators are optimized only for token likelihood. This misalignment means generated token sequences may decode into low-quality images, with no direct supervision from the pixel space. We propose VA-π, a lightweight post-training framework that directly optimizes AR models with a principled pixel-space objective. VA-π formulates generator-tokenizer alignment as a variational optimization, deriving an evidence lower bound (ELBO) that unifies pixel reconstruction and autoregressive modeling. To optimize over the discrete token space, VA-π introduces a reinforcement-based alignment strategy that treats the AR generator as a policy and uses pixel-space reconstruction quality as its intrinsic reward. The reward measures how well the predicted token sequences reconstruct the original image under teacher forcing, giving the model direct pixel-level guidance without expensive free-running sampling. The regularization term of the ELBO naturally maintains the distributional consistency of the tokens. VA-π enables rapid adaptation of existing AR generators, requiring neither tokenizer retraining nor external reward models. With only 1% of the ImageNet-1K data and 25 minutes of tuning, it reduces FID from 14.36 to 7.65 and improves IS from 86.55 to 116.70 on LlamaGen-XXL, while also yielding notable gains on the GenEval text-to-image benchmark for both a visual generation model (LlamaGen: from 0.306 to 0.339) and a unified multi-modal model (Janus-Pro: from 0.725 to 0.744). Code is available at https://github.com/Lil-Shake/VA-Pi.
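The abstract does not write the bound out; one standard form consistent with its description, assuming the tokenizer encoder $q(z \mid x)$ over token sequences $z$, the AR generator $p_\theta(z)$ as the prior, and the tokenizer decoder $p(x \mid z)$, would be:

```latex
\log p(x) \;\geq\;
\underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{pixel reconstruction}}
\;-\;
\underbrace{\mathrm{KL}\big(q(z \mid x)\,\big\|\,p_\theta(z)\big)}_{\text{token consistency}}
```

Maximizing the first term rewards token sequences that decode back to the original image, while the second term keeps the AR generator's token distribution close to the tokenizer's, matching the two roles the abstract assigns to the ELBO.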