AR-RAG: Autoregressive Retrieval Augmentation for Image Generation

Jingyuan Qi, Zhiyang Xu, Qifan Wang, Lifu Huang

2025-06-17

Summary

This paper introduces AR-RAG, a method that improves image generation by retrieving small image regions, called patches, while the image is being generated. Instead of retrieving whole images once before generation starts, AR-RAG retrieves patches at each step, using the patches already generated as the query to find the most relevant visual references. This lets the model produce more accurate and coherent images by adapting its references to local details as generation proceeds.

What's the problem?

Earlier retrieval-augmented methods perform a single, static retrieval step that fetches whole reference images before generation begins. This can lead to over-copying from the references, carrying over irrelevant styles, or inserting elements that don’t fit the scene. Because the retrieval is fixed, these methods cannot adjust as the image takes shape and often miss important local details, which lowers image quality and flexibility.

What's the solution?

AR-RAG builds a large database of small image patches, each encoded together with its surrounding context, and performs a retrieval at every step of generation. The patches already generated serve as the query for retrieving the most similar patches from the database, and the retrieved information is merged with the model's own predictions to guide the next patch. The authors propose two frameworks: one combines the predicted and retrieved patch information during generation without any retraining, and the other fine-tunes the model to make better use of the retrieved information. As a result, the retrieved references can change as the image develops instead of being fixed at the start.
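
The retrieve-then-merge step described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes each database entry is a pair of a context embedding (the key) and the id of the patch token that followed that context (the value), that similarity is Euclidean distance, that retrieved neighbors are weighted by a softmax over negative distance, and that the merge is a linear interpolation with a weight `lam`. All function names, the toy data, and the parameters `k`, `temperature`, and `lam` are hypothetical.

```python
import math

def retrieve_patches(query, keys, values, k):
    """Return (distance, token_id) for the k database entries nearest to query."""
    scored = []
    for key, token_id in zip(keys, values):
        dist = math.sqrt(sum((q - x) ** 2 for q, x in zip(query, key)))
        scored.append((dist, token_id))
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]

def retrieval_distribution(neighbors, vocab_size, temperature=1.0):
    """Turn neighbors into a distribution over patch token ids.

    Assumption: closer neighbors get more weight via softmax(-distance).
    """
    weights = [math.exp(-dist / temperature) for dist, _ in neighbors]
    total = sum(weights)
    probs = [0.0] * vocab_size
    for weight, (_, token_id) in zip(weights, neighbors):
        probs[token_id] += weight / total
    return probs

def merge_distributions(model_probs, retrieved_probs, lam):
    """Assumed merge rule: linear interpolation of the two distributions."""
    return [(1 - lam) * p + lam * r for p, r in zip(model_probs, retrieved_probs)]

# Toy database: 2-D context embeddings and the patch token that followed each.
keys = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
values = [2, 2, 0]

# Query built from the patches generated so far (here a made-up embedding).
query = [0.0, 0.0]
model_probs = [0.5, 0.3, 0.2]  # the model's own next-patch prediction

neighbors = retrieve_patches(query, keys, values, k=2)
retrieved_probs = retrieval_distribution(neighbors, vocab_size=3)
merged = merge_distributions(model_probs, retrieved_probs, lam=0.5)
next_token = max(range(len(merged)), key=lambda i: merged[i])
```

In a full generation loop, this step would repeat for every patch, with the query recomputed from the newly extended set of generated patches. Merging at the distribution level needs no retraining, which loosely matches the first of the two frameworks. The fine-tuned framework would need trained components that this sketch does not cover.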

Why does it matter?

AR-RAG produces better and more realistic images by attending to fine-grained visual details and updating its retrievals as the image forms. Because it avoids over-copying and irrelevant style transfer, generation becomes more flexible and accurate. This can benefit applications such as art creation, design, and other AI systems that need high-quality image synthesis.

Abstract

Autoregressive Retrieval Augmentation enhances image generation through context-aware patch-level retrievals, improving performance over existing methods.