Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing
Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai
2026-01-09
Summary
This paper focuses on improving how AI models generate and edit images from interleaved text and image prompts supplied together, a process called in-context image generation and editing (ICGE).
What's the problem?
Current AI models are really good at *understanding* what you want when you give them images and text together, but they struggle to actually *create* or *edit* images to match that understanding accurately. They get confused about which parts of the reference images are important and how to combine the text instructions with the visual information, leading to images that don't quite match what the user intended.
What's the solution?
The researchers developed a new system called Re-Align. It breaks the process into clear steps, almost like thinking through the problem before drawing. The first step is a technique called In-Context Chain-of-Thought (IC-CoT), which separates the overall goal (what the text asks for) from the role each reference image plays, so the model doesn't mix up which image contributes what. Then, a reinforcement learning stage rewards the AI when the generated image closely matches its reasoning steps, ensuring better alignment between the idea and the final image.
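The reward-based training idea above can be sketched in miniature. This is a hedged illustration, not the paper's actual implementation: it assumes the surrogate reward is a similarity score between an embedding of the reasoning text and an embedding of the generated image (the paper does not publish this exact formula here), and it uses a simple mean-baseline advantage as the RL weighting. All function names are illustrative.

```python
import numpy as np

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def surrogate_reward(reasoning_embedding, image_embedding):
    # Hypothetical surrogate reward in [0, 1]: higher when the generated
    # image's embedding aligns with the structured reasoning text's embedding.
    return 0.5 * (cosine_similarity(reasoning_embedding, image_embedding) + 1.0)

def reinforce_weights(rewards):
    # Mean-baseline advantage: generations scoring above the batch average
    # are reinforced (positive weight), the rest are discouraged.
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

# Toy usage: one well-aligned and one poorly aligned generation.
text_emb = np.array([1.0, 0.0])
good_img, bad_img = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
rewards = [surrogate_reward(text_emb, good_img),
           surrogate_reward(text_emb, bad_img)]
weights = reinforce_weights(rewards)  # good generation gets positive weight
```

In a real system the embeddings would come from a pretrained vision-language encoder; the point here is only the shape of the loop: score each generation against the reasoning, then weight the policy update by the relative reward.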
Why it matters?
This research is important because it makes AI image generation and editing much more reliable and accurate. It means you can give an AI a few examples and some instructions, and it will be better at creating or modifying images exactly as you envision, opening up possibilities for more creative control and practical applications.
Abstract
In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance from reference association, providing a clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between the structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.
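To make the IC-CoT decoupling concrete, here is a minimal sketch of what a structured reasoning record separating "semantic guidance" from "reference association" might look like. The field names and rendering format are hypothetical illustrations, not the paper's actual schema.

```python
# Hypothetical IC-CoT-style record: the textual target is kept separate
# from the role assigned to each reference image, so the generator is not
# left to guess which reference contributes which attribute.
ic_cot = {
    "semantic_guidance": "A red sports car parked on a beach at sunset.",
    "reference_association": {
        "image_1": "provides the car's shape and color",
        "image_2": "provides the beach background and lighting",
    },
}

def render_reasoning(record):
    # Flatten the structured reasoning into a single prompt string,
    # keeping the target on its own line ahead of per-reference roles.
    lines = ["Target: " + record["semantic_guidance"]]
    for ref, role in sorted(record["reference_association"].items()):
        lines.append(f"{ref}: {role}")
    return "\n".join(lines)
```

The design point is that an unstructured caption would blur these two kinds of information together, whereas the explicit split gives the RL reward something well-defined to check the generated image against.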