Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models
Jiayi Guo, Linqing Wang, Jiangshan Wang, Yang Yue, Zeyu Liu, Zhiyuan Zhao, Qinglin Lu, Gao Huang, Chunyu Wang
2026-04-29
Summary
This paper introduces a new way for AI models to improve images they create from text descriptions, focusing on making the images more accurately match what the text asks for.
What's the problem?
Current AI models that refine images after initially creating them work by trying to 'edit' the existing image according to instructions. This approach isn't very precise: the instructions usually describe what's wrong only roughly, so problems get fixed incompletely, and the requirement to preserve the rest of the image unnecessarily limits how much the model can change to get the image right. It's like trying to fix a painting while only being allowed to make tiny adjustments instead of repainting whole sections.
What's the solution?
The researchers propose a new method called 'Refinement via Regeneration' (RvR). Instead of editing, the model recreates the image from scratch, guided by both the original text description *and* a compact 'summary' of the initial image (its semantic tokens). Because the model isn't forced to preserve every pixel of the original, it can make more significant and accurate changes, leading to a final image that matches the prompt more completely.
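To make the contrast with editing concrete, here is a minimal toy sketch of the regeneration loop. Everything below (`generate`, `encode_semantic_tokens`, `alignment_score`, the object-set image representation) is a hypothetical stand-in for a real unified multimodal model's components, not the paper's actual implementation:

```python
def encode_semantic_tokens(image):
    """Stand-in: compress an image into coarse semantic tokens
    (here, simply the sorted tuple of its labeled objects)."""
    return tuple(sorted(image["objects"]))

def generate(prompt, semantic_tokens=None):
    """Stand-in generator producing an image as a dict of objects.
    When semantic tokens from a previous attempt are supplied, the
    'model' keeps that content and adds any prompt objects that were
    missing -- mimicking regeneration's larger modification space."""
    wanted = set(prompt.split(", "))
    if semantic_tokens is None:
        # Initial generation: pretend the model drops one requested object.
        objects = sorted(wanted)[:-1]
    else:
        objects = sorted(set(semantic_tokens) | wanted)
    return {"objects": objects}

def alignment_score(prompt, image):
    """Fraction of the prompt's objects that appear in the image."""
    wanted = set(prompt.split(", "))
    return len(wanted & set(image["objects"])) / len(wanted)

def refine_via_regeneration(prompt, max_rounds=3):
    image = generate(prompt)
    for _ in range(max_rounds):
        if alignment_score(prompt, image) == 1.0:
            break
        tokens = encode_semantic_tokens(image)
        # Regenerate from scratch conditioned on prompt + semantic tokens,
        # rather than editing the pixels of the previous image.
        image = generate(prompt, semantic_tokens=tokens)
    return image

prompt = "red cube, blue sphere, green cone"
final = refine_via_regeneration(prompt)
print(alignment_score(prompt, final))  # 1.0
```

The key difference from an editing loop is that no pixel-preservation constraint is imposed: only the semantic summary of the previous attempt is carried forward, so the next attempt is free to change anything that conflicts with the prompt.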
Why it matters?
This research is important because it significantly improves the quality of images generated by AI. The new method achieves better results on standard tests, meaning AI can create images that more closely match what people intend, which is a crucial step towards more useful and reliable AI image generation.
Abstract
Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.