The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment
Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou
2025-12-02
Summary
This paper introduces ImageCritic, a new system designed to improve the quality of AI-generated images, focusing on making sure fine-grained details stay consistent with a reference image and accurate.
What's the problem?
Current AI image generators often struggle to preserve consistent, fine-grained details when asked to make changes based on a reference image. They may add or alter elements in ways that don't match the original or don't make logical sense, producing inconsistencies and inaccuracies in the generated image.
What's the solution?
The researchers created a special dataset of reference, degraded, and target image triplets to train the AI to recognize and fix these inconsistencies. By studying the model's 'attention' (where it looks when generating) and how it internally represents information, they pinpointed where mistakes arise. They then devised an attention alignment loss to correct those attention patterns and a detail encoder to better capture fine details, allowing ImageCritic to automatically detect and fix problems through multiple rounds of editing focused on the specific areas that need improvement.
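The paper does not give the exact form of its attention alignment loss, but the general idea of penalizing the distance between a model's attention maps and reference-derived target maps can be sketched as follows (a minimal NumPy illustration with assumed shapes and a simple MSE penalty, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_map(queries, keys):
    # scaled dot-product attention weights; the value projection is
    # irrelevant for an alignment loss on the weights themselves
    d = queries.shape[-1]
    return softmax(queries @ keys.T / np.sqrt(d))

def attention_alignment_loss(gen_attn, ref_attn):
    # mean squared error between the attention over the degraded image
    # and a target attention derived from the reference image
    return float(np.mean((gen_attn - ref_attn) ** 2))
```

Identical maps give zero loss, while drifting attention is penalized in proportion to how far it strays from the reference-derived target.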
Why it matters?
This work is important because it directly addresses a major weakness in current AI image generation technology. By improving the consistency and accuracy of details, ImageCritic helps create more realistic and reliable images, making these AI tools more useful for a wider range of applications like design, art, and potentially even scientific visualization.
Abstract
Previous works have explored various customized generation tasks given a reference image, but they still face limitations in generating consistent fine-grained details. In this paper, our aim is to solve the inconsistency problem of generated images by applying a reference-guided post-editing approach and present our ImageCritic. We first construct a dataset of reference-degraded-target triplets obtained via VLM-based selection and explicit degradation, which effectively simulates the common inaccuracies or inconsistencies observed in existing generation models. Furthermore, building on a thorough examination of the model's attention mechanisms and intrinsic representations, we accordingly devise an attention alignment loss and a detail encoder to precisely rectify inconsistencies. ImageCritic can be integrated into an agent framework to automatically detect inconsistencies and correct them with multi-round and local editing in complex scenarios. Extensive experiments demonstrate that ImageCritic can effectively resolve detail-related issues in various customized generation scenarios, providing significant improvements over existing methods.
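The agent framework described in the abstract, which detects inconsistencies and corrects them with multi-round, local editing, can be sketched as a simple loop. The `detect` and `correct` callables here are hypothetical placeholders standing in for the paper's VLM-based detector and the ImageCritic editing model:

```python
def refine(image, reference, detect, correct, max_rounds=3):
    """Multi-round local correction loop (illustrative sketch).

    detect(image, reference) -> list of inconsistent regions (e.g. masks);
    correct(image, reference, region) -> image with that region locally edited.
    """
    for _ in range(max_rounds):
        regions = detect(image, reference)
        if not regions:
            break  # no remaining inconsistencies
        for region in regions:
            image = correct(image, reference, region)
    return image
```

The loop terminates either when the detector finds no remaining inconsistencies or after a fixed round budget, mirroring the multi-round, locally scoped editing the abstract describes.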