Draft and Refine with Visual Experts
Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani
2025-11-21
Summary
This paper addresses a problem with advanced AI models that can both "see" images and understand language: they sometimes make things up, or fail to use the visual information they are given when answering questions.
What's the problem?
Large vision-language models appear to understand both images and text, but they often lean too heavily on what they already "know" from language rather than actually looking at the image. This leads to incorrect answers, or responses unsupported by what is in the picture, a failure known as "hallucination". The core issue is that there was no way to measure *how much* these models actually use the visual information.
What's the solution?
The researchers created a system called Draft and Refine (DnR). It first identifies which parts of an image are most relevant to a question, then checks how much the model's answer changes when those parts are hidden. This yields a score for how strongly the model relies on the image. Guided by this score, DnR gathers feedback from "visual experts" (tools that highlight specific regions of the image) and refines its answer to better focus on the visual evidence, all without retraining or modifying the model itself.
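The masking-based score described above can be illustrated with a small sketch. Everything here is a toy stand-in, not the paper's implementation: `answer_confidence` is a hypothetical placeholder for re-querying the model for its answer probability, and the images and relevance maps are synthetic. The idea shown is relevance-guided probabilistic masking: hide question-relevant pixels more often, and report the average drop in confidence as the utilization score.

```python
import numpy as np

rng = np.random.default_rng(0)


def answer_confidence(image: np.ndarray) -> float:
    # Hypothetical stand-in for the LVLM's confidence in its drafted
    # answer; here, simply the mean pixel intensity. A real system
    # would re-query the model and read off the answer probability.
    return float(image.mean())


def utilization(image: np.ndarray, relevance: np.ndarray,
                n_samples: int = 16) -> float:
    # Relevance-guided probabilistic masking: pixels the relevance map
    # marks as question-specific are masked more often; the score is
    # the average confidence drop relative to the unmasked image.
    base = answer_confidence(image)
    drops = []
    for _ in range(n_samples):
        mask = rng.random(relevance.shape) < relevance
        drops.append(base - answer_confidence(np.where(mask, 0.0, image)))
    return float(np.mean(drops))


# Toy image: a bright "object" patch on a dark background.
image = np.zeros((8, 8))
image[2:5, 2:5] = 1.0

# A relevance map that localizes the object vs. one that ignores it.
focused = np.zeros_like(image)
focused[2:5, 2:5] = 0.9
ignored = np.zeros_like(image)

print(utilization(image, focused))  # large drop: the answer depends on the region
print(utilization(image, ignored))  # no drop: visual evidence goes unused
```

Under this toy, a confident drop after masking signals genuine reliance on the image; the DnR loop would then keep whichever expert-cued response raises this score the most.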
Why does it matter?
This work is important because it provides a way to make these AI models more trustworthy and reliable. By measuring and improving how much they use visual information, we can reduce the chances of them making up answers and create AI systems that are more grounded in reality and easier to understand.
Abstract
While recent Large Vision-Language Models (LVLMs) exhibit strong multimodal reasoning abilities, they often produce ungrounded or hallucinated responses because they rely too heavily on linguistic priors instead of visual evidence. This limitation highlights the absence of a quantitative measure of how much these models actually use visual information during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a question-conditioned utilization metric. The metric quantifies the model's reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific cues and then measuring dependence through relevance-guided probabilistic masking. Guided by this metric, the DnR agent refines its initial draft using targeted feedback from external visual experts. Each expert's output (such as boxes or masks) is rendered as visual cues on the image, and the model is re-queried to select the response that yields the largest improvement in utilization. This process strengthens visual grounding without retraining or architectural changes. Experiments across VQA and captioning benchmarks show consistent accuracy gains and reduced hallucination, demonstrating that measuring visual utilization provides a principled path toward more interpretable and evidence-driven multimodal agent systems.