ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang
2025-01-13

Summary
This paper introduces ReFocus, an AI framework that helps models understand complex images like tables and charts by letting them 'think visually' through step-by-step image editing.
What's the problem?
Current AI models struggle with complex images like tables and charts because they can't shift their focus across different parts of an image in a step-by-step way. Without this selective attention, they often misread the structure and arrive at inaccurate answers.
What's the solution?
The researchers created ReFocus, which lets AI models generate 'visual thoughts' by editing the input image. The model writes Python code that calls simple tools to draw boxes, highlight sections, and mask out areas of the image. Each edit helps the AI focus on one specific part at a time, similar to how a person might trace through a complex diagram; a rough sketch of what such tools could look like appears below. The system was tested on a range of table and chart tasks, where it significantly outperformed the same model without visual editing.
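To make the editing step concrete, here is a minimal sketch of what such tools could look like, implemented with Pillow. The function names, signatures, and example coordinates are illustrative assumptions for this summary, not the paper's released code.

```python
# A minimal sketch of ReFocus-style visual editing tools using Pillow.
# Function names, signatures, and coordinates are hypothetical; the paper
# describes the tools' effects, but this is not its released API.
from PIL import Image, ImageDraw


def draw_box(img, box, color="red", width=3):
    """Draw a colored rectangle around a region to direct attention to it."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, outline=color, width=width)
    return out


def highlight_section(img, box, color=(255, 255, 0), alpha=80):
    """Overlay a semi-transparent tint on a region (e.g., a table column)."""
    out = img.convert("RGBA")
    overlay = Image.new("RGBA", out.size, (0, 0, 0, 0))
    ImageDraw.Draw(overlay).rectangle(box, fill=color + (alpha,))
    return Image.alpha_composite(out, overlay).convert("RGB")


def mask_area(img, box, fill="white"):
    """Blank out an irrelevant region so the model ignores it."""
    out = img.copy()
    ImageDraw.Draw(out).rectangle(box, fill=fill)
    return out


# The model might emit code like this to focus on one table column.
image = Image.open("table.png")                      # hypothetical input image
step1 = mask_area(image, (400, 0, 800, 600))         # hide distractor columns
step2 = highlight_section(step1, (0, 40, 400, 600))  # tint the target column
step2.save("table_refocused.png")
```

In the ReFocus loop, each edited image is fed back to the model, so the sequence of edits acts like a visual chain of thought: the model narrows its attention one step at a time before producing a final answer.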
Why it matters?
ReFocus matters because it makes AI much better at understanding complex visual information, which is crucial for many real-world applications. It improved performance by an average of 11.0% on table tasks and 6.8% on chart tasks over GPT-4o without visual editing. This could lead to more accurate and efficient AI systems for analyzing data in fields like science, business, and education. Additionally, training models on data collected with this 'visual chain-of-thought' approach proved more effective than training on standard question-answer pairs, potentially changing how we train AI systems in the future.
Abstract
Structured image understanding, such as interpreting tables and charts, requires strategically refocusing across various structures and texts within an image, forming a reasoning sequence to arrive at the final answer. However, current multimodal large language models (LLMs) lack this multi-hop selective attention capability. In this work, we introduce ReFocus, a simple yet effective framework that equips multimodal LLMs with the ability to generate "visual thoughts" by performing visual editing on the input image through code, shifting and refining their visual focus. Specifically, ReFocus enables multimodal LLMs to generate Python code to call tools and modify the input image, sequentially drawing boxes, highlighting sections, and masking out areas, thereby enhancing the visual reasoning process. We experiment on a wide range of structured image understanding tasks involving tables and charts. ReFocus substantially improves performance on all tasks over GPT-4o without visual editing, yielding an average gain of 11.0% on table tasks and 6.8% on chart tasks. We present an in-depth analysis of the effects of different visual edits, and the reasons ReFocus can improve performance without introducing additional information. Further, we collect a 14k training set using ReFocus, and show that such visual chain-of-thought with intermediate information offers better supervision than standard VQA data, reaching an 8.0% average gain over the same model trained with QA pairs and 2.6% over CoT.