High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu
2025-07-09
Summary
This paper introduces MGPO (Multi-turn Grounding-based Policy Optimization), a reinforcement learning method that improves how large multi-modal models understand and answer questions about high-resolution images. The model repeatedly grounds and crops smaller regions of the image, allowing it to examine fine details that would otherwise be lost.
What's the problem?
Current large multi-modal models struggle to locate and interpret the small but important details in high-resolution images during visual question answering, and existing approaches typically require costly grounding annotations that tell the model where to look.
What's the solution?
The researchers created MGPO, an end-to-end reinforcement learning framework that teaches the model to iteratively predict the coordinates of question-relevant regions, crop those sub-images, and reason over them turn by turn. Because the training signal comes only from answer correctness, the model learns to explore images effectively without any additional grounding labels.
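The multi-turn loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `policy` function is a hypothetical stand-in for the large multi-modal model, and the image is represented as a simple grid of pixel values rather than a real tensor.

```python
def crop(image, box):
    """Crop a region from an image stored as a list of pixel rows.
    box = (x0, y0, x1, y1) with exclusive upper bounds."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def multi_turn_grounding(image, question, policy, max_turns=3):
    """Let the policy alternate between grounding (picking a region
    to zoom into) and answering, for up to max_turns turns."""
    view = image
    for _ in range(max_turns):
        action = policy(view, question)
        if action["type"] == "answer":
            return action["text"]
        view = crop(view, action["box"])  # zoom into the predicted region
    # Out of turns: force the policy to commit to an answer.
    return policy(view, question, force_answer=True)["text"]

# Toy 6x6 "image" and a dummy policy: it zooms in once, then answers.
image = [[r * 10 + c for c in range(6)] for r in range(6)]

def toy_policy(view, question, force_answer=False):
    if len(view) > 3 and not force_answer:  # view still too coarse
        return {"type": "ground", "box": (2, 2, 5, 5)}
    return {"type": "answer", "text": f"center pixel = {view[1][1]}"}

print(multi_turn_grounding(image, "what is in the center?", toy_policy))
# → center pixel = 33
```

In MGPO the grounding coordinates are emitted by the model itself as part of its output, and the cropped sub-image is appended to the conversation as a new turn; the reward used for policy optimization depends only on whether the final answer is correct.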
Why it matters?
This matters because it enables AI systems to analyze complex, high-resolution images and answer questions about them more precisely, which is useful in fields such as medicine, robotics, and education.
Abstract
MGPO, an end-to-end reinforcement learning framework, enhances large multi-modal models' visual grounding abilities through iterative sub-image cropping, improving performance on visual question answering tasks without requiring additional grounding annotations.