High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
Xinyu Huang, Yuhao Dong, Weiwei Tian, Bo Li, Rui Feng, Ziwei Liu
2025-07-09
Summary
This paper introduces MGPO (Multi-turn Grounding-based Policy Optimization), a reinforcement learning method that improves how large multi-modal models understand and answer questions about high-resolution images. The model repeatedly grounds and crops smaller regions of the image, allowing it to examine fine details that would otherwise be lost.
What's the problem?
Current large multi-modal models struggle to locate and interpret the small but important details in high-resolution images during visual question answering, and existing approaches typically require costly grounding annotations that tell the model where to look.
What's the solution?
The researchers created MGPO, an end-to-end reinforcement learning framework that teaches the model to iteratively predict the coordinates of question-relevant regions, crop those sub-images, and reason over them turn by turn. Because the training signal comes only from answer correctness, the model learns to explore images effectively without any additional grounding labels.
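The multi-turn loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `policy` function is a hypothetical stand-in for the large multi-modal model, and the image is represented as a simple grid of pixel values rather than a real tensor.

```python
def crop(image, box):
    """Crop a region from an image stored as a list of pixel rows.
    box = (x0, y0, x1, y1) with exclusive upper bounds."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def multi_turn_grounding(image, question, policy, max_turns=3):
    """Let the policy alternate between grounding (picking a region
    to zoom into) and answering, for up to max_turns turns."""
    view = image
    for _ in range(max_turns):
        action = policy(view, question)
        if action["type"] == "answer":
            return action["text"]
        view = crop(view, action["box"])  # zoom into the predicted region
    # Out of turns: force the policy to commit to an answer.
    return policy(view, question, force_answer=True)["text"]

# Toy 6x6 "image" and a dummy policy: it zooms in once, then answers.
image = [[r * 10 + c for c in range(6)] for r in range(6)]

def toy_policy(view, question, force_answer=False):
    if len(view) > 3 and not force_answer:  # view still too coarse
        return {"type": "ground", "box": (2, 2, 5, 5)}
    return {"type": "answer", "text": f"center pixel = {view[1][1]}"}

print(multi_turn_grounding(image, "what is in the center?", toy_policy))
# → center pixel = 33
```

In MGPO the grounding coordinates are emitted by the model itself as part of its output, and the cropped sub-image is appended to the conversation as a new turn; the reward used for policy optimization depends only on whether the final answer is correct.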
Why it matters?
This matters because it enables AI systems to analyze complex, high-resolution images and answer questions about them more precisely, which is useful in fields such as medicine, robotics, and education.
Abstract
MGPO, an end-to-end reinforcement learning framework, enhances large multi-modal models' visual grounding abilities through iterative sub-image cropping, improving performance on visual question answering tasks without requiring additional grounding annotations.