VGR: Visual Grounded Reasoning
Jiacong Wang, Zijiang Kang, Haochen Wang, Haiyong Jiang, Jiawen Li, Bohong Wu, Ya Wang, Jiao Ran, Xiao Liang, Chao Feng, Jun Xiao
2025-06-17
Summary
This paper introduces VGR, a multimodal AI model that combines image and language understanding to reason better about what it sees. VGR improves visual reasoning by first finding the parts of an image that matter for a question and then using that focused information to think through its answer more accurately. It outperforms previous models on benchmarks that involve both pictures and words while using less computing power.
What's the problem?
The problem is that many AI models have a hard time focusing on the right parts of an image to understand it properly. They spread attention over unnecessary or unrelated details, which makes their reasoning and answers less accurate. In addition, some existing models require large amounts of computing resources, which makes them slow and expensive to run.
What's the solution?
The solution is VGR's approach of detecting the regions of an image most relevant to the question and then feeding that focused information directly into its reasoning process. This targeted attention helps the model connect what it sees with what the question asks, improving accuracy while using fewer resources than earlier multimodal models. The sketch below illustrates the detect-then-reason idea.
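To make the idea concrete, here is a minimal, hypothetical Python sketch of the detect-then-reason loop described above. It is not VGR's actual API: the function names (propose_regions, crop, answer_with_regions), the Region structure, and the top_k parameter are placeholders invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates


@dataclass
class Region:
    box: Box      # bounding box of a candidate image region
    score: float  # estimated relevance of the region to the question


def propose_regions(image, question: str) -> List[Region]:
    # Placeholder for the detection step: the model would emit boxes
    # for the image areas it judges relevant to the question.
    raise NotImplementedError


def crop(image, box: Box):
    # Placeholder: return the sub-image covered by `box`, re-encoded so
    # its visual tokens can be fed back into the language model.
    raise NotImplementedError


def answer_with_regions(image, question: str, crops) -> str:
    # Placeholder for the reasoning step: condition the answer on the
    # full image, the question, and the focused crops.
    raise NotImplementedError


def grounded_answer(image, question: str, top_k: int = 3) -> str:
    """Detect-then-reason loop sketched from the summary above."""
    # 1) Propose regions relevant to the question and rank them by score.
    regions = sorted(propose_regions(image, question),
                     key=lambda r: r.score, reverse=True)
    # 2) Keep only the top-k crops so the extra visual tokens stay cheap.
    crops = [crop(image, r.box) for r in regions[:top_k]]
    # 3) Answer using the question together with the focused regions.
    return answer_with_regions(image, question, crops)
```

The point of the sketch is the ordering: relevant regions are selected first and only those crops are added to the reasoning context, which is what lets the approach stay accurate without processing every detail of the image.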
Why it matters?
This matters because improving how AI models reason over both images and language helps create smarter systems. Such systems can assist in areas like education, healthcare, and technology by interpreting visual information quickly and accurately without needing huge amounts of computing power, making AI more accessible and useful in everyday life.
Abstract
VGR, a novel multimodal large language model, improves visual reasoning by detecting relevant image regions and integrating them into the reasoning process, outperforming existing models on multimodal benchmarks with reduced resource usage.