
Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang

2025-07-11


Summary

This paper introduces TreeBench, a new benchmark for testing how well AI can understand and reason about images, focusing on finding small or subtle targets, giving clear evidence for its answers, and reasoning about how objects relate to each other in complex ways. It also presents TreeVGR, a training method that improves this reasoning ability using reinforcement learning.

What's the problem?

Current AI models can struggle to accurately find tiny or subtle objects in pictures, explain how they reached their conclusions, and understand complex relationships between objects beyond just identifying where they are. Until now, there was no benchmark that evaluated all of these skills together.

What's the solution?

The researchers created TreeBench, a thorough benchmark with carefully chosen images and detailed questions that require the AI to provide the precise locations of the objects it relies on along with its answers, allowing the reasoning process to be checked. They also developed TreeVGR, a training method that uses reinforcement learning to teach the AI to link visual evidence and reasoning steps in a traceable way, improving both its accuracy and its ability to explain itself.
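One natural way to check whether a model's cited evidence is actually grounded is to compare its predicted bounding box against the annotated one using intersection-over-union (IoU). The sketch below is illustrative only, not the paper's implementation; the box format, function names, and the 0.5 threshold are assumptions.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (empty if the boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def evidence_is_grounded(pred_box, gt_box, threshold=0.5):
    """Count a predicted evidence box as grounded if its IoU with the
    annotated box clears a threshold (0.5 is an assumed cutoff)."""
    return iou(pred_box, gt_box) >= threshold
```

Scoring localization alongside the final answer is what makes the reasoning checkable: a model that answers correctly but points at the wrong region can be penalized.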

Why it matters?

This matters because it pushes AI toward more human-like visual reasoning: systems that not only answer questions about images but also show the evidence behind their answers, which is crucial for applications where trust and transparency matter.

Abstract

TreeBench evaluates visual grounded reasoning with a focus on subtle targets, traceable evidence, and second-order reasoning, while TreeVGR enhances this reasoning using reinforcement learning.