VG-Refiner: Towards Tool-Refined Referring Grounded Reasoning via Agentic Reinforcement Learning
Yuji Wang, Wenlong Liu, Jingxuan Niu, Haoji Zhang, Yansong Tang
2025-12-09
Summary
This paper introduces a way to make AI systems that rely on external tools, such as object detectors, more dependable when solving problems that require both visual perception and language understanding. It focuses on keeping these systems reliable even when the tools they call are imperfect.
What's the problem?
Current AI systems that use tools often struggle when those tools make mistakes. For example, if an AI is trying to find an object in an image using a detector, and the detector incorrectly identifies something, the AI can get completely thrown off and make up incorrect answers. This is especially true when the AI needs to understand what someone is *referring* to in an image and connect it to reasoning about the scene.
What's the solution?
The researchers created a framework called VG-Refiner. This system works in two steps: 'think' and 'rethink'. First, the AI makes an initial guess. Then, it checks the tool's output and, if the tool seems unreliable, it actively reconsiders its answer. They also developed a way to reward the AI for correcting its mistakes when the tool gives bad information. Finally, they created new ways to measure how well an AI can refine its reasoning based on tool feedback.
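The think-rethink loop and refinement reward described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: every name here (`Detection`, `think`, `rethink`, `refinement_reward`, the confidence threshold, and the exact reward values) is an assumption made for clarity; the paper's actual reward and revision policy are learned via reinforcement learning, not hand-coded rules.

```python
# Hypothetical sketch of a two-stage think-rethink loop with a
# refinement-style reward. All names and constants are illustrative.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple          # (x1, y1, x2, y2) bounding box from the tool
    confidence: float   # detector's self-reported score

def model_own_estimate(query: str) -> tuple:
    # Placeholder for the model's self-grounded prediction,
    # used when the tool output is judged unreliable.
    return (10, 10, 50, 50)

def think(query: str, detection: Detection) -> tuple:
    """Stage 1: form an initial answer that trusts the tool output."""
    return detection.box

def rethink(query: str, detection: Detection, initial_box: tuple,
            threshold: float = 0.5) -> tuple:
    """Stage 2: inspect the tool feedback; if it looks unreliable,
    revise the answer rather than copying the tool verbatim."""
    if detection.confidence >= threshold:
        return initial_box  # tool looks trustworthy: keep initial answer
    return model_own_estimate(query)  # tool looks unreliable: self-correct

def refinement_reward(final_box: tuple, gold_box: tuple,
                      tool_was_wrong: bool) -> float:
    """Reward correctness, with a bonus when the model fixed a bad tool
    output (the spirit of the refinement reward; exact form is assumed)."""
    correct = final_box == gold_box
    base = 1.0 if correct else 0.0
    bonus = 0.5 if correct and tool_was_wrong else 0.0
    return base + bonus
```

Walking through a bad-tool case: a detection with confidence 0.2 is kept in the 'think' stage, rejected in the 'rethink' stage in favor of the model's own estimate, and the reward grants extra credit because the final answer corrected a wrong tool output.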
Why it matters?
This work is important because it makes AI systems that use tools more robust and trustworthy. By teaching AI to question and correct for tool errors, we can build systems that are less likely to hallucinate answers and more capable of solving complex problems that require both visual understanding and logical reasoning.
Abstract
Tool-integrated visual reasoning (TiVR) has demonstrated great potential in enhancing multimodal problem-solving. However, existing TiVR paradigms mainly focus on integrating various visual tools through reinforcement learning, while neglecting to design effective response mechanisms for handling unreliable or erroneous tool outputs. This limitation is particularly pronounced in referring and grounding tasks, where inaccurate detection tool predictions often mislead TiVR models into generating hallucinated reasoning. To address this issue, we propose VG-Refiner, the first framework aimed at tool-refined referring grounded reasoning. Technically, we introduce a two-stage think-rethink mechanism that enables the model to explicitly analyze and respond to tool feedback, along with a refinement reward that encourages effective correction in response to poor tool results. In addition, we propose two new metrics and establish fair evaluation protocols to systematically measure the refinement ability of current models. We adopt a small amount of task-specific data to enhance the refinement capability of VG-Refiner, achieving a significant improvement in accuracy and correction ability on referring and reasoning grounding benchmarks while preserving the general capabilities of the pretrained model.