ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Shengyuan Ding, Xinyu Fang, Ziyu Liu, Yuhang Zang, Yuhang Cao, Xiangyu Zhao, Haodong Duan, Xiaoyi Dong, Jianze Liang, Bin Wang, Conghui He, Dahua Lin, Jiaqi Wang

2025-12-05

Summary

This paper introduces a new type of reward model for vision-language AI systems called ARM-Thinker, which is designed to be more reliable and accurate when judging how well an AI is performing a task that involves both images and text.

What's the problem?

Current reward models, which are used to teach AI systems what humans prefer, have some major flaws. They often 'hallucinate' – meaning they make things up – and struggle to accurately connect their judgments to specific details in images. They also can't use external tools to double-check their reasoning, making them unreliable for complex tasks that require careful thought and verification.

What's the solution?

ARM-Thinker solves this by acting like an 'agent' that can actively use tools like cropping images or searching for information in documents. Instead of just giving a score, it *shows its work* by using these tools to gather evidence and justify its judgments. The researchers trained this model using a special method that helps it learn both *when* to use tools and *how* to make accurate assessments. They also created a new set of tests, ARMBench-VL, to specifically evaluate these agentic reward models.
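The agentic judging loop described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the tool names (`crop_image`, `retrieve_page`), the `policy` interface, and the action dictionary format are all hypothetical stand-ins for how a model might alternate between gathering evidence and emitting a final score.

```python
from dataclasses import dataclass, field

# Hypothetical tools in the spirit of ARM-Thinker's image cropping and
# document page retrieval; names and signatures are assumptions.
def crop_image(image, box):
    """Return a sub-region of an image stored as a list of pixel rows."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]

def retrieve_page(document, page_idx):
    """Fetch one page of a multi-page document (a list of page strings)."""
    return document[page_idx]

TOOLS = {"crop_image": crop_image, "retrieve_page": retrieve_page}

@dataclass
class Judgment:
    score: float
    evidence: list = field(default_factory=list)  # (tool_name, result) pairs

def agentic_judge(policy, context, max_tool_calls=3):
    """Loop: the policy either requests a tool call to gather evidence,
    or emits a final score grounded in the evidence collected so far."""
    evidence = []
    while True:
        must_score = len(evidence) >= max_tool_calls  # tool budget spent
        action = policy(context, evidence, must_score)
        if must_score or action["type"] == "score":
            return Judgment(action["value"], evidence)
        tool = TOOLS[action["tool"]]                   # invoke requested tool
        result = tool(context[action["target"]], *action["args"])
        evidence.append((action["tool"], result))      # record the evidence
```

The key design point mirrored here is that the final score comes with a trace of tool calls and their results, so the judgment is auditable rather than a bare number; the paper's multi-stage RL training would jointly shape both the tool-calling decisions and the scoring, which this sketch leaves abstract behind the `policy` function.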

Why it matters?

This research is important because it significantly improves the accuracy and trustworthiness of reward models. By allowing AI to verify its own reasoning, ARM-Thinker leads to better performance on tasks involving visual details, understanding complex documents, and following instructions. This is a step towards building AI systems that are not only intelligent but also explainable and reliable, which is crucial for real-world applications.

Abstract

Reward models are critical for aligning vision-language systems with human preferences, yet current approaches suffer from hallucination, weak visual grounding, and an inability to use tools for verification, limiting their reliability on complex multimodal reasoning tasks. We present ARM-Thinker, an Agentic multimodal Reward Model that autonomously invokes external tools (e.g., image cropping, doc page retrieval) to ground judgments in verifiable evidence, replacing static, non-interactive reward scoring. This enables the model to verify fine-grained visual details, cross-reference multi-page evidence, and validate reasoning claims, which are capabilities absent in existing reward models. We train ARM-Thinker with multi-stage reinforcement learning, jointly optimizing tool-calling decisions and judgment accuracy. To evaluate agentic reward modeling, we introduce ARMBench-VL, comprising three benchmarks that assess fine-grained visual grounding (image-level tools), multi-page document understanding (retrieval tools), and instruction following (text-level verification). ARM-Thinker achieves +16.2% average improvement on reward modeling benchmarks, +9.6% on tool-use tasks, and outperforms baselines on multimodal math and logical reasoning benchmarks. Our results demonstrate that agentic capabilities significantly enhance both accuracy and interpretability of reward models.