ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Mengjie Deng, Guanting Dong, Zhicheng Dou

2025-11-04

Summary

This paper introduces ToolScope, a new system designed to help multimodal large language models, which can understand both text and images, become better at solving complex problems by using external tools.

What's the problem?

Large language models are great at reasoning and problem-solving, especially when they can use tools like search engines or calculators. However, it's been difficult to get models that handle images *and* text, called multimodal models, to effectively use these tools during their reasoning process, particularly when dealing with lots of visual information. The visual information can get 'lost' or become less useful as the problem gets more complicated.

What's the solution?

The researchers created ToolScope, which works in three main parts. First, a 'Global Navigator' plans the overall strategy. Then, an 'Agentic Executor' actually carries out the plan, using tools like search, code execution, and a new 'Perceive' tool specifically designed to focus on and maintain important visual details. Finally, a 'Response Synthesizer' puts all the reasoning steps together into a clear and understandable answer. The 'Perceive' tool is key because it helps the model not forget important details from images as it works through a problem.

Why it matters?

This work is important because it significantly improves the ability of multimodal AI to tackle complex tasks that require both visual understanding and external knowledge. By making these models better at using tools, it opens the door to more powerful and versatile AI systems that can help with things like scientific reasoning, problem solving involving diagrams, and more, showing a performance boost across several different tests.

Abstract

Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment the MLLM with local perception through the integration of external tools: Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains, including VQA 2.0, ScienceQA, MAT-Search and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.