MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning
Jiawei Chen, Xintian Shen, Lihao Zheng, Zhenwei Shao, Hongyuan Zhang, Pengfei Yu, Xudong Rao, Ning Mao, Xiaobo Liu, Lian Wen, Chaoqun Du, Feng Gu, Wei He, Qizhen Li, Shanshan Li, Zide Liu, Jing Luo, Lifu Mu, Xuhao Pan, Chang Ren, Haoyi Sun, Qian Wang
2026-01-07
Summary
This paper introduces MindWatcher, a tool-integrated reasoning agent designed to solve complex problems by thinking through them and using tools like search engines and image databases on its own, without needing step-by-step instructions from people.
What's the problem?
Traditional AI agents struggle with real-world tasks that require invoking different tools in the right order. They often need a pre-defined workflow, meaning someone has to tell them exactly what to do step-by-step. This limits their ability to handle unexpected situations or complex problems that require flexible thinking and tool use.
What's the solution?
The researchers created MindWatcher, which uses a method called 'interleaved thinking': it can switch between thinking about the problem and calling tools at any point in its reasoning. It also uses 'multimodal chain-of-thought reasoning', meaning it can work with both text *and* images while it reasons, manipulating images to get more precise search results. The team built a large local database of images (covering eight categories, including cars, animals, and plants) and a suite of auxiliary tools for MindWatcher to use, along with automated pipelines for auditing data and evaluating performance, including a new benchmark called MWE-Bench. They also designed a more efficient training infrastructure, as sketched below.
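To make the interleaved thinking idea concrete, here is a minimal sketch of what such a think-and-act loop could look like. This is not the paper's implementation: the CALL/ANSWER text protocol, the tool names (`web_search`, `image_search`), and the model interface are all illustrative assumptions.

```python
from typing import Callable

# Hypothetical tool registry; the paper's actual tool suite and interfaces
# are not specified in this summary, so these names are illustrative stubs.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda q: f"<results for {q!r}>",    # stub
    "image_search": lambda q: f"<matches for {q!r}>",  # stub
}

def run_agent(model: Callable[[str], str], question: str, max_steps: int = 8) -> str:
    """Interleaved think/act loop: at every step the model may emit either a
    tool call (e.g. 'CALL web_search: red coupe badge') or a final answer."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = model(transcript)       # model thinks, then decides what to do
        transcript += step + "\n"
        if step.startswith("CALL "):   # tool invocation requested mid-reasoning
            name, _, arg = step[len("CALL "):].partition(":")
            observation = TOOLS.get(name.strip(), lambda a: "unknown tool")(arg.strip())
            transcript += f"Observation: {observation}\n"  # result feeds the next thought
        elif step.startswith("ANSWER:"):
            return step[len("ANSWER:"):].strip()
    return "no answer within step budget"
```

With a scripted stand-in for `model`, this loop alternates model steps and tool observations until an ANSWER line appears, which is the key property interleaved thinking adds over a fixed workflow.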
Why does it matter?
MindWatcher is important because it represents a significant step toward AI agents that can truly think for themselves and solve complex problems independently. It matches or outperforms similar agents, including much larger models, and its experiments yield valuable insights into how best to train these kinds of AI systems. This could lead to AI that helps us with a wider range of tasks, from research to everyday problem-solving.
Abstract
Traditional workflow-based agents exhibit limited intelligence when addressing real-world problems requiring tool invocation. Tool-integrated reasoning (TIR) agents capable of autonomous reasoning and tool invocation are rapidly emerging as a powerful approach for complex decision-making tasks involving multi-step interactions with external environments. In this work, we introduce MindWatcher, a TIR agent integrating interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use, without relying on human prompts or workflows. The interleaved thinking paradigm enables the model to switch between thinking and tool calling at any intermediate stage, while its multimodal CoT capability allows manipulation of images during reasoning to yield more precise search results. We implement automated data auditing and evaluation pipelines, complemented by manually curated high-quality datasets for training, and we construct a benchmark, called MindWatcher-Evaluate Bench (MWE-Bench), to evaluate its performance. MindWatcher is equipped with a comprehensive suite of auxiliary reasoning tools, enabling it to address broad-domain multimodal problems. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition despite its small size. Finally, we design a more efficient training infrastructure for MindWatcher, enhancing training speed and hardware utilization. Experiments not only demonstrate that MindWatcher matches or exceeds the performance of larger or more recent models through superior tool invocation, but also uncover critical insights for agent training, such as the genetic inheritance phenomenon in agentic RL.
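As a rough illustration of the retrieval component described in the abstract, the sketch below indexes per-category image embeddings and answers queries by cosine similarity. Everything beyond what the abstract states is a placeholder: only three of the eight categories are named, and the 64-dimensional random vectors stand in for a real image encoder and database.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy index: {category: (embedding matrix, labels)}. The paper names cars,
# animals, and plants among eight categories; the remaining five are not
# specified here, and the embeddings are random stand-ins.
index = {
    cat: (rng.normal(size=(100, 64)), [f"{cat}_{i}" for i in range(100)])
    for cat in ["cars", "animals", "plants"]
}

def retrieve(query_emb: np.ndarray, category: str, k: int = 3) -> list[str]:
    """Return the k nearest labels by cosine similarity within one category."""
    embs, labels = index[category]
    sims = embs @ query_emb / (np.linalg.norm(embs, axis=1) * np.linalg.norm(query_emb))
    return [labels[i] for i in np.argsort(-sims)[:k]]

print(retrieve(rng.normal(size=64), "cars"))  # e.g. top-3 candidate car labels
```

In a real system the query embedding would come from the image the agent is currently reasoning about, letting a small model offload fine-grained object recognition to the local database rather than memorizing it in its weights.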