Thinking with Images via Self-Calling Agent
Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
2025-12-12
Summary
This paper introduces a new method called Self-Calling Chain-of-Thought, or sCoT, to improve how AI systems reason about images. It builds on existing techniques that combine visual information with step-by-step thinking, but makes the process more efficient and effective.
What's the problem?
Current AI systems that 'think with images' and use a step-by-step reasoning process are hard to train. They need a lot of good examples of how to correctly reason about images, and training them can take a very long time and require a lot of computing power. The process of combining image understanding with reasoning steps is complex and doesn't scale well.
What's the solution?
The researchers came up with sCoT, which reframes image reasoning as a language-only problem. Instead of constantly interleaving image content with reasoning text, the system breaks the task into smaller, independent sub-tasks. A main agent then invokes 'virtual copies' of itself (subagents that share the same model parameters) to solve each sub-task in its own isolated context. This avoids the constant back-and-forth between image and text, making training much faster and more efficient. They also use a reinforcement learning technique, group-relative policy optimization, to reinforce effective reasoning behavior.
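The self-calling pattern described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the shared-parameter model is stubbed out with a plain function, and the decomposition prompt and sub-task strings are invented for the example. The key point it demonstrates is that one and the same callable serves as both the main agent and every subagent, while each sub-task runs in a fresh, isolated context.

```python
def model(messages):
    # Stub standing in for a single shared-parameter language model.
    # A real system would run model inference here.
    question = messages[-1]
    if question.startswith("DECOMPOSE:"):
        # Acting as the main agent: split the task into atomic sub-tasks.
        return ["What objects are in region A?", "What objects are in region B?"]
    # Acting as a subagent: answer one atomic sub-task in isolation.
    return f"answer({question})"

def solve(task):
    # Main agent decomposes the task (language-only reasoning).
    subtasks = model([f"DECOMPOSE: {task}"])
    # Self-calling: invoke virtual replicas of the same model, each with
    # its own isolated context (a fresh message list per sub-task).
    sub_answers = [model([sub]) for sub in subtasks]
    # Main agent aggregates sub-answers into a final response.
    return "; ".join(sub_answers)

final = solve("Count the objects in the image.")
```

Because the subagents share parameters with the main agent, no second model is loaded; isolation comes entirely from giving each call its own context rather than one long interleaved transcript.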
Why it matters?
This research is important because it makes it easier to build AI systems that can understand and reason about images. By reducing the training time and computational cost, sCoT opens the door to more complex visual reasoning tasks and could lead to improvements in areas like robotics, image search, and visual question answering. It shows a promising way to make AI more efficient at handling visual information.
Abstract
Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. sCoT employs group-relative policy optimization to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to 1.9% with ~75% fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
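The abstract names group-relative policy optimization as the training objective. A minimal sketch of the group-relative advantage at its core is shown below; the reward design and policy-update details are the paper's and are not reproduced here. For each group of sampled rollouts, every reward is normalized against the group's own mean and standard deviation, so rollouts are reinforced only relative to their siblings.

```python
import statistics

def group_relative_advantages(rewards):
    # Normalize each rollout's reward against its group's statistics:
    # A_i = (r_i - mean(r)) / std(r). Rollouts above the group mean get
    # positive advantages; those below get negative ones.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: two successful and two failed rollouts in one group.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because advantages are centered within each group, no separate learned value function is needed, which is part of what keeps the training loop lightweight.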