
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao

2025-11-26


Summary

This paper introduces a new way to improve vision-language AI agents, which are computer programs that can 'see' and 'understand' images and text. The key idea is to let the agent learn and get better all on its own, without needing constant feedback from humans.

What's the problem?

Current AI agents that handle both images and language often struggle because they rely on humans to tell them when they're doing well or poorly. This human feedback is expensive and limits how much the AI can learn. When these agents try to evaluate themselves using just text, they can make mistakes and 'hallucinate' – essentially, think they've done something correctly when they haven't, especially when dealing with complex visual tasks.

What's the solution?

The researchers created an agent called Agent0-VL that uses 'tools' to help it think, check its work, and improve. It's like giving the agent a calculator or a search engine to verify its reasoning. Agent0-VL combines two roles inside a single model: a 'Solver' that tries to solve problems using these tools, and a 'Verifier' that uses the same tools to check the Solver's work and give it structured feedback. The two roles interact in a continuous cycle, called the Self-Evolving Reasoning Cycle, in which the agent learns from its own tool-verified mistakes without any human help, as sketched below.
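
To make the Solver/Verifier loop concrete, here is a minimal Python sketch of that cycle. Everything in it (the ToolBox, solve, verify, and self_evolving_cycle names, and the toy calculator tool) is an illustrative assumption rather than the paper's actual implementation; in Agent0-VL both roles live inside one large vision-language model, the tools operate over images as well as text, and the verified traces feed a reinforcement-learning update.

```python
# Hedged sketch of a Solver/Verifier self-evolving cycle.
# All names and the calculator tool are illustrative stand-ins,
# not Agent0-VL's real API.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToolBox:
    """Registry of callable tools the agent can invoke while reasoning."""
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def call(self, name: str, *args) -> str:
        return self.tools[name](*args)


@dataclass
class Trace:
    """One reasoning attempt: the steps taken and the final answer."""
    steps: List[str]
    answer: str


def solve(task: str, toolbox: ToolBox) -> Trace:
    # Solver role: multi-turn, tool-integrated reasoning (toy version).
    result = toolbox.call("calculator", task)
    return Trace(steps=[f"used calculator on '{task}' -> {result}"], answer=result)


def verify(task: str, trace: Trace, toolbox: ToolBox) -> float:
    # Verifier role: re-check the Solver's work with the same tools and
    # emit a scalar self-reward (1.0 if the tool-grounded check agrees).
    recomputed = toolbox.call("calculator", task)
    return 1.0 if recomputed == trace.answer else 0.0


def self_evolving_cycle(tasks: List[str], toolbox: ToolBox, rounds: int = 3) -> None:
    # Self-Evolving Reasoning Cycle (sketch): solve, verify, and keep the
    # tool-verified traces as training signal; a real system would run an
    # RL update on the underlying vision-language model here.
    for r in range(rounds):
        kept = []
        for task in tasks:
            trace = solve(task, toolbox)
            reward = verify(task, trace, toolbox)
            if reward > 0.5:
                kept.append((trace, reward))
        print(f"round {r}: {len(kept)}/{len(tasks)} traces passed verification")


if __name__ == "__main__":
    tb = ToolBox(tools={"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))})
    self_evolving_cycle(["2+3*4", "(7-2)*6"], tb)
```

The point the sketch mirrors is that the reward comes from tool-grounded re-checking of the work, not from a human rater or a separate reward model.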

Why it matters?

This research is important because it shows a path towards creating AI agents that can continually improve their abilities without needing constant human supervision. That is a big step towards more capable and independent AI systems, and the 12.5% improvement over the base model on geometric problem solving and visual scientific analysis shows that this self-learning approach works in practice.

Abstract

Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves a 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0/Agent0-VL.
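
For readers curious how a tool-grounded self-reward can drive learning without any external reward model, the sketch below shows one common way such a signal could be turned into a policy update: a REINFORCE-style, reward-weighted likelihood step in PyTorch. This is a hedged illustration under assumed names and tensor shapes, not the authors' training objective, which may differ in its details.

```python
# Hedged sketch: turning Verifier self-rewards into a policy update.
# Names, shapes, and the toy policy are assumptions for illustration.

import torch
import torch.nn.functional as F


def self_reward_update(policy, optimizer, trace_token_ids, self_rewards):
    """One reward-weighted likelihood step on self-generated traces.

    trace_token_ids: LongTensor [batch, seq_len] holding Solver traces.
    self_rewards:    FloatTensor [batch] from the Verifier (no human labels).
    """
    logits = policy(trace_token_ids)                       # [batch, seq, vocab] (assumed)
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, trace_token_ids.unsqueeze(-1)).squeeze(-1)
    seq_logp = token_logp.sum(dim=-1)                      # log-likelihood per trace
    advantages = self_rewards - self_rewards.mean()        # centre the self-rewards
    loss = -(advantages * seq_logp).mean()                 # reinforce high-reward traces
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Toy stand-in policy: embedding plus linear head over a tiny vocabulary.
    vocab, seq_len, batch = 32, 8, 4
    policy = torch.nn.Sequential(torch.nn.Embedding(vocab, 16), torch.nn.Linear(16, vocab))
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    traces = torch.randint(0, vocab, (batch, seq_len))
    rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])           # pretend Verifier scores
    print("loss:", self_reward_update(policy, opt, traces, rewards))
```

The design point this mirrors is the abstract's "zero-external-reward evolution": the only learning signal is the Verifier's tool-grounded score on the agent's own traces.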