
MM-Zero: Self-Evolving Multi-Model Vision Language Models From Zero Data

Zongxia Li, Hongyang Du, Chengsong Huang, Xiyang Wu, Lantao Yu, Yicheng He, Jing Xie, Xiaomin Wu, Zhichao Liu, Jiarui Zhang, Fuxiao Liu

2026-03-11


Summary

This paper introduces a method for improving Vision Language Models (VLMs), AI systems that understand both images and text, without needing any human-provided training data to get started.

What's the problem?

Typically, to get a VLM to learn and improve on its own, you need to seed it with some starting data, usually a set of images. Because VLMs handle two modalities, vision and language, bootstrapping from absolutely nothing is harder than it is for text-only models. The challenge was to build a system where a VLM could improve its reasoning abilities without *any* pre-existing image data.

What's the solution?

The researchers developed a system called MM-Zero that uses a process similar to evolution, but for AI: instead of humans teaching the model, the model teaches itself. It does this through three specialized 'roles', all initialized from the same base model: a 'Proposer' that comes up with visual concepts and questions, a 'Coder' that turns those concepts into actual images by writing code (e.g., Python or SVG), and a 'Solver' that tries to answer the questions about the rendered images. The three roles learn together, improving from reward signals that combine execution feedback, visual verification, and difficulty balancing. Training uses a reinforcement learning technique called Group Relative Policy Optimization (GRPO) to make this learning efficient.
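To make the three-role loop concrete, here is a minimal sketch of one rollout. This is purely illustrative: the role functions below are hand-written stubs standing in for the paper's shared base VLM, and the reward names are assumptions based on the summary, not the authors' implementation.

```python
def proposer():
    """Propose an abstract visual concept plus a question (stubbed)."""
    return {"concept": "three red circles in a row",
            "question": "How many circles are shown?",
            "answer": "3"}

def coder(concept):
    """Translate the concept into renderable code (here: an SVG string)."""
    circles = "".join(
        f'<circle cx="{20 + 40 * i}" cy="20" r="10" fill="red"/>'
        for i in range(3)
    )
    return f'<svg xmlns="http://www.w3.org/2000/svg">{circles}</svg>'

def solver(image_code, question):
    """Answer the question from the rendered image (stubbed by counting)."""
    return str(image_code.count("<circle"))

def rollout():
    task = proposer()
    svg = coder(task["concept"])
    # Execution feedback: did the Coder emit well-formed, renderable output?
    exec_reward = 1.0 if svg.startswith("<svg") and svg.endswith("</svg>") else 0.0
    # Solver reward: exact match against the Proposer's intended answer.
    answer = solver(svg, task["question"])
    solve_reward = 1.0 if answer == task["answer"] else 0.0
    return exec_reward, solve_reward
```

In the actual framework each role is the same model prompted differently, and these per-rollout rewards feed the GRPO update rather than being inspected directly.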

Why it matters?

This work is important because it shows that VLMs can be significantly improved without large amounts of labeled data. That makes these systems cheaper to develop and deploy, and it opens the door to more complex AI systems that learn and adapt on their own, going beyond the conventional two-role (Proposer and Solver) setup.

Abstract

Self-evolving has emerged as a key paradigm for improving foundational models such as Large Language Models (LLMs) and Vision Language Models (VLMs) with minimal human intervention. While recent approaches have demonstrated that LLM agents can self-evolve from scratch with little to no data, VLMs introduce an additional visual modality that typically requires at least some seed data, such as images, to bootstrap the self-evolution process. In this work, we present Multi-model Multimodal Zero (MM-Zero), the first RL-based framework to achieve zero-data self-evolution for VLM reasoning. Moving beyond prior dual-role (Proposer and Solver) setups, MM-Zero introduces a multi-role self-evolving training framework comprising three specialized roles: a Proposer that generates abstract visual concepts and formulates questions; a Coder that translates these concepts into executable code (e.g., Python, SVG) to render visual images; and a Solver that performs multimodal reasoning over the generated visual content. All three roles are initialized from the same base model and trained using Group Relative Policy Optimization (GRPO), with carefully designed reward mechanisms that integrate execution feedback, visual verification, and difficulty balancing. Our experiments show that MM-Zero improves VLM reasoning performance across a wide range of multimodal benchmarks. MM-Zero establishes a scalable path toward self-evolving multi-model systems for multimodal models, extending the frontier of self-improvement beyond the conventional two-model paradigm.
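The GRPO objective mentioned in the abstract scores each sampled rollout relative to the other rollouts in its group, so no separate value network is needed. A minimal sketch of that group-relative advantage computation (the `eps` smoothing term is a common implementation detail, not something specified by this paper):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean and std.

    Rollouts that beat the group average get positive advantage and are
    reinforced; below-average rollouts get negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of the same task: two succeed, two fail.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Here the two successful rollouts receive an advantage near +1 and the failures near -1, which is what drives all three roles toward behaviors the reward functions favor.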