
When Models Judge Themselves: Unsupervised Self-Evolution for Multimodal Reasoning

Zhengxian Wu, Kai Shi, Chuanrui Zhang, Zirui Liao, Jun Yang, Ni Yang, Qiuying Peng, Luyuan Zhang, Hangrui Xu, Tianhuang Su, Zhenyu Yang, Haonan Lu, Haoqian Wang

2026-03-26


Summary

This paper introduces a new way to train AI models that can reason using both text and images, without needing humans to constantly label data or create perfect examples for the AI to learn from.

What's the problem?

Current AI models that are good at reasoning – like solving math problems based on a picture – usually require a lot of carefully prepared, labeled data, or they learn by imitating a really good 'teacher' AI. Both of these methods are expensive and don't easily scale up to more complex problems or larger datasets. Basically, it's hard to make these AI systems better without a lot of human effort or relying on other complex AI systems.

What's the solution?

The researchers built an unsupervised self-evolution framework on top of an existing training technique called Group Relative Policy Optimization (GRPO). The AI generates multiple different attempts at solving a problem, then evaluates those attempts itself. Instead of relying on the absolute score each attempt gets, it focuses on which attempts are better *compared to each other* within a group. This 'self-judging' process, combined with a bounded mechanism that continuously adjusts how much weight is given to better attempts, allows the AI to improve its reasoning skills using only unlabeled data. It's like the AI teaching itself through trial and error and internal feedback.
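To make the "compared to each other" idea concrete, here is a minimal sketch of the group-relative step: each attempt's score (here, a made-up number standing in for the model's own judgment) is converted into an advantage by normalizing against the group's mean and spread. The function name and scores are illustrative, not taken from the paper's code.

```python
import statistics

def group_relative_advantages(scores):
    """Turn absolute per-attempt scores into relative advantages
    within one group (GRPO-style normalization, illustrative sketch)."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        # All attempts scored the same: no relative signal to learn from.
        return [0.0 for _ in scores]
    return [(s - mean) / std for s in scores]

# Four attempts at the same problem, scored by the model's own judge
# (scores are invented for illustration).
scores = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(scores)
```

Because advantages are centered within each group, roughly half the attempts get a positive learning signal and half a negative one, regardless of how hard the problem is; this is what makes the update robust without any ground-truth labels.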

Why it matters?

This research is important because it offers a way to build more powerful and adaptable AI reasoning systems without the huge costs associated with human labeling or complex teacher models. This makes it more practical to create AI that can solve real-world problems, and it opens the door to AI that can continuously learn and improve on its own, making it a scalable path towards more advanced AI.

Abstract

Recent progress in multimodal large language models has led to strong performance on reasoning tasks, but these improvements largely rely on high-quality annotated data or teacher-model distillation, both of which are costly and difficult to scale. To address this, we propose an unsupervised self-evolution training framework for multimodal reasoning that achieves stable performance improvements without using human-annotated answers or external reward models. For each input, we sample multiple reasoning trajectories and jointly model their within-group structure. We use the Actor's self-consistency signal as a training prior, and introduce a bounded Judge-based modulation to continuously reweight trajectories of different quality. We further model the modulated scores as a group-level distribution and convert absolute scores into relative advantages within each group, enabling more robust policy updates. Trained with Group Relative Policy Optimization (GRPO) on unlabeled data, our method consistently improves reasoning performance and generalization on five mathematical reasoning benchmarks, offering a scalable path toward self-evolving multimodal models. The code is available at https://github.com/OPPO-Mente-Lab/LLM-Self-Judge.