Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO

Lai Wei, Yuting Li, Chen Wang, Yue Wang, Linghe Kong, Weiran Huang, Lichao Sun

2025-05-29

Summary

This paper introduces MM-UPT, a new framework that helps AI models which work with both text and images get better at reasoning, without needing people to label or correct their training data.

What's the problem?

Multi-modal large language models, which are supposed to understand and reason about both words and pictures, usually need large amounts of manually labeled data to improve. Collecting and annotating that data takes a great deal of human time and effort.

What's the solution?

To solve this, the researchers created MM-UPT, which combines GRPO (Group Relative Policy Optimization) with self-rewarding: the model generates multiple answers, scores them itself, and uses those scores as the training signal. This lets it keep learning and improving on its own, without any manual annotations.
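To make the idea concrete, here is a minimal sketch of how a self-rewarding GRPO update signal could be computed. It assumes majority voting among sampled answers as the self-reward (one common choice for unsupervised rewards); the function names and details are illustrative, not taken from the paper's code.

```python
from collections import Counter

def self_rewards(answers):
    # Hypothetical majority-vote self-reward: an answer earns 1.0 if it
    # matches the most common answer in its own sampled group, else 0.0.
    # No human labels are needed -- the group supervises itself.
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def grpo_advantages(rewards, eps=1e-8):
    # GRPO normalizes each reward against its group's mean and standard
    # deviation, so no separate learned value model is required.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to the same multiple-choice question.
answers = ["B", "B", "C", "B"]
rewards = self_rewards(answers)        # majority "B" gets reward 1.0
advantages = grpo_advantages(rewards)  # positive for "B", negative for "C"
```

Answers that agree with the majority get a positive advantage (their likelihood is pushed up), while outliers get a negative one, which is how the model can improve from its own consensus rather than from human annotations.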

Why does it matter?

This is important because it means AI systems can become more capable at reasoning over images and text together without relying on huge amounts of human-labeled data, making AI training faster, cheaper, and more scalable.

Abstract

MM-UPT, a framework employing GRPO and self-rewarding, enhances multi-modal LLMs through unsupervised continual learning, showing performance improvements without manual annotations.