Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO

Muzhi Zhu, Hao Zhong, Canyu Zhao, Zongze Du, Zheng Huang, Mingyu Liu, Hao Chen, Cheng Zou, Jingdong Chen, Ming Yang, Chunhua Shen

2025-05-28

Summary

This paper introduces ACTIVE-O3, a new method that helps advanced AI models not just see images or videos, but also decide where to look and what to focus on, much like how humans direct their attention to important details when solving problems.

What's the problem?

The problem is that most multimodal AI models, which work with both images and text, aren't very good at actively choosing the most useful parts of what they see. They often waste time and computing power processing everything, and they can miss small but important details, especially in tasks like finding tiny objects or understanding complex scenes.

What's the solution?

The researchers created ACTIVE-O3, which uses reinforcement learning to train these AI models to make smarter decisions about where to look and what to pay attention to. They tested this approach on a variety of tasks, like spotting small objects in crowded scenes or analyzing images for self-driving cars, and showed that their method made the models more accurate and efficient.
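The "GRPO" in the paper's title refers to Group Relative Policy Optimization, a reinforcement learning method in which the model samples a group of candidate actions, scores each with a reward, and reinforces those that beat the group average. Below is a minimal sketch of that group-relative advantage step, assuming standard GRPO; the function name and the toy reward values are illustrative, not from the paper.

```python
def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled action's reward
    against the mean and standard deviation of its sampling group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # guard against all-identical rewards
    return [(r - mean) / std for r in rewards]

# Toy example: the model proposes G=4 candidate "zoom-in" regions for one
# image, and each region earns a task reward (e.g., was the small object found?).
region_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(region_rewards)
# Regions with above-average reward get positive advantages and are
# reinforced; below-average regions are discouraged.
```

Applied to active perception, the "actions" are choices of where to look, so over training the model learns to select regions that most improve its answers.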

Why it matters?

This matters because it means AI can become much better at understanding complicated images and videos, which is important for things like robotics, autonomous vehicles, and any technology where seeing and making decisions quickly and accurately is critical.

Abstract

A reinforcement learning framework, ACTIVE-O3, is proposed to equip Multimodal Large Language Models with active perception capabilities and tested across various tasks and benchmarks.