ProactiveBench: Benchmarking Proactiveness in Multimodal Large Language Models
Thomas De Min, Subhankar Roy, Stéphane Lathuilière, Elisa Ricci, Massimiliano Mancini
2026-03-23
Summary
This paper investigates whether multimodal large language models (MLLMs), which can understand both images and text, can proactively ask for help when they're unsure about something, much as humans do.
What's the problem?
MLLMs are usually passive: they attempt to answer questions even when they lack the necessary information, for example when an object is hidden from view. Humans naturally ask for assistance in these situations, and the researchers wanted to see whether MLLMs could learn to do the same. Essentially, the problem is that current MLLMs don't know *when* they don't know, and so never ask for clarification or intervention.
What's the solution?
The researchers created a new benchmark called ProactiveBench, built from seven repurposed datasets, with tasks designed to test whether MLLMs will ask for help. These tasks involve things like identifying occluded objects, improving blurry images, and interpreting rough sketches. They tested 22 different MLLMs on this benchmark and found that most of them did not proactively ask for help, regardless of how capable the model was. Giving the models hints yielded only marginal gains, and, surprisingly, conversation history and in-context examples introduced biases that actually hurt performance. Finally, they used reinforcement learning to *train* a model to be more proactive, and found that it could learn to ask for help, even in situations it had not seen during training.
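To make the evaluation idea concrete, here is a minimal, hypothetical sketch of how one might score "proactiveness": classify each model reply as either a direct answer or a request for user intervention, then report the fraction of replies that ask for help. The keyword heuristic and function names below are illustrative assumptions, not ProactiveBench's actual protocol.

```python
import re

# Illustrative keyword patterns for detecting a help request in a model
# reply (an assumption; the benchmark's real detection may differ).
HELP_PATTERNS = [
    r"\bcould you (remove|move|clarify|show)\b",
    r"\bplease (remove|provide|clarify)\b",
    r"\bi need (more|a clearer)\b",
]

def is_proactive(reply: str) -> bool:
    """Return True if the reply requests a user intervention."""
    text = reply.lower()
    return any(re.search(p, text) for p in HELP_PATTERNS)

def proactiveness_rate(replies: list[str]) -> float:
    """Fraction of replies that ask for help instead of guessing."""
    if not replies:
        return 0.0
    return sum(is_proactive(r) for r in replies) / len(replies)

replies = [
    "It looks like a cat.",                           # guesses anyway
    "Could you remove the box covering the object?",  # asks for help
    "I need a clearer image to answer.",              # asks for help
]
print(proactiveness_rate(replies))  # → 0.6666666666666666
```

The same pass/fail signal could also serve as a reward for the reinforcement-learning stage described above: replies that appropriately request intervention get rewarded, blind guesses do not.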
Why it matters?
This research is important because truly intelligent AI should be able to recognize its own limitations and ask for help when needed. Building proactive models could lead to more reliable and helpful AI systems that can better collaborate with humans. It shows that proactiveness isn't just about how powerful a model is, but about teaching it *when* to seek assistance, and provides a benchmark for future research in this area.
Abstract
Effective collaboration begins with knowing when to ask for help. For example, when trying to identify an occluded object, a human would ask someone to remove the obstruction. Can MLLMs exhibit a similar "proactive" behavior by requesting simple user interventions? To investigate this, we introduce ProactiveBench, a benchmark built from seven repurposed datasets that tests proactiveness across different tasks such as recognizing occluded objects, enhancing image quality, and interpreting coarse sketches. We evaluate 22 MLLMs on ProactiveBench, showing that (i) they generally lack proactiveness; (ii) proactiveness does not correlate with model capacity; (iii) "hinting" at proactiveness yields only marginal gains. Surprisingly, we find that conversation histories and in-context learning introduce negative biases, hindering performance. Finally, we explore a simple fine-tuning strategy based on reinforcement learning: its results suggest that proactiveness can be learned, even generalizing to unseen scenarios. We publicly release ProactiveBench as a first step toward building proactive multimodal models.