RBench-V: A Primary Assessment for Visual Reasoning Models with Multi-modal Outputs

Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin-Sheng Chen, Yi Zhang, Kiyohiro Nakayama, Zhengyang Geng, Houwen Peng, Han Hu, Shi-Min Hu

2025-05-26

Summary

This paper introduces RBench-V, a new benchmark designed to measure how well AI models can reason and solve problems using both images and text, especially when they must create or modify images as part of their answer.

What's the problem?

The problem is that while many AI models can handle text and images separately, they often struggle when a task requires combining both skills, such as understanding a picture and then drawing or editing something based on that understanding.

What's the solution?

The researchers built RBench-V, a benchmark that challenges models with tasks whose reasoning genuinely depends on producing visual output, such as manipulating images or constructing auxiliary lines, to measure how well they handle multi-modal outputs.

Why it matters?

This matters because it pinpoints where current AI models are still weak at combining vision and language, helping researchers know what to improve so future models can be more useful in areas like education, design, and problem-solving.

Abstract

A benchmark called RBench-V evaluates multi-modal models' vision-indispensable reasoning through image manipulation and auxiliary line construction, demonstrating that current models struggle with multi-modal outputs.