Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Zhikai Wang, Jiashuo Sun, Wenqi Zhang, Zhiqiang Hu, Xin Li, Fan Wang, Deli Zhao
2025-04-29
Summary
This paper introduces VCBENCH, a new benchmark designed to test how well AI models can solve math problems that require understanding both pictures and text, especially when the answer depends on combining information from more than one image.
What's the problem?
Most existing math benchmarks for AI use only text or a single image, so it's hard to know whether these models can handle more complex situations where they must connect information across multiple visuals to figure out the answer.
What's the solution?
The researchers created VCBENCH, which includes math problems that specifically require the AI to pay attention to relationships between several images as well as the words in the problem. They used this benchmark to test how well current AI models can reason through these kinds of challenges.
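To make that evaluation setup concrete, here is a minimal sketch of what scoring a model on such a multi-image benchmark could look like. This is not VCBENCH's actual harness: the JSON field names (`question`, `image_paths`, `answer`) and the `query_model` helper are hypothetical stand-ins, since the real data format and model APIs are not described in this summary.

```python
import json

def query_model(question: str, image_paths: list[str]) -> str:
    """Hypothetical stand-in for a call to a vision-language model.
    A real harness would send the question text together with ALL
    of the problem's images in a single request."""
    raise NotImplementedError

def evaluate(benchmark_path: str) -> float:
    # Assumed format: a JSON list of problems, each with the question
    # text, the paths of every image it depends on, and a gold answer.
    with open(benchmark_path) as f:
        problems = json.load(f)
    correct = 0
    for p in problems:
        # Pass every image at once: the point of explicit visual
        # dependency is that no single image suffices to answer.
        pred = query_model(p["question"], p["image_paths"])
        correct += pred.strip() == p["answer"]
    return correct / len(problems)
```

The key design point such a loop illustrates is that accuracy only measures cross-image reasoning if each problem is constructed so that looking at any one image in isolation cannot yield the answer.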
Why it matters?
This matters because it helps us see where AI needs to improve in order to solve real-world problems that involve both math and visual information, like reading graphs, analyzing diagrams, or working with data in science and engineering.
Abstract
VCBENCH is a benchmark for multimodal mathematical reasoning with explicit visual dependencies, assessing large vision-language models (LVLMs) on tasks that require reasoning across multiple images.