ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation
Yongyuan Liang, Wei Chow, Feng Li, Ziqiao Ma, Xiyao Wang, Jiageng Mao, Jiuhai Chen, Jiatao Gu, Yue Wang, Furong Huang
2025-11-04
Summary
This paper introduces a new way to test how well artificial intelligence models can truly understand and connect information from both images and text, going beyond just processing each separately.
What's the problem?
Current AI models that handle both images and text are often tested on tasks that really only require understanding *either* images *or* text, not both together. For example, a model might be good at describing a picture, or answering questions about a story, but not at using the text to improve the image it creates, or using an image to help it reason through a question. This means we don't know how well these models can actually use information from one type of data to enhance their understanding of the other.
What's the solution?
The researchers created a benchmark called ROVER, with 1,312 human-annotated tasks grounded in 1,876 images, designed specifically to test 'reciprocal cross-modal reasoning'. This means testing whether a model can use text to guide image creation, and whether it can create images to help it think through and answer questions. They evaluated 17 different AI models on these tasks, measuring how well each could combine information from both modalities.
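The two settings can be pictured as a simple evaluation loop: one branch scores image generation guided by text, the other scores question answering aided by a self-generated visualization. This is a minimal illustrative sketch only; all names (`generate_image`, `answer`, the task fields) and the scoring logic are assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of ROVER's two evaluation settings; every name and
# the scoring rule here is illustrative, not the benchmark's real interface.

def eval_verbal_to_visual(model, task):
    """Verbally-augmented reasoning for visual generation: the model
    follows a textual prompt/reasoning chain to synthesize an image,
    which a judge (e.g., a human annotator) scores for faithfulness."""
    image = model.generate_image(task["prompt"])
    return task["judge"](image)  # 1.0 if faithful to the prompt, else 0.0

def eval_visual_to_verbal(model, task):
    """Visually-augmented reasoning for verbal generation: the model
    first draws an intermediate visualization, then answers the
    question conditioned on its own sketch."""
    sketch = model.generate_image(task["question"])
    answer = model.answer(task["question"], visual_context=sketch)
    return 1.0 if answer == task["gold"] else 0.0

def benchmark(model, tasks):
    """Average score across a mixed set of tasks from both settings."""
    scores = [
        eval_verbal_to_visual(model, t) if t["setting"] == "v2i"
        else eval_visual_to_verbal(model, t)
        for t in tasks
    ]
    return sum(scores) / len(scores)
```

The key point the sketch captures is that both settings interleave modalities inside a single model: the second branch scores the model on an answer conditioned on an image it generated itself, which is why stitching together separate unimodal models falls short.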
Why it matters?
The results show that the ability to reason across images and text is crucial for creating truly intelligent AI. Models that can effectively use one modality to improve the other perform much better, and current models struggle with tasks that require abstract, symbolic visual thinking, even when they handle concrete, perceptual concepts well. This highlights the need to develop AI that can seamlessly integrate and reason across different types of data.
Abstract
Unified multimodal models (UMMs) have emerged as a powerful paradigm for seamlessly unifying text and image understanding and generation. However, prevailing evaluations treat these abilities in isolation, such that tasks with multimodal inputs and outputs are scored primarily through unimodal reasoning, i.e., textual benchmarks emphasize language-based reasoning, while visual benchmarks emphasize reasoning outcomes manifested in the pixels. We introduce ROVER to address this pressing need to test reciprocal cross-modal reasoning, the use of one modality to guide, verify, or refine outputs in the other, an ability central to the vision of unified multimodal intelligence. ROVER is a human-annotated benchmark that explicitly targets reciprocal cross-modal reasoning, which contains 1312 tasks grounded in 1876 images, spanning two complementary settings. Verbally-augmented reasoning for visual generation evaluates whether models can use verbal prompts and reasoning chains to guide faithful image synthesis. Visually-augmented reasoning for verbal generation evaluates whether models can generate intermediate visualizations that strengthen their own reasoning processes for question answering. Experiments on 17 unified models reveal two key findings: (i) Cross-modal reasoning determines visual generation quality, with interleaved models significantly outperforming non-interleaved ones; notably, combining strong unimodal models fails to achieve comparable reasoning. (ii) Models show dissociation between physical and symbolic reasoning: they succeed at interpreting perceptual concepts literally but fail to construct visual abstractions for symbolic tasks, where faulty reasoning harms performance. These results highlight reciprocal cross-modal reasoning as a critical frontier for enabling true omnimodal generation.