SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson
2025-08-28
Summary
This paper investigates how consistently vision-language models, AI systems that can understand both images and text, reason when they are given the same information in different formats: as text or as an image.
What's the problem?
It's hard to tell whether these models are truly reasoning or just benefiting from the fact that image-based tests differ from text-based ones. For example, a model might do well on a text version of a math problem but fail when shown the same problem as an image. This could be because the task itself is different, or because the text version simply carries more information. The researchers wanted a way to compare the models' abilities across modalities directly, without these confounding factors.
What's the solution?
The researchers created a new benchmark called SEAM. It presents each problem in both text and image form, using existing standardized notation systems for each modality. Instead of simply converting images to text with OCR (which can be inaccurate), they paired distinct textual and visual representations that are semantically equivalent, meaning they convey exactly the same information. They then tested 21 different vision-language models on both formats to compare performance.
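The core idea, pairing one piece of content with two standardized notations, can be sketched with a toy example. Chess is one domain that has both a standard textual notation (FEN) and a visual one (board diagrams); the helper below is purely illustrative and not code from the paper. It expands a FEN string into a board grid, i.e., the same information a rendered board image would convey:

```python
# Toy illustration of a semantically equivalent pair: the same chess
# position as standardized text (FEN) and as a grid that could be
# rendered as an image. Both carry identical information.

def fen_to_grid(fen: str) -> list[list[str]]:
    """Expand the piece-placement field of a FEN string into an 8x8 grid.

    Digits in FEN denote runs of empty squares; letters denote pieces
    (uppercase = white, lowercase = black). Empty squares become ".".
    """
    rows = fen.split()[0].split("/")
    grid = []
    for row in rows:
        cells: list[str] = []
        for ch in row:
            if ch.isdigit():
                cells.extend(["."] * int(ch))  # run of empty squares
            else:
                cells.append(ch)               # a single piece
        grid.append(cells)
    return grid

# Standard starting position in FEN.
START = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print("\n".join(" ".join(rank) for rank in fen_to_grid(START)))
```

A benchmark built this way can ask the same question twice, once over the text notation and once over the rendered picture, so any gap in accuracy reflects the model's handling of the modality rather than a difference in the underlying problem.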
Why does it matter?
The results showed that models generally perform better on text-based problems than on image-based ones, even though both versions convey exactly the same information. This points to a weakness in how these models process visual input. By providing a fair, controlled comparison, SEAM helps researchers pinpoint exactly where these models struggle and develop AI systems that reason consistently regardless of how information is presented.
Abstract
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures from tokenization in domain notation and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.