
SO-Bench: A Structural Output Evaluation of Multimodal LLMs

Di Feng, Kaixin Ma, Feng Nan, Haofeng Chen, Bohan Zhai, David Griffiths, Mingfei Gao, Zhe Gan, Eshan Verma, Yinfei Yang, Zhifeng Chen, Afshin Dehghan

2025-12-01

Summary

This paper investigates how well current AI models that can 'see' and 'understand' language (called Multimodal Large Language Models or MLLMs) can extract information from images and present it in a specific, organized format like a JSON file.

What's the problem?

While these AI models are getting better at understanding both images and text, there hasn't been a good way to test how well they can pull specific details *from* images and organize that information according to a pre-defined structure, like a database schema. Existing tests focus mostly on text-based organization, not visual information. This means we don't really know how reliable these models are when we need them to, for example, automatically fill out a form based on a screenshot.

What's the solution?

The researchers created a new benchmark called SO-Bench. This benchmark includes over 6,500 different ways to structure data (JSON schemas) and 1,800 image-schema pairs where humans have verified the correct information to extract. They then tested several AI models, both publicly available and cutting-edge proprietary ones, using this benchmark. They also experimented with training a model to improve its ability to produce structured outputs.
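To make the task concrete, here is a minimal sketch of what a schema-grounded extraction check might look like. The schema, field names, and validator below are illustrative assumptions, not taken from SO-Bench itself: the idea is simply that a model's answer must both parse as JSON and conform to a predefined structure, rather than being free-form text.

```python
import json

# Hypothetical example schema, like one a benchmark entry might pair with
# an image (e.g. a photo of a receipt). Not an actual SO-Bench schema.
RECEIPT_SCHEMA = {
    "required": {"merchant": str, "total": float, "items": list},
}

def conforms(output: str, schema: dict) -> bool:
    """Check that a model's raw text parses as JSON and has the
    required fields with the expected types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    for key, expected_type in schema["required"].items():
        if key not in data or not isinstance(data[key], expected_type):
            return False
    return True

# A schema-compliant answer passes; an equally "correct" free-text
# answer fails, because it cannot be consumed by downstream tools.
good = '{"merchant": "Cafe Luna", "total": 12.50, "items": ["latte"]}'
bad = "The receipt is from Cafe Luna and the total is $12.50."
print(conforms(good, RECEIPT_SCHEMA))  # True
print(conforms(bad, RECEIPT_SCHEMA))   # False
```

A real benchmark would additionally score whether the extracted values match human-verified ground truth, but even this toy check illustrates why schema compliance is evaluated separately from answer correctness.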

Why it matters?

This work is important because as AI models become more integrated into everyday tasks, like automating workflows or interacting with apps, they need to be able to reliably extract information from visual sources and present it in a usable format. This research highlights the current limitations of these models and provides a tool for improving their ability to handle these kinds of tasks, ultimately making them more helpful and trustworthy.

Abstract

Multimodal large language models (MLLMs) are increasingly deployed in real-world, agentic settings where outputs must not only be correct, but also conform to predefined data schemas. Despite recent progress in structured generation in the textual domain, there is still no benchmark that systematically evaluates schema-grounded information extraction and reasoning over visual inputs. In this work, we conduct a comprehensive study of visual structured-output capabilities of MLLMs with our carefully designed SO-Bench benchmark. Covering four visual domains, including UI screens, natural images, documents, and charts, SO-Bench is built from over 6.5K diverse JSON schemas and 1.8K curated image-schema pairs with human-verified quality. Benchmarking experiments on open-source and frontier proprietary models reveal persistent gaps in predicting accurate, schema-compliant outputs, highlighting the need for better multimodal structured reasoning. Beyond benchmarking, we further conduct training experiments that substantially improve the model's structured output capability. We plan to make the benchmark available to the community.