Towards Visual Text Grounding of Multimodal Large Language Model
Ming Li, Ruiyi Zhang, Jian Chen, Jiuxiang Gu, Yufan Zhou, Franck Dernoncourt, Wanrong Zhu, Tianyi Zhou, Tong Sun
2025-04-11

Summary
This paper is about teaching AI models to locate and point out specific text regions in complex document images, such as forms or charts, where words and layout are tightly mixed together.
What's the problem?
Current AI struggles to accurately link text answers to their exact locations in cluttered document images, leading to mistakes in tasks like form analysis or chart understanding.
What's the solution?
The researchers created a new benchmark (TRIG) with 800 manually annotated examples and 90k synthetic ones, using an OCR-LLM-human pipeline to map question-answer pairs to their text locations in the image, and then trained models on this data to learn those connections.
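To make the idea concrete, here is a minimal sketch of what such an OCR-to-LLM mapping step could look like; it is not the authors' implementation. It uses pytesseract for word-level OCR, and the ask_llm callable and the prompt format are placeholders you would swap for your own model API.

    # Minimal sketch of an OCR -> LLM grounding step (illustrative only).
    # Assumes pytesseract + Pillow; the LLM call is a hypothetical placeholder.
    from PIL import Image
    import pytesseract
    from pytesseract import Output

    def extract_words_with_boxes(image_path):
        """Run OCR and return a list of (word, (x1, y1, x2, y2)) tuples."""
        image = Image.open(image_path)
        data = pytesseract.image_to_data(image, output_type=Output.DICT)
        words = []
        for i, text in enumerate(data["text"]):
            if text.strip():  # skip empty OCR cells
                box = (data["left"][i], data["top"][i],
                       data["left"][i] + data["width"][i],
                       data["top"][i] + data["height"][i])
                words.append((text, box))
        return words

    def ground_answer(question, answer, words, ask_llm):
        """Ask a language model which OCR words support the answer; return their boxes.

        ask_llm is a hypothetical callable (prompt -> list of word indices);
        plug in whatever model API you actually use.
        """
        numbered = "\n".join(f"{i}: {w}" for i, (w, _) in enumerate(words))
        prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            f"OCR words:\n{numbered}\n"
            "Return the indices of the words that ground the answer."
        )
        indices = ask_llm(prompt)
        return [words[i][1] for i in indices]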
Why does it matter?
This helps AI assistants reliably answer questions about documents (like tax forms or infographics) while showing exactly where the info is, reducing errors and improving trust in automated systems.
Abstract
Despite the ongoing evolution of Multimodal Large Language Models (MLLMs), a non-negligible limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. Finetuning MLLMs on our synthetic dataset yields promising improvements in their spatial reasoning and grounding capabilities.
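The abstract does not spell out the evaluation protocol, but grounding predictions of this kind are commonly scored by intersection-over-union (IoU) against annotated boxes; the short sketch below shows that standard computation (the function names and the 0.5 threshold are illustrative assumptions, not necessarily TRIG's exact metric).

    # Illustrative IoU-based scoring for predicted vs. annotated answer boxes;
    # a common grounding metric, not necessarily the paper's exact protocol.
    def iou(box_a, box_b):
        """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0

    def grounding_accuracy(predictions, ground_truths, threshold=0.5):
        """Fraction of examples whose predicted box matches the gold box at IoU >= threshold."""
        hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
        return hits / len(ground_truths)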