VisR-Bench: An Empirical Study on Visual Retrieval-Augmented Generation for Multilingual Long Document Understanding
Jian Chen, Ming Li, Jihyung Kil, Chenguang Wang, Tong Yu, Ryan Rossi, Tianyi Zhou, Changyou Chen, Ruiyi Zhang
2025-08-12
Summary
This paper introduces VisR-Bench, a new benchmark designed to evaluate how well AI models can find and understand information in long documents that combine text, images, and tables, across many different languages. It focuses on question-driven retrieval: the model is given a question and must locate the right parts of a document to answer it.
What's the problem?
The problem is that most existing benchmarks for document understanding either cover only English or only very short, single-page documents. They handle long multilingual documents poorly, and they often let AI systems succeed by simply matching keywords rather than truly understanding the content. This makes it hard to fairly evaluate how well models retrieve and understand complex, multimodal information across languages.
What's the solution?
The researchers created VisR-Bench, which contains over 35,000 high-quality question-answer pairs drawn from about 1,200 long documents in sixteen different languages. The questions span several types, covering figures, tables, and text passages. Importantly, some queries have no direct answer in the document, so models cannot rely on easy keyword matching. The authors evaluated a wide range of models, including text-based retrievers, multimodal retrievers, and large multilingual vision-language models, to see how they perform on this challenge.
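To make the evaluation concrete, here is a minimal sketch of how question-driven retrieval over long documents is commonly scored, using recall@k at the page level. This is an illustrative assumption, not the paper's actual metric or data format: the function names, page/question identifiers, and the choice of recall@k are all hypothetical.

```python
# Hypothetical sketch of page-level retrieval scoring (recall@k).
# VisR-Bench's exact metric and data format are not specified here.
from typing import Dict, List

def recall_at_k(
    rankings: Dict[str, List[str]],  # question id -> pages ranked by model score
    gold_pages: Dict[str, str],      # question id -> page containing the answer
    k: int = 1,
) -> float:
    """Fraction of questions whose gold page appears in the top-k ranking."""
    if not rankings:
        return 0.0
    hits = sum(
        1 for qid, ranked in rankings.items()
        if gold_pages.get(qid) in ranked[:k]
    )
    return hits / len(rankings)

# Toy example: two questions over a three-page document.
rankings = {"q1": ["p2", "p1", "p3"], "q2": ["p1", "p3", "p2"]}
gold = {"q1": "p2", "q2": "p3"}
print(recall_at_k(rankings, gold, k=1))  # q1 is a hit, q2 a miss -> 0.5
```

Under this kind of metric, a model that merely keyword-matches can rank the wrong page first on queries whose answers are paraphrased or visual, which is exactly the failure mode the benchmark's unanswerable and figure/table questions are meant to expose.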
Why it matters?
This matters because many important documents worldwide are long and mix pictures, tables, and text in different languages. A strong benchmark like VisR-Bench helps researchers identify which AI models work best on these kinds of documents and where improvements are needed. This leads to smarter AI for document analysis and question answering over complex information in real-world multilingual and multimodal settings.
Abstract
VisR-Bench is a multilingual benchmark for question-driven multimodal retrieval in long documents, evaluating various models across different languages and question types.