VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal

2026-01-09

Summary

This paper introduces a new method called VERSE that helps us understand how Vision-Language Models – which are AI systems that can 'see' images and understand text – work when dealing with complex documents like forms or charts.

What's the problem?

Vision-Language Models aren't always good at understanding visually rich documents, and it's hard to know *why* they make mistakes. Are they confused by certain types of images, fonts, or layouts? Without knowing the root cause, it's difficult to improve their performance. Essentially, these models act like a 'black box', and we need a way to peek inside and see what's going on.

What's the solution?

VERSE lets researchers visualize how these models 'see' and process images within documents. It maps the model's visual embeddings into a space that can be explored, then clusters them to identify the specific regions or features that cause problems. VERSE then guides the creation of new synthetic training examples that target those tricky areas, essentially 'teaching' the model to handle them better. The researchers validated this by training a model on the synthetic MERIT dataset and then checking whether it performed better on its real-world counterpart.
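To make the idea concrete, here is a minimal sketch of the reduce-then-cluster workflow described above. It is not the authors' actual pipeline: the embeddings, correctness labels, and the choice of PCA and K-means are all stand-in assumptions used purely to illustrate how clustering an embedding space can surface error-prone regions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-in for visual embeddings extracted from a VLM (one vector per
# document image): 200 samples, 64-dim, drawn from two separated blobs.
emb = np.vstack([
    rng.normal(0.0, 1.0, size=(100, 64)),
    rng.normal(5.0, 1.0, size=(100, 64)),
])

# Hypothetical per-sample correctness flags (1 = model answered correctly).
# The second blob is made error-prone on purpose for this illustration.
correct = np.concatenate([
    rng.random(100) < 0.9,   # blob A: ~90% correct
    rng.random(100) < 0.4,   # blob B: ~40% correct
])

# 1) Reduce the embedding space for exploration (PCA as a simple
#    stand-in for whatever projection a real pipeline would use).
reduced = PCA(n_components=2, random_state=0).fit_transform(emb)

# 2) Cluster the reduced space.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(reduced)

# 3) Rank clusters by accuracy: low-accuracy clusters are the
#    "problematic regions" to target with extra synthetic samples.
for c in range(2):
    acc = correct[labels == c].mean()
    print(f"cluster {c}: n={np.sum(labels == c)}, accuracy={acc:.2f}")

worst = min(range(2), key=lambda c: correct[labels == c].mean())
print("most error-prone cluster:", worst)
```

In a real setting, the flagged cluster's shared visual traits (fonts, layouts, image types) would inform what the new synthetic training samples should look like.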

Why it matters?

This work is important because it shows we can significantly improve the accuracy of document understanding AI, even with models that run on regular on-premise hardware, to the point where they can compete with – and sometimes even outperform – powerful cloud-based AI services. This means better automation for tasks like processing invoices, understanding medical records, or extracting data from reports, without necessarily requiring expensive resources.

Abstract

This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.