
VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Ofir Abramovich, Niv Nayman, Sharon Fogel, Inbal Lavi, Ron Litman, Shahar Tsiper, Royee Tichauer, Srikar Appalaraju, Shai Mazor, R. Manmatha

2024-07-22


Summary

This paper introduces VisFocus, a new method for understanding documents without needing Optical Character Recognition (OCR). It makes better use of the vision encoder by conditioning it directly on the language prompt, so the visual features it produces are tailored to the question being asked.

What's the problem?

Traditional document-understanding pipelines often rely on an external OCR model to extract text, which adds cost and latency and can lose layout and visual context. In addition, existing models typically feed the user's question only to the language component, so the vision encoder must encode the entire document without knowing which parts matter, which is a real limitation for dense, text-heavy pages.

What's the solution?

VisFocus addresses these issues by coupling the vision encoder directly with the language prompt instead of treating the image and the question as separate inputs. Concretely, it replaces the encoder's down-sampling layers with prompt-aware layers that receive the question and can highlight the parts of the document relevant to it while disregarding the rest. The model is also pre-trained with a new task in which a masked snippet of the document's own text is fed to the visual encoder in place of a prompt, teaching the model to focus its attention on the text patches that matter.
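To make the architecture change concrete, here is a minimal, hypothetical sketch (in PyTorch) of what a prompt-aware down-sampling layer could look like: visual patch tokens cross-attend to embedded prompt tokens before a standard 2x2 patch merge. This is not the authors' implementation; the class name PromptGuidedMerge, the dimensions, and the merging scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PromptGuidedMerge(nn.Module):
    """Hypothetical prompt-aware down-sampling block (illustration only)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Cross-attention: queries are visual patch tokens, keys/values are prompt tokens.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # 2x2 patch merging halves each spatial dimension of the patch grid.
        self.reduce = nn.Linear(4 * d_model, d_model)

    def forward(self, patches: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
        # patches:    (B, H*W, d_model) visual tokens from the previous stage
        # prompt_emb: (B, T,   d_model) embedded prompt (question) tokens
        attended, _ = self.cross_attn(query=patches, key=prompt_emb, value=prompt_emb)
        patches = self.norm(patches + attended)  # patches now conditioned on the prompt
        b, n, d = patches.shape
        side = int(n ** 0.5)
        x = patches.reshape(b, side, side, d)
        # Concatenate each 2x2 neighborhood, then project back to d_model.
        x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                       x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)
        return self.reduce(x.reshape(b, -1, 4 * d))  # (B, H*W/4, d_model)


# Example: a 64x64 patch grid and a 16-token prompt -> 32x32 prompt-aware merged tokens.
layer = PromptGuidedMerge()
out = layer(torch.randn(2, 64 * 64, 256), torch.randn(2, 16, 256))
print(out.shape)  # torch.Size([2, 1024, 256])
```

The key design point the sketch tries to convey is that relevance to the prompt is injected before resolution is reduced, so information about patches the question cares about is less likely to be thrown away during down-sampling.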

Why it matters?

This research is significant because it improves how AI models understand complex documents without relying on OCR. By allowing models to focus on relevant information directly related to user queries, VisFocus can enhance performance in various applications, such as document analysis and information retrieval, making it easier for users to get accurate answers from dense texts.

Abstract

In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, necessitating the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder in place of the prompt, to empower the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.
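The pre-training task described in the abstract can be pictured as follows: instead of a user question, the vision encoder receives a snippet of the document's own text with some words masked, and the language model is trained to recover them, which pushes the encoder to attend to the image patches containing those words. The function below is a minimal, hypothetical sketch of how such a masked snippet might be constructed; the mask token, masking rate, and word-level granularity are assumptions, not the paper's exact recipe.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder; a real tokenizer's mask token would be used


def build_masked_prompt(snippet_words, mask_prob=0.15, seed=None):
    """Mask roughly `mask_prob` of the words in a document snippet.

    Returns the masked snippet (fed to the vision encoder in place of a prompt)
    and the (position, word) pairs the language model must reconstruct.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for i, word in enumerate(snippet_words):
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)
            targets.append((i, word))
        else:
            masked.append(word)
    return " ".join(masked), targets


# Example: a snippet taken from the page text itself serves as the "prompt".
prompt, targets = build_masked_prompt("Invoice total due by March 31 2024".split(), mask_prob=0.3)
# `prompt` is the snippet with some words replaced by <mask>;
# `targets` lists which words the language model must recover.
```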