VTCBench: Can Vision-Language Models Understand Long Context with Vision-Text Compression?
Hongbo Zhao, Meng Wang, Fei Zhu, Wenzhuo Liu, Bolin Ni, Fanhu Zeng, Gaofeng Meng, Zhaoxiang Zhang
2025-12-18
Summary
This paper investigates how well vision-language models (VLMs) can understand information when long texts are compressed into visual formats, a technique used to allow these models to process much larger amounts of text.
What's the problem?
Large language models struggle to handle very long inputs due to the massive computing power and memory needed. A solution called vision-text compression (VTC) turns text into images to reduce the amount of data the model needs to process, but it wasn't clear if this compression negatively impacted the model's ability to actually *understand* the long-form information, especially when it requires connecting ideas across the entire text.
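The arithmetic behind this trade-off can be sketched in a few lines. The defaults below (characters per text token, characters per rendered page, vision tokens per page image) are illustrative assumptions for the sketch, not figures from the paper:

```python
def vtc_compression_ratio(n_chars, chars_per_text_token=4.0,
                          chars_per_image=3000, tokens_per_image=256):
    """Estimate the token compression ratio of vision-text compression.

    All defaults are illustrative assumptions:
    - a text tokenizer averaging ~4 characters per token,
    - a rendered page image holding ~3000 characters,
    - a vision encoder emitting 256 tokens per page image.
    """
    text_tokens = n_chars / chars_per_text_token
    n_images = -(-n_chars // chars_per_image)  # ceiling division
    vision_tokens = n_images * tokens_per_image
    return text_tokens / vision_tokens

# A 120k-character document: ~30k text tokens vs. 40 page images x 256 tokens.
print(f"{vtc_compression_ratio(120_000):.1f}x")  # ~2.9x under these assumptions
```

Denser rendering (more characters per image, fewer vision tokens) pushes the ratio toward the higher end of the 3x-20x range the paper cites, which is exactly the regime where understanding, not just decoding, becomes the bottleneck.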
What's the solution?
The researchers created a new benchmark, called VTCBench, to specifically measure how well VLMs understand information presented in this compressed visual format. Its tests target three key abilities: retrieving specific facts, making inferences that connect ideas across the text, and answering questions based on a long conversation history. The researchers evaluated several leading models, both open-source and proprietary, on the standard benchmark and on a "wild" variant (VTCBench-Wild) that simulates more realistic, varied inputs.
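The retrieval setting can be illustrated with a minimal needle-in-a-haystack probe. This is a generic sketch of the technique, not the paper's actual test generator; the needle, question, and filler sentence are made up, and in a VTC setting the resulting context would be rendered to images before being shown to the VLM:

```python
def build_retrieval_probe(needle, depth, n_filler=200,
                          filler="The sky above the port was the color of television. "):
    """Embed `needle` at a relative `depth` (0.0 = start, 1.0 = end)
    inside a long filler context, then return the full context string."""
    assert 0.0 <= depth <= 1.0
    sentences = [filler] * n_filler
    pos = int(depth * len(sentences))
    sentences.insert(pos, needle + " ")
    return "".join(sentences)

# Place one unique fact halfway through a long context.
context = build_retrieval_probe(
    "The secret code for the vault is 7-4-1.", depth=0.5)
question = "What is the secret code for the vault?"
print(context.count("7-4-1"))  # the fact appears exactly once
```

Sweeping `depth` from 0.0 to 1.0 turns this into the familiar position-sensitivity test: a model that only decodes text locally may find needles near the start of an image but miss those deep in the compressed context.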
Why does it matter?
The results showed that even though models can 'read' the compressed images, they often struggle to grasp the overall meaning and relationships within the long texts. This research highlights a critical weakness in current VLMs and provides a foundation for developing better methods for handling long-context information efficiently and accurately, ultimately leading to more powerful and scalable AI systems.
Abstract
The computational and memory overheads associated with expanding the context window of LLMs severely limit their scalability. A noteworthy solution is vision-text compression (VTC), exemplified by frameworks like DeepSeek-OCR and Glyph, which convert long texts into dense 2D visual representations, thereby achieving token compression ratios of 3x-20x. However, the impact of this high information density on the core long-context capabilities of vision-language models (VLMs) remains under-investigated. To address this gap, we introduce the first benchmark for VTC and systematically assess the performance of VLMs across three long-context understanding settings: VTC-Retrieval, which evaluates the model's ability to retrieve and aggregate information; VTC-Reasoning, which requires models to infer latent associations to locate facts with minimal lexical overlap; and VTC-Memory, which measures comprehensive question answering within long-term dialogue memory. Furthermore, we establish VTCBench-Wild to simulate diverse input scenarios. We comprehensively evaluate leading open-source and proprietary models on our benchmarks. The results indicate that, despite being able to decode textual information (e.g., OCR) well, most VLMs exhibit a surprisingly poor long-context understanding ability with VTC-compressed information, failing to capture long associations or dependencies in the context. This study provides a deep understanding of VTC and serves as a foundation for designing more efficient and scalable VLMs.