VCR: Visual Caption Restoration
Tianyu Zhang, Suyuchen Wang, Lu Li, Ge Zhang, Perouz Taslakian, Sai Rajeswar, Jie Fu, Bang Liu, Yoshua Bengio
2024-06-13
Summary
This paper introduces Visual Caption Restoration (VCR), a new task that challenges AI models to restore text that is partially hidden in images using visual clues. The goal is to improve how machines understand and interpret text embedded within images.
What's the problem?
Text embedded in images is different from regular text because interpreting it requires combining visual understanding with language processing. Traditional methods for handling text in images typically rely on optical character recognition (OCR), which breaks down when the text is partially obscured. Accurately restoring hidden text is therefore difficult for AI models: they must jointly reason over the surrounding image context and the small visible fragments of the text itself.
What's the solution?
To tackle this problem, the authors created a dataset called VCR-Wiki, comprising 2.11 million English and 346,000 Chinese image-caption examples. They developed a pipeline that generates synthetic images with adjustable caption visibility, allowing them to control how much of the text is hidden and thereby the task difficulty. The dataset is designed to train and evaluate AI models on aligning visual content with embedded text. Their experiments show that existing vision-language models perform far below human level at restoring the obscured text, and that simply fine-tuning these models on the dataset does not close the gap.
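The core idea of adjustable caption visibility can be illustrated with a minimal pure-Python sketch. This is not the paper's actual pipeline (which renders real caption images at the pixel level); here a rendered caption is stood in for by a hypothetical binary bitmap (nested lists of 0/1), and a `visible_ratio` parameter controls how much of each character's height stays exposed, mirroring how the easy and hard splits could differ in difficulty:

```python
def mask_caption_bitmap(bitmap, visible_ratio=0.3):
    """Hypothetical sketch of adjustable caption masking.

    `bitmap` is a 2D list of 0/1 values standing in for a rendered
    caption image. Only the top `visible_ratio` fraction of rows is
    kept visible; the rest is blanked, leaving the "tiny exposed
    areas" the VCR task is built around.
    """
    height = len(bitmap)
    # Always leave at least one row visible so some pixel-level hint remains.
    keep = max(1, int(height * visible_ratio))
    return [row[:] if i < keep else [0] * len(row)
            for i, row in enumerate(bitmap)]


# Demo: a 4x4 "caption" with every pixel set, masked at 50% visibility.
caption = [[1, 1, 1, 1]] * 4
masked = mask_caption_bitmap(caption, visible_ratio=0.5)
print(masked)  # top two rows survive, bottom two rows are blanked
```

A lower `visible_ratio` would correspond to a harder split, since less of each glyph survives for the model to anchor its restoration on.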
Why it matters?
This research is important because it highlights the challenges in teaching AI systems to accurately interpret and restore text within images. By releasing the VCR-Wiki dataset and the methods used to create it, the authors aim to encourage further research and development of more advanced models that can effectively bridge the gap between visual understanding and language processing.
Abstract
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision-language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.