Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation
Yue Yang, Ajay Patel, Matt Deitke, Tanmay Gupta, Luca Weihs, Andrew Head, Mark Yatskar, Chris Callison-Burch, Ranjay Krishna, Aniruddha Kembhavi, Christopher Clark
2025-02-21
Summary
This paper introduces CoSyn, a new system that helps AI understand images with lots of text in them, like charts and documents. It does this by using AI to create a huge set of practice examples for other AI models to learn from.
What's the problem?
AI models that work with both images and text (called vision-language models, or VLMs) have a hard time understanding images with lots of text in them. This is because there aren't enough good examples of these kinds of images for the AI to learn from. It's like trying to learn a new language without enough textbooks or practice materials.
What's the solution?
The researchers created CoSyn, which uses an AI that's good at writing code to make synthetic but realistic images with text. It works by taking a simple description (like 'make a nutrition label') and turning it into computer code (in languages like Python, HTML, or LaTeX) that renders an image. Because the code is a faithful text description of the image, CoSyn can then use it to generate both the image and instructions about the image. They made 400,000 images and 2.7 million instructions this way, giving AI models a ton of practice material.
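The pipeline described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the two LLM calls are replaced by stubs (the real system prompts a text-only LLM at each step, and the rendering code may be Python, HTML, or LaTeX rather than the SVG string built here).

```python
# Hypothetical sketch of the CoSyn pipeline. The two llm_* functions are
# stubs standing in for real LLM calls; only the data flow is faithful:
# topic -> rendering code -> (image, instruction data derived from the code).

def llm_generate_render_code(topic: str) -> str:
    """Stub for the LLM call that writes rendering code for a topic.
    Here the 'code' is Python that builds a tiny SVG image as a string."""
    return (
        "svg = (\n"
        "    '<svg xmlns=\"http://www.w3.org/2000/svg\" width=\"200\" height=\"60\">'\n"
        f"    '<text x=\"10\" y=\"30\">{topic}</text>'\n"
        "    '</svg>'\n"
        ")\n"
    )

def render(code: str) -> str:
    """Execute the generated code and return the rendered artifact."""
    scope: dict = {}
    exec(code, scope)
    return scope["svg"]

def llm_generate_instructions(code: str, topic: str) -> list[dict]:
    """Stub for the LLM call that writes instruction-tuning Q&A pairs,
    conditioned on the code as a textual stand-in for the image."""
    return [{"question": "What does this image show?", "answer": topic}]

def cosyn_example(topic: str) -> dict:
    """Produce one synthetic (image, instructions) training example."""
    code = llm_generate_render_code(topic)
    return {
        "image": render(code),
        "instructions": llm_generate_instructions(code, topic),
    }
```

Running `cosyn_example("nutrition label")` yields a rendered image plus Q&A pairs about it; scaling this loop across many topics and rendering languages is what produces the 400K-image dataset.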
Why it matters?
This matters because it helps AI get much better at understanding complex images with text, which is important for things like reading charts, understanding documents, or even helping AI agents interact with the real world. AI models trained on CoSyn's synthetic images outperformed some of the most advanced systems available, including proprietary models like GPT-4V and Gemini 1.5 Flash. This could lead to smarter AI assistants, better document processing systems, and multimodal agents that can ground and act on visual information in the real world.
Abstract
Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data. Comprehensive experiments on seven benchmarks demonstrate that models trained on our synthetic data achieve state-of-the-art performance among competitive open-source models, including Llama 3.2, and surpass proprietary models such as GPT-4V and Gemini 1.5 Flash. Furthermore, CoSyn can produce synthetic pointing data, enabling VLMs to ground information within input images, showcasing its potential for developing multimodal agents capable of acting in real-world environments.