Should VLMs be Pre-trained with Image Data?
Sedrick Keh, Jean Mercat, Samir Yitzhak Gadre, Kushal Arora, Igor Vasiljevic, Benjamin Burchfiel, Shuran Song, Russ Tedrake, Thomas Kollar, Ludwig Schmidt, Achal Dave
2025-03-11
Summary
This paper investigates whether AI models that understand both images and text should learn from pictures from the very beginning, or add them later after they already know a lot about language. It tests different training schedules to find the best balance.
What's the problem?
When you train AI models to handle both text and images, it’s unclear if mixing pictures into training too early hurts their ability to understand text, or if waiting too long makes them worse at combining both types of data.
What's the solution?
The researchers trained models with images introduced at different stages of text pre-training. They found that introducing images after 80% of text pre-training gives roughly a 2% average improvement on vision-language tasks over adding them only after text training is fully complete, without hurting text skills.
Why it matters?
This helps build smarter AI assistants and tools that can work with both images and text, like chatbots that describe photos or apps that answer questions about diagrams, without sacrificing their ability to read or write well.
Abstract
Pre-trained LLMs that are further trained with image data perform well on vision-language tasks. While adding images during a second training phase effectively unlocks this capability, it is unclear how much of a gain or loss this two-step pipeline gives over VLMs which integrate images earlier into the training process. To investigate this, we train models spanning various datasets, scales, image-text ratios, and amount of pre-training done before introducing vision tokens. We then fine-tune these models and evaluate their downstream performance on a suite of vision-language and text-only tasks. We find that pre-training with a mixture of image and text data allows models to perform better on vision-language tasks while maintaining strong performance on text-only evaluations. On an average of 6 diverse tasks, we find that for a 1B model, introducing visual tokens 80% of the way through pre-training results in a 2% average improvement over introducing visual tokens to a fully pre-trained model.
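The core experimental variable is simple: at what fraction of pre-training steps do image tokens enter the data mixture? A minimal sketch of that schedule is below. The function name, step counts, and stream labels are illustrative assumptions, not the paper's actual code; the 0.8 default mirrors the best-performing setting reported in the abstract.

```python
def batch_source(step, total_steps, image_start_frac=0.8):
    """Return which data stream feeds the model at a given training step.

    Hypothetical helper: before `image_start_frac` of total steps, the
    model sees text-only batches; afterwards it sees a mixed image-text
    stream. image_start_frac=0.8 corresponds to introducing visual
    tokens 80% of the way through pre-training.
    """
    if step < image_start_frac * total_steps:
        return "text-only"
    return "image-text-mixture"

# Example schedule for a 100k-step run: the first 80k steps are
# text-only, and image data is mixed in for the final 20k steps.
total_steps = 100_000
schedule = [batch_source(s, total_steps) for s in (0, 79_999, 80_000)]
```

The paper's two-step baseline corresponds to `image_start_frac=1.0` (images only after pre-training finishes), and fully joint pre-training to `image_start_frac=0.0`.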