
Let ViT Speak: Generative Language-Image Pre-training

Yan Fang, Mengcheng Lan, Zilong Huang, Weixian Lei, Yunqing Zhao, Yujie Zhong, Yingchen Yu, Qi She, Yao Zhao, Yunchao Wei

2026-05-04


Summary

This paper introduces Generative Language-Image Pre-training (GenLIP), a new way to train the vision component of AI systems that understand both images and text.

What's the problem?

Current methods for teaching computers to 'see' images well enough to work with large language models are often complicated: they demand huge amounts of data or special contrastive techniques that compare batches of images against text. They also fail to take advantage of how language models naturally work, which is by predicting the next word in a sequence.

What's the solution?

GenLIP simplifies things by training the image-processing component, a Vision Transformer, to directly predict text tokens from image tokens, just as a language model predicts the next word. A single transformer handles both images and text, so there is no need for contrastive comparisons or an extra text-decoding component. The authors trained this system on a huge dataset of image-caption pairs (8 billion samples from Recap-DataComp-1B) and then continued pre-training on images of different sizes at their native aspect ratios.
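
To make the idea concrete, here is a minimal PyTorch sketch of this kind of generative language-image pre-training. Everything in it is illustrative, not the paper's implementation: the `GenerativeLIP` class, the toy dimensions, the mask layout, and the random "image patches" and "caption tokens" are all stand-ins. What it demonstrates is the core recipe described above: one transformer attends over visual and text tokens together and is trained with an ordinary next-token prediction loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, PATCHES, TXT_LEN = 1000, 256, 16, 12

class GenerativeLIP(nn.Module):
    """Toy stand-in: one transformer over concatenated image + text tokens."""
    def __init__(self):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, DIM)   # ViT-style patch projection
        self.tok_embed = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, patches, text):
        v = self.patch_embed(patches)                    # (B, PATCHES, DIM)
        t = self.tok_embed(text)                         # (B, TXT_LEN, DIM)
        x = torch.cat([v, t], dim=1)                     # one joint sequence
        n = x.size(1)
        # Prefix-LM-style mask: image tokens see each other, the caption is
        # causal, and image tokens never peek ahead at the caption.
        mask = torch.zeros(n, n)
        mask[:PATCHES, PATCHES:] = float("-inf")
        causal = torch.triu(torch.ones(TXT_LEN, TXT_LEN), diagonal=1).bool()
        mask[PATCHES:, PATCHES:][causal] = float("-inf")
        h = self.blocks(x, mask=mask)
        return self.lm_head(h[:, PATCHES:])              # logits at text positions

model = GenerativeLIP()
patches = torch.randn(2, PATCHES, 3 * 16 * 16)           # fake image patches
text = torch.randint(0, VOCAB, (2, TXT_LEN))             # fake caption tokens
logits = model(patches, text)
# Standard next-token loss: position i predicts caption token i + 1.
loss = F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB), text[:, 1:].reshape(-1))
loss.backward()
print(f"toy LM loss: {loss.item():.3f}")
```

In the real system the patch projection is a full Vision Transformer and the data is billions of captioned images, but the training signal is the same: a plain cross-entropy loss on the caption, with no contrastive batch construction.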

Why it matters?

This research is important because it offers a simpler and more efficient way to build AI systems that understand both images and text. GenLIP performs as well as, or better than, existing methods while using substantially less pre-training data, and it excels at detail-sensitive tasks, such as reading text within images or understanding charts. This makes it a promising foundation for the vision encoders of future multimodal AI models.

Abstract

In this paper, we present Generative Language-Image Pre-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) Simplicity: a single transformer jointly models visual and textual tokens; (2) Scalability: it scales effectively with both data and model size; and (3) Performance: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
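
In equation form, the objective the abstract describes is the standard autoregressive language-modeling loss, conditioned on the image. Using our own notation rather than the paper's, with caption tokens t_1, ..., t_N and visual tokens v_1, ..., v_M:

$$\mathcal{L}_{\text{LM}} = -\sum_{i=1}^{N} \log p_\theta\!\left(t_i \,\middle|\, t_{<i},\; v_1, \dots, v_M\right)$$

A single transformer with parameters theta models both token streams, which is what removes the need for a contrastive loss or a separate text decoder.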