Building and better understanding vision-language models: insights and future directions
Hugo Laurençon, Andrés Marafioti, Victor Sanh, Léo Tronchon
2024-08-26

Summary
This paper discusses how to build and improve vision-language models (VLMs), which are systems that can understand both images and text, and provides insights into their future development.
What's the problem?
The development of VLMs is still in its early stages, and there is no clear agreement on the best methods for training these models, including what data to use and how to structure the models. This lack of consensus makes it harder for researchers to build effective systems.
What's the solution?
The authors provide a detailed overview of current techniques for building VLMs, pointing out the strengths and weaknesses of each. They also present a new model, Idefics3-8B, which significantly outperforms its predecessor, Idefics2-8B. The new model is trained in part on Docmatix, a document-understanding dataset 240 times larger than previously available ones, which improves its ability to read and interpret documents. The paper also walks through the practical steps for building this model efficiently using only open datasets.
Why it matters?
This research is important because it helps standardize the process of creating VLMs, making it easier for others in the field to develop similar systems. By sharing insights, the model, and the training datasets, the authors aim to advance the technology further, which can lead to better applications in areas like image recognition, document understanding, and visual question answering.
Abstract
The field of vision-language models (VLMs), which take images and texts as inputs and output texts, is rapidly evolving and has yet to reach consensus on several key aspects of the development pipeline, including data, architecture, and training methods. This paper can be seen as a tutorial for building a VLM. We begin by providing a comprehensive overview of the current state-of-the-art approaches, highlighting the strengths and weaknesses of each, addressing the major challenges in the field, and suggesting promising research directions for underexplored areas. We then walk through the practical steps to build Idefics3-8B, a powerful VLM that significantly outperforms its predecessor Idefics2-8B, while being trained efficiently, exclusively on open datasets, and using a straightforward pipeline. These steps include the creation of Docmatix, a dataset for improving document understanding capabilities, which is 240 times larger than previously available datasets. We release the model along with the datasets created for its training.