PaliGemma 2: A Family of Versatile VLMs for Transfer

Andreas Steiner, André Susano Pinto, Michael Tschannen, Daniel Keysers, Xiao Wang, Yonatan Bitton, Alexey Gritsenko, Matthias Minderer, Anthony Sherbondy, Shangbang Long, Siyang Qin, Reeve Ingle, Emanuele Bugliarello, Sahar Kazemzadeh, Thomas Mesnard, Ibrahim Alabdulmohsin, Lucas Beyer, Xiaohua Zhai

2024-12-05

Summary

This paper introduces PaliGemma 2, an upgraded version of the PaliGemma vision-language model built on the Gemma 2 family, which improves how AI understands images and generates text about them.

What's the problem?

As AI models that handle both visual and textual information advance, there is a growing need for systems that transfer knowledge efficiently across different tasks. Existing models often struggle with this transferability, making them hard to apply across diverse applications, from recognizing objects in images to generating detailed descriptions.

What's the solution?

PaliGemma 2 addresses these challenges by combining a powerful vision encoder (SigLIP-So400m) with the full range of Gemma 2 language models, from 2B to 27B parameters. The models are trained at three resolutions (224px, 448px, and 896px) in multiple stages, producing a family of base models that can be fine-tuned to adapt to many downstream tasks. The paper also broadens the set of transfer tasks well beyond the original PaliGemma, adding OCR-related tasks such as table structure recognition, molecular structure recognition, and music score recognition, as well as long fine-grained captioning and radiography report generation.
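To make the encoder-plus-language-model combination concrete, here is a minimal sketch of how a PaliGemma-style VLM fuses the two modalities: patch embeddings from the vision encoder are linearly projected into the language model's embedding space and prepended to the text tokens as one prefix sequence. All dimensions and the random weights below are illustrative assumptions, not the real model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

num_image_tokens = 256   # e.g. a 224px image split into a 16x16 patch grid
vision_dim = 1152        # assumed vision-encoder output width
llm_dim = 2048           # assumed language-model embedding width

# 1) The vision encoder turns the image into a sequence of patch embeddings.
image_embeddings = rng.standard_normal((num_image_tokens, vision_dim))

# 2) A linear projection maps them into the LLM's embedding space.
projection = rng.standard_normal((vision_dim, llm_dim)) * 0.02
projected = image_embeddings @ projection  # shape (256, 2048)

# 3) The text prompt is embedded by the LLM as usual (8 tokens here).
text_embeddings = rng.standard_normal((8, llm_dim))

# 4) Image tokens are prepended to the text, forming one prefix sequence
#    that the language model attends over when generating its answer.
prefix = np.concatenate([projected, text_embeddings], axis=0)
print(prefix.shape)
```

Fine-tuning for a new task then amounts to training this combined stack on task-specific image-text pairs, which is why one shared family of base models can serve so many transfer targets.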

Why it matters?

This research is significant because it enhances the versatility and performance of AI systems in understanding visual content alongside text. By improving how these models can transfer knowledge between tasks, PaliGemma 2 opens up new possibilities for applications in fields such as healthcare, education, and creative industries, where accurate interpretation of visual data is crucial.

Abstract

PaliGemma 2 is an upgrade of the PaliGemma open Vision-Language Model (VLM) based on the Gemma 2 family of language models. We combine the SigLIP-So400m vision encoder that was also used by PaliGemma with the whole range of Gemma 2 models, from the 2B one all the way up to the 27B model. We train these models at three resolutions (224px, 448px, and 896px) in multiple stages to equip them with broad knowledge for transfer via fine-tuning. The resulting family of base models covering different model sizes and resolutions allows us to investigate factors impacting transfer performance (such as learning rate) and to analyze the interplay between the type of task, model size, and resolution. We further increase the number and breadth of transfer tasks beyond the scope of PaliGemma including different OCR-related tasks such as table structure recognition, molecular structure recognition, music score recognition, as well as long fine-grained captioning and radiography report generation, on which PaliGemma 2 obtains state-of-the-art results.
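The three training resolutions directly set how many image tokens the language model must process. Assuming the 14-pixel patch size that SigLIP-So400m uses, the token count grows quadratically with resolution, which is a quick sanity check on the compute trade-off the abstract implies:

```python
# Image-token count per resolution, assuming 14x14-pixel patches
# (the patch size of SigLIP-So400m); a sketch, not official code.
PATCH = 14

def num_image_tokens(resolution_px: int, patch: int = PATCH) -> int:
    """Number of patch tokens for a square image at the given resolution."""
    side = resolution_px // patch
    return side * side

for res in (224, 448, 896):
    print(res, num_image_tokens(res))
# 224px -> 256 tokens, 448px -> 1024, 896px -> 4096: ~4x cost per step up
```

This is one reason higher resolutions help most on fine-grained tasks such as OCR, where the extra tokens carry genuinely new detail, while smaller, cheaper resolutions suffice for coarse recognition.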