Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao
2024-12-06
Summary
This paper presents Florence-VL, a new family of multimodal large language models that improves how AI understands and reasons about images and text by combining rich visual features from a generative vision encoder with pretrained language models.
What's the problem?
Current vision-language models often struggle to integrate visual information from images with text effectively. Most rely on contrastively trained encoders such as CLIP, which capture high-level image semantics well but miss much of the fine-grained visual detail, limiting performance on tasks that require understanding both images and text at multiple levels.
What's the solution?
Florence-VL addresses this issue by using Florence-2, a generative vision foundation model that can produce a much wider variety of visual features. The researchers developed a new method called 'depth-breadth fusion' (DBFusion) to combine visual features taken from different encoder depths (levels of detail) and extracted under different task prompts (breadth). This lets the model draw on both fine-grained details and broader context when interpreting images, as sketched below. Training proceeds in two stages: the whole model is first pretrained end to end, and then the projection layer and language model are fine-tuned on diverse datasets that include high-quality image captions and instruction-following pairs.
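To make the idea concrete, here is a minimal PyTorch-style sketch of depth-breadth fusion. It assumes a simple design in which per-depth and per-prompt feature maps are concatenated along the channel dimension and passed through an MLP projector into the language model's embedding space; the class name DBFusion, the dimensions, and the three-source setup are illustrative placeholders, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DBFusion(nn.Module):
    """Illustrative depth-breadth fusion: concatenate visual feature maps taken
    from several encoder depths and several task prompts along the channel
    dimension, then project them into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int, num_sources: int, llm_dim: int):
        super().__init__()
        # num_sources = number of feature maps being fused (depths + prompts)
        self.projector = nn.Sequential(
            nn.Linear(vision_dim * num_sources, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        # features: list of [batch, num_tokens, vision_dim] maps, one per
        # depth/prompt; channel-wise concat keeps the token count fixed.
        fused = torch.cat(features, dim=-1)   # [B, N, vision_dim * num_sources]
        return self.projector(fused)          # [B, N, llm_dim] visual tokens

# Toy usage with three hypothetical feature sources (e.g., a lower-layer map
# plus features produced under a captioning prompt and an OCR prompt).
if __name__ == "__main__":
    B, N, D, LLM_D = 2, 576, 1024, 3072
    sources = [torch.randn(B, N, D) for _ in range(3)]
    fusion = DBFusion(vision_dim=D, num_sources=3, llm_dim=LLM_D)
    visual_tokens = fusion(sources)           # prepended to text tokens by the LLM
    print(visual_tokens.shape)                # torch.Size([2, 576, 3072])
```

One practical appeal of channel-wise concatenation is that the number of visual tokens stays fixed, so the language model's input sequence does not grow as more depths or prompts are fused.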
Why it matters?
This research is important because it enhances the capabilities of AI in understanding complex visual and textual information together. By improving how these models work, Florence-VL can lead to better performance in tasks like visual question answering, object recognition, and more. This has significant implications for applications in fields such as education, entertainment, and automated content creation, where accurate interpretation of both images and text is crucial.
Abstract
We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and can be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. https://github.com/JiuhaiChen/Florence-VL
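The two-stage recipe the abstract describes (end-to-end pretraining of the whole model, then finetuning only the projection layer and the LLM) can be summarized in a short sketch. The attribute names vision_encoder, projector, and llm below are placeholders, not the released code's actual module names.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Toggle gradient updates for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: end-to-end pretraining -- vision encoder, projector, and LLM all updated.
def configure_pretraining(model) -> None:
    set_trainable(model.vision_encoder, True)   # Florence-2 backbone
    set_trainable(model.projector, True)        # DBFusion projection layer
    set_trainable(model.llm, True)              # e.g., Phi 3.5 or Llama 3

# Stage 2: instruction finetuning -- freeze the vision encoder, keep updating
# the projection layer and the LLM on caption/instruction data.
def configure_finetuning(model) -> None:
    set_trainable(model.vision_encoder, False)
    set_trainable(model.projector, True)
    set_trainable(model.llm, True)
```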