
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Jiuhai Chen, Jianwei Yang, Haiping Wu, Dianqi Li, Jianfeng Gao, Tianyi Zhou, Bin Xiao

2024-12-06


Summary

This paper introduces Florence-VL, a family of multimodal large language models that combines visual and text information more effectively by pairing a generative vision encoder with a method called Depth-Breadth Fusion, improving how machines understand images and language together.

What's the problem?

Traditional vision-language models often struggle to effectively combine visual features from images with text. Their vision encoders are typically trained with contrastive learning (as in CLIP), which is less flexible and can miss important visual details, making it hard for the model to perform well across a variety of tasks.

What's the solution?

Florence-VL uses a generative vision foundation model called Florence-2, which captures a wider range of visual features than contrastive encoders. The authors introduce Depth-Breadth Fusion (DBFusion), a technique that merges visual features taken from different layers of the encoder (depth) and extracted under multiple task prompts (breadth) to create richer representations. These fused features are fed into a pretrained language model, and the whole system is trained on diverse datasets, improving its performance on tasks like answering questions about images or understanding complex visuals.
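To make the idea of Depth-Breadth Fusion more concrete, here is a minimal PyTorch sketch. It assumes the fusion simply concatenates several feature branches along the channel dimension (for example, features from a lower encoder layer plus features produced under caption, OCR, and grounding prompts) and then projects them into the language model's embedding space with a small MLP. The module names, dimensions, and projector design are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class DBFusion(nn.Module):
    """Illustrative Depth-Breadth Fusion: channel-concatenate visual features
    from different encoder depths and different task prompts, then project
    them into the LLM's token-embedding space.

    The two-layer MLP projector and all dimensions are assumptions for this
    sketch, not the paper's released implementation.
    """
    def __init__(self, vis_dim: int, num_branches: int, llm_dim: int):
        super().__init__()
        self.projector = nn.Sequential(
            nn.Linear(vis_dim * num_branches, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, branch_feats: list[torch.Tensor]) -> torch.Tensor:
        # Each branch: (batch, num_tokens, vis_dim), e.g. features from a lower
        # encoder stage ("depth") and from caption/OCR/grounding prompts ("breadth").
        fused = torch.cat(branch_feats, dim=-1)  # (batch, num_tokens, vis_dim * num_branches)
        return self.projector(fused)             # (batch, num_tokens, llm_dim)

# Example: three hypothetical branches of 576 visual tokens with 1024 channels each,
# projected into a 3072-dimensional LLM embedding space.
if __name__ == "__main__":
    fusion = DBFusion(vis_dim=1024, num_branches=3, llm_dim=3072)
    feats = [torch.randn(2, 576, 1024) for _ in range(3)]
    visual_tokens = fusion(feats)
    print(visual_tokens.shape)  # torch.Size([2, 576, 3072])
```

One property of channel-wise concatenation, as in this sketch, is that the number of visual tokens stays fixed no matter how many depth or breadth branches are fused, so adding more branches does not lengthen the input sequence seen by the language model.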

Why it matters?

This research is significant because it enhances how machines interpret both images and text, leading to better performance in applications like visual question answering and image captioning. By open-sourcing their models and training methods, the authors hope to encourage further advancements in this field, making technology more capable of understanding the world around us.

Abstract

We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformer trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, which are more versatile and can be adapted to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training is composed of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders on vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. https://github.com/JiuhaiChen/Florence-VL
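The abstract describes a two-stage training recipe: end-to-end pretraining of the whole model, followed by finetuning of only the projection layer and the LLM. Below is a hedged sketch of how such stage-wise freezing could be configured; the submodule names (vision_encoder, projector, llm) are hypothetical placeholders for this illustration, not the released code's API.

```python
import torch

def set_trainable(module: torch.nn.Module, trainable: bool) -> None:
    # Toggle gradient computation for every parameter in a submodule.
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(model, stage: str) -> list:
    """Return the parameters to optimize for a given training stage.

    Assumes a hypothetical model with .vision_encoder, .projector (the fusion
    module), and .llm submodules.
    """
    if stage == "pretrain":
        # Stage 1: end-to-end pretraining of the whole model on image captions.
        for m in (model.vision_encoder, model.projector, model.llm):
            set_trainable(m, True)
    elif stage == "finetune":
        # Stage 2: freeze the vision encoder; tune only the projection layer
        # and the LLM on instruction-tuning pairs.
        set_trainable(model.vision_encoder, False)
        set_trainable(model.projector, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
    return [p for p in model.parameters() if p.requires_grad]

# Typical usage (model construction omitted):
# optimizer = torch.optim.AdamW(configure_stage(model, "finetune"), lr=2e-5)
```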