POINTS: Improving Your Vision-language Model with Affordable Strategies
Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou
2024-09-10

Summary
This paper introduces POINTS, a set of affordable strategies for improving vision-language models, which are AI systems that understand both images and text.
What's the problem?
While vision-language models have made great progress in tasks like recognizing text in images and solving geometric problems, they still face several challenges. These include a lack of transparency in proprietary models, poorly explored training data in open-source models, and diminishing returns when simply adding more datasets for fine-tuning.
What's the solution?
To tackle these issues, the authors first built a robust baseline model using the latest advancements in vision-language technology, validating each improvement with ablations. They then filtered the pre-training data by perplexity, keeping only the lowest-perplexity examples to create a curated dataset of 1 million samples, which improved performance. Additionally, they used a technique called 'model soup' during visual instruction tuning, averaging the weights of models fine-tuned on different dataset combinations, which yielded a 9-billion-parameter model that competes with the best existing models.
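As a rough illustration of the perplexity-based filtering idea, the sketch below scores each pre-training sample with a small language model and keeps the lowest-perplexity examples. The scoring model (`gpt2`), the `caption` field, and the 1M cutoff are placeholders for illustration, not details taken from the paper.

```python
# Minimal sketch of perplexity-based data filtering (illustrative only, not
# the authors' exact pipeline). A small causal LM scores each caption; the
# lowest-perplexity samples are kept for pre-training.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder scoring model, not the one used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of a single text under the scoring model."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    # The loss returned with labels=ids is the mean negative log-likelihood
    # per token, so exp(loss) gives perplexity.
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

def filter_lowest_perplexity(samples, keep=1_000_000):
    """Keep the `keep` samples whose captions score the lowest perplexity."""
    scored = [(perplexity(s["caption"]), s) for s in samples]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:keep]]
```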
Why it matters?
This research is important because it provides affordable and efficient strategies to enhance vision-language models. By improving how these models are trained and fine-tuned, they can perform better in real-world applications, making them more useful for tasks that require understanding both images and text.
Abstract
In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
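As a rough illustration of the model-soup step mentioned in the abstract, the sketch below uniformly averages the weights of several checkpoints fine-tuned on different instruction-tuning dataset mixes. The uniform averaging and the checkpoint loading shown here are assumptions for illustration; the paper may select and combine checkpoints differently.

```python
# Minimal sketch of 'model soup': uniformly averaging the weights of several
# checkpoints fine-tuned on different instruction-tuning dataset mixes
# (illustrative; not necessarily the authors' exact recipe).
import torch

def model_soup(state_dicts):
    """Average a list of state dicts that share identical keys and shapes."""
    averaged = {}
    for key in state_dicts[0]:
        # Stack the same parameter from every checkpoint and take the mean.
        averaged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return averaged

# Example usage (paths are hypothetical):
# soup = model_soup([torch.load(p, map_location="cpu") for p in checkpoint_paths])
# model.load_state_dict(soup)
```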