
VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, Dahua Lin, Kai Chen

2024-07-17


Summary

This paper presents VLMEvalKit, an open-source toolkit designed to help researchers and developers evaluate large multi-modality models, i.e., models that can process and understand several types of data, such as text and images.

What's the problem?

Evaluating large multi-modality models is complicated and time-consuming because each benchmark typically comes with its own data preparation, inference, and scoring steps. This makes it hard to produce consistent, reproducible results and slows down progress in the field.

What's the solution?

VLMEvalKit simplifies evaluation by providing a user-friendly framework that supports over 70 multi-modality models (both proprietary APIs and open-source models) and more than 20 benchmarks. New models are added by implementing a single interface, and the toolkit automatically handles the rest of the pipeline: data preparation, distributed inference, prediction post-processing, and metric calculation. This lets researchers focus on their models rather than on evaluation plumbing. The design is also flexible enough to incorporate additional modalities, such as audio and video, in future updates.
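To make the "single interface" idea concrete, here is a minimal, self-contained sketch of what adding a model might look like. The class name, method name, and message format below are illustrative assumptions rather than VLMEvalKit's exact API; the point is that a contributor supplies one prediction function, and the evaluation harness drives everything else.

```python
# Hypothetical sketch of the "single interface" pattern. Names here
# (MyVLM, generate, the message format) are illustrative assumptions,
# not VLMEvalKit's exact API. The contributor writes one prediction
# method; the harness handles data preparation, inference, and scoring.

class MyVLM:
    """A toy vision-language model wrapper exposing one method."""

    def generate(self, message: list[dict]) -> str:
        # `message` is an interleaved list of image/text items, e.g.
        # [{"type": "image", "value": "cat.jpg"},
        #  {"type": "text",  "value": "What animal is this?"}]
        prompt = " ".join(m["value"] for m in message if m["type"] == "text")
        # A real model would run inference here; we return a stub answer.
        return f"(model answer to: {prompt})"


def evaluate(model: MyVLM, samples: list[list[dict]]) -> list[str]:
    """Stand-in for the toolkit's loop: feed benchmark samples to the
    model's single interface and collect predictions for scoring."""
    return [model.generate(sample) for sample in samples]


if __name__ == "__main__":
    demo_samples = [
        [{"type": "image", "value": "cat.jpg"},
         {"type": "text", "value": "What animal is this?"}],
    ]
    print(evaluate(MyVLM(), demo_samples))
```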

Why it matters?

This research is important because it provides a valuable resource for the AI community, making it easier to evaluate and compare different models. By improving how we assess multi-modality models, VLMEvalKit can help accelerate advancements in AI technologies that rely on understanding multiple types of information, which is essential for applications in areas like computer vision, natural language processing, and robotics.

Abstract

We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework for researchers and developers to evaluate existing multi-modality models and publish reproducible evaluation results. In VLMEvalKit, we implement over 70 different large multi-modality models, including both proprietary APIs and open-source models, as well as more than 20 different multi-modal benchmarks. By implementing a single interface, new models can be easily added to the toolkit, while the toolkit automatically handles the remaining workloads, including data preparation, distributed inference, prediction post-processing, and metric calculation. Although the toolkit is currently mainly used for evaluating large vision-language models, its design is compatible with future updates that incorporate additional modalities, such as audio and video. Based on the evaluation results obtained with the toolkit, we host OpenVLM Leaderboard, a comprehensive leaderboard to track the progress of multi-modality learning research. The toolkit is released at https://github.com/open-compass/VLMEvalKit and is actively maintained.
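As a rough mental model of the workflow the abstract describes, the sketch below strings together the four automated stages: data preparation, inference, prediction post-processing, and metric calculation. The function names are hypothetical stand-ins, not VLMEvalKit's actual modules, and the real toolkit additionally shards the inference stage across processes and GPUs.

```python
# Hypothetical outline of the four-stage workflow named in the abstract.
# Function names are illustrative stand-ins, not VLMEvalKit's real modules.

def prepare_data(benchmark: str) -> list[dict]:
    """Stage 1: fetch/cache the benchmark and normalize it into samples
    with questions (and images) plus reference answers."""
    return [{"question": "2 + 2 = ?", "answer": "4"}]  # toy sample


def run_inference(model, samples: list[dict]) -> list[str]:
    """Stage 2: query the model on every sample (the real toolkit
    distributes this step across processes/GPUs)."""
    return [model(sample["question"]) for sample in samples]


def post_process(raw_predictions: list[str]) -> list[str]:
    """Stage 3: clean up free-form outputs, e.g. extract the final answer."""
    return [pred.strip().split()[-1] for pred in raw_predictions]


def compute_metrics(predictions: list[str], samples: list[dict]) -> float:
    """Stage 4: score predictions against references (accuracy here)."""
    correct = sum(p == s["answer"] for p, s in zip(predictions, samples))
    return correct / len(samples)


if __name__ == "__main__":
    samples = prepare_data("toy-benchmark")
    raw = run_inference(lambda q: "I think the answer is 4", samples)
    print(compute_metrics(post_process(raw), samples))  # prints 1.0
```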