MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu

2024-08-06

Summary

This paper introduces MiniCPM-V, a series of efficient multimodal large language models (MLLMs) that run directly on mobile devices while delivering performance comparable to advanced proprietary models like GPT-4V.

What's the problem?

Most capable multimodal models require substantial computing power and memory, so they typically run on powerful cloud servers. This makes them hard to use in everyday situations, such as on smartphones or in places with limited internet access, and it raises concerns about privacy and energy consumption.

What's the solution?

MiniCPM-V addresses these challenges by being designed specifically for end-side devices, such as mobile phones. It integrates recent techniques in model architecture, pretraining, and alignment, allowing it to perform well while using far fewer resources. The latest model, MiniCPM-Llama3-V 2.5, offers high-resolution image perception (up to 1.8 million pixels at any aspect ratio), strong OCR, support for more than 30 languages, and trustworthy behavior with low hallucination rates. In practice, this means it can understand images and text together without relying on a powerful server.
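To make that concrete, here is a minimal inference sketch using the Hugging Face transformers library. The model ID openbmb/MiniCPM-Llama3-V-2_5 and the custom chat() interface follow the project's public release, but exact argument names can vary between versions, so treat this as an illustration rather than a definitive recipe.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# Load MiniCPM-Llama3-V 2.5 from the Hugging Face Hub. trust_remote_code
# pulls in the model's custom multimodal chat implementation.
model_id = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).eval()
model = model.to("cuda")  # GPU for this demo; phone builds use quantized runtimes
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Ask a question about a local image (the file path is a placeholder).
image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": "What is shown in this image?"}]

# chat() is provided by the model's remote code, not core transformers;
# the argument names here follow the project's published examples.
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer)
print(answer)
```

For actual on-phone deployment, the project ships quantized builds (e.g., 4-bit) served by lightweight runtimes rather than this full-precision transformers path.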

Why it matters?

This research matters because it makes advanced AI more accessible by bringing it to everyday devices. Running powerful multimodal models directly on smartphones opens up new applications in areas like personal assistants, education, and entertainment, and makes these tools usable even where cloud access is limited.

Abstract

The recent surge of Multimodal Large Language Models (MLLMs) has fundamentally reshaped the landscape of AI research and industry, shedding light on a promising path toward the next AI milestone. However, significant challenges remain preventing MLLMs from being practical in real-world applications. The most notable challenge comes from the huge cost of running an MLLM with a massive number of parameters and extensive computation. As a result, most MLLMs need to be deployed on high-performing cloud servers, which greatly limits their application scopes such as mobile, offline, energy-sensitive, and privacy-protective scenarios. In this work, we present MiniCPM-V, a series of efficient MLLMs deployable on end-side devices. By integrating the latest MLLM techniques in architecture, pretraining and alignment, the latest MiniCPM-Llama3-V 2.5 has several notable features: (1) Strong performance, outperforming GPT-4V-1106, Gemini Pro and Claude 3 on OpenCompass, a comprehensive evaluation over 11 popular benchmarks, (2) strong OCR capability and 1.8M pixel high-resolution image perception at any aspect ratio, (3) trustworthy behavior with low hallucination rates, (4) multilingual support for 30+ languages, and (5) efficient deployment on mobile phones. More importantly, MiniCPM-V can be viewed as a representative example of a promising trend: The model sizes for achieving usable (e.g., GPT-4V) level performance are rapidly decreasing, along with the fast growth of end-side computation capacity. This jointly shows that GPT-4V level MLLMs deployed on end devices are becoming increasingly possible, unlocking a wider spectrum of real-world AI applications in the near future.
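The "1.8M pixel high-resolution image perception at any aspect ratio" claim comes from the paper's adaptive visual encoding, which partitions a large image into slices sized to match the vision encoder's pretraining setting (note that 9 slices of 448x448 pixels is roughly 1.8M pixels). The sketch below illustrates the general idea; the 448x448 slice size, the 9-slice budget, and the scoring rule are illustrative assumptions, not the paper's exact algorithm.

```python
import math

# Simplified sketch of a pixel-budget slicing scheme in the spirit of
# MiniCPM-V's adaptive visual encoding. Constants are assumptions.
SLICE = 448        # assumed ViT input resolution per slice
MAX_SLICES = 9     # 9 * 448^2 ~= 1.8M pixels

def choose_grid(width: int, height: int) -> tuple[int, int]:
    """Pick a (cols, rows) grid whose slices best match the encoder's input shape."""
    ideal = min(MAX_SLICES, max(1, round(width * height / SLICE**2)))
    best, best_score = (1, 1), float("inf")
    # Search grids whose slice count is near the ideal number.
    for n in {max(1, ideal - 1), ideal, min(MAX_SLICES, ideal + 1)}:
        for cols in range(1, n + 1):
            if n % cols:
                continue
            rows = n // cols
            # Each slice covers (width/cols) x (height/rows); score how far
            # its aspect ratio is from the square ViT input (log-ratio distance).
            score = abs(math.log((width / cols) / (height / rows)))
            if score < best_score:
                best, best_score = (cols, rows), score
    return best

# A 1920x1080 screenshot maps to a 4x2 grid (8 slices covering 480x540 each).
print(choose_grid(1920, 1080))
```

Each slice is then resized to the encoder's input resolution and encoded separately, which is how an image at an arbitrary aspect ratio can be perceived without distorting it into a single fixed square.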