BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Xudong Lu, Yinghao Chen, Cheng Chen, Hui Tan, Boheng Chen, Yina Xie, Rui Hu, Guanxin Tan, Renshou Wu, Yan Hu, Yi Zeng, Lei Wu, Liuyang Bian, Zhaoxiong Wang, Long Liu, Yanzhou Yang, Han Xiao, Aojun Zhou, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
2024-11-19

Summary
This paper introduces BlueLM-V-3B, a multimodal large language model (MLLM) designed to run efficiently on mobile devices, bringing advanced capabilities such as joint understanding of text and images to smartphones.
What's the problem?
As multimodal large language models (MLLMs) grow in popularity, there is increasing demand to deploy them directly on mobile phones. However, mobile devices have limited memory and computational power, making it difficult to run these complex models smoothly and in real time without extensive optimization.
What's the solution?
The authors developed BlueLM-V-3B through an algorithm and system co-design targeted at mobile platforms: they redesigned the dynamic resolution scheme that mainstream MLLMs use to handle image inputs and implemented hardware-aware system optimizations for on-device inference. The model is compact, with a 2.7-billion-parameter language model and a 400-million-parameter vision encoder, and it reaches a generation speed of 24.4 tokens per second on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. Despite its small size, it attains the highest average score (66.1) on the OpenCompass benchmark among models with at most 4B parameters, surpassing several much larger models; a rough sketch of why the small footprint matters follows below.
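To make the size claim concrete, here is a minimal back-of-the-envelope sketch of the weight-storage footprint implied by these numbers. The parameter counts and the 4-bit LLM quantization come from the paper; storing the vision encoder at FP16 is an assumption for illustration, since the precision of the vision encoder is not stated here.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight-storage footprint in gigabytes (weights only,
    ignoring activations, KV cache, and runtime buffers)."""
    return num_params * bits_per_weight / 8 / 1e9

# Parameter counts and 4-bit LLM quantization are from the paper;
# FP16 for the vision encoder is an assumption for illustration.
llm_gb = weight_footprint_gb(2.7e9, 4)    # ~1.35 GB
vit_gb = weight_footprint_gb(400e6, 16)   # ~0.80 GB (assumed FP16)
print(f"LLM (4-bit):   {llm_gb:.2f} GB")
print(f"Vision (FP16): {vit_gb:.2f} GB")
print(f"Total weights: {llm_gb + vit_gb:.2f} GB")
```

At a little over 2 GB of weights under these assumptions, the model fits comfortably within the memory budget of a modern flagship phone, which is what makes on-device deployment feasible.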
Why it matters?
This research is important because it enables advanced AI capabilities on everyday devices like smartphones. By making powerful multimodal models accessible on mobile platforms, BlueLM-V-3B can enhance various applications such as communication, learning, and problem-solving in daily life, making technology more integrated into our routines.
Abstract
The emergence and growing popularity of multimodal large language models (MLLMs) have significant potential to enhance various aspects of daily life, from improving communication to facilitating learning and problem-solving. Mobile phones, as essential daily companions, represent the most effective and accessible deployment platform for MLLMs, enabling seamless integration into everyday tasks. However, deploying MLLMs on mobile phones presents challenges due to limitations in memory size and computational capability, making it difficult to achieve smooth and real-time processing without extensive optimization. In this paper, we present BlueLM-V-3B, an algorithm and system co-design approach specifically tailored for the efficient deployment of MLLMs on mobile platforms. To be specific, we redesign the dynamic resolution scheme adopted by mainstream MLLMs and implement system optimization for hardware-aware deployment to optimize model inference on mobile phones. BlueLM-V-3B boasts the following key highlights: (1) Small Size: BlueLM-V-3B features a language model with 2.7B parameters and a vision encoder with 400M parameters. (2) Fast Speed: BlueLM-V-3B achieves a generation speed of 24.4 token/s on the MediaTek Dimensity 9300 processor with 4-bit LLM weight quantization. (3) Strong Performance: BlueLM-V-3B has attained the highest average score of 66.1 on the OpenCompass benchmark among models with ≤4B parameters and surpassed a series of models with much larger parameter sizes (e.g., MiniCPM-V-2.6, InternVL2-8B).
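The abstract's central algorithmic point is the redesigned dynamic resolution scheme. The abstract does not spell out the new rule, but mainstream MLLMs (e.g., LLaVA-NeXT-style pipelines) typically tile a high-resolution image into fixed-size patches by choosing a grid that matches the image's aspect ratio. The sketch below illustrates that generic baseline so the design space is concrete; the grid candidates, patch size, and scoring rule are illustrative assumptions, not BlueLM-V-3B's actual method.

```python
from typing import List, Tuple

def choose_grid(w: int, h: int,
                candidates: List[Tuple[int, int]],
                patch: int = 384) -> Tuple[int, int]:
    """Pick the (cols, rows) tile grid whose canvas best fits the image.

    Generic LLaVA-NeXT-style aspect-ratio matching: favor grids that
    preserve image area after resizing and penalize wasted canvas.
    A baseline illustration, not BlueLM-V-3B's redesigned scheme.
    """
    best, best_score = candidates[0], float("-inf")
    for cols, rows in candidates:
        cw, ch = cols * patch, rows * patch   # canvas size for this grid
        scale = min(cw / w, ch / h)           # fit the image inside the canvas
        effective = (scale ** 2) * w * h      # image area after resizing
        waste = cw * ch - effective           # padded, contentless canvas area
        score = effective - 0.5 * waste       # reward coverage, penalize waste
        if score > best_score:
            best, best_score = (cols, rows), score
    return best

# A 1280x720 (16:9) image is best served by a wide grid.
grids = [(1, 1), (2, 1), (1, 2), (2, 2), (3, 1), (1, 3)]
print(choose_grid(1280, 720, grids))  # -> (2, 1)
```

Each selected tile is then encoded separately by the vision encoder, so the choice of grid directly controls how many image tokens the language model must process, which is why redesigning this scheme matters for on-device latency.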