AndesVL Technical Report: An Efficient Mobile-side Multimodal Large Language Model

Zhiwei Jin, Xiaohui Song, Nan Wang, Yafei Liu, Chao Li, Xin Li, Ruichen Wang, Zhihao Li, Qi Qi, Long Cheng, Dongze Hao, Quanlong Zheng, Yanhao Zhang, Haobo Ji, Jian Ma, Zhitong Zheng, Zhenyi Lin, Haolin Deng, Xin Zou, Xiaojie Yin, Ruilin Wang, Liankai Cai

2025-10-14

Summary

This paper introduces AndesVL, a new family of AI models designed to understand both images and text, specifically built to run efficiently on mobile phones.

What's the problem?

Current powerful AI models that can understand both images and text, such as those from Google and OpenAI, are extremely large and demand a lot of computing power, memory, and energy. This makes them impractical to run directly on devices like smartphones, which have limited resources.

What's the solution?

The researchers created AndesVL, a set of smaller AI models ranging from 0.6 billion to 4 billion parameters, built on the Qwen3 family of language models paired with efficient visual encoders. They carefully designed the models' architecture, training pipeline, and training data, and used a technique called LoRA to adapt the models efficiently without retraining all of their parameters. The resulting models perform well on a variety of tasks, such as reading images with lots of text, solving visual reasoning problems, and even understanding what is happening in phone screenshots.
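To make the LoRA idea concrete, here is a minimal NumPy sketch of the general technique (a simplified illustration, not AndesVL's specific "1+N" LoRA setup): instead of updating a full pretrained weight matrix, LoRA freezes it and learns two small low-rank matrices whose product is added to it, drastically shrinking the number of trainable parameters.

```python
import numpy as np

# Frozen pretrained weight W (d_out x d_in); LoRA adds a low-rank update
# (alpha / r) * B @ A, where A is (r x d_in) and B is (d_out x r), r << d.
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # zero-init: adapter starts as a no-op

def lora_forward(x):
    """Forward pass through the frozen weight plus the low-rank adapter."""
    return x @ (W + (alpha / r) * B @ A).T

x = rng.standard_normal((2, d_in))
y = lora_forward(x)

# Trainable parameters drop from d_out*d_in to r*(d_in + d_out).
full_params = d_out * d_in        # 4096
lora_params = r * (d_in + d_out)  # 512
print(full_params, lora_params)
```

Because B starts at zero, the adapted model initially behaves exactly like the frozen base model, and training only touches A and B. This is one reason LoRA suits mobile deployment: a single shared base model can be specialized with many tiny adapters.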

Why it matters?

AndesVL is important because it brings advanced image and text understanding capabilities to mobile devices. This means you could have AI features like smart photo organization, visual assistance, or even more interactive apps directly on your phone, without needing to send your data to the cloud. It opens the door for more powerful and private AI experiences on the devices we use every day.

Abstract

In recent years, while cloud-based MLLMs such as QwenVL, InternVL, GPT-4o, Gemini, and Claude Sonnet have demonstrated outstanding performance with enormous model sizes reaching hundreds of billions of parameters, their resource demands far exceed the memory, power-consumption, and computing limits of edge devices such as mobile phones. This paper introduces AndesVL, a suite of mobile-side MLLMs with 0.6B to 4B parameters based on Qwen3's LLM and various visual encoders. We comprehensively outline the model architectures, training pipeline, and training data of AndesVL, which achieves first-tier performance across a wide range of open-source benchmarks, including fields such as text-rich image understanding, reasoning and math, multi-image comprehension, general VQA, hallucination mitigation, multilingual understanding, and GUI-related tasks, when compared with state-of-the-art models of a similar scale. Furthermore, we introduce a 1+N LoRA …