SAIL-VL2 Technical Report
Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
2025-09-18
Summary
This paper introduces SAIL-VL2, a vision-language model designed to understand images and text together, allowing it to reason about what it 'sees' and 'reads'. It is a powerful tool for tasks requiring multimodal understanding.
What's the problem?
Existing vision-language models often struggle with complex reasoning tasks and frequently trade off model size against performance inefficiently. Building a model that can accurately interpret images and text, perform detailed analysis, and tackle challenging problems like visual question answering or image-based math problems is a significant hurdle. Moreover, many strong models are not openly available for others to build upon.
What's the solution?
The researchers tackled this with three main strategies. First, they collected and carefully cleaned a large amount of image and video data, using scoring and filtering to ensure high quality and broad coverage across scenarios. Second, they trained the model in stages, starting from a strong pre-trained vision encoder (SAIL-ViT) and progressively adding language processing and the ability to combine the two modalities. Finally, they used a sparse Mixture-of-Experts design, which activates only a subset of the model's parameters for each input, so the model can be very powerful without becoming unnecessarily large and slow. They also combined supervised fine-tuning with reinforcement learning to strengthen its reasoning abilities.
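To make the Mixture-of-Experts idea concrete, here is a minimal sketch of sparse top-k expert routing in plain Python. This is an illustration of the general technique, not SAIL-VL2's actual implementation: the toy experts, gate scores, and function names below are all hypothetical, and a real model would use learned neural networks for both the experts and the gate.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_scores, top_k=2):
    """Route input x through the top_k experts chosen by the gate.

    experts: list of callables standing in for expert sub-networks
    gate_scores: one (hypothetical) gating score per expert for this input
    Only top_k experts actually run, which is what keeps a sparse MoE
    cheaper than a dense model with the same total parameter count.
    """
    probs = softmax(gate_scores)
    # Rank experts by gate probability and keep the top_k.
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalize over the chosen experts and mix their outputs.
    norm = sum(probs[i] for i in chosen)
    return sum(probs[i] / norm * experts[i](x) for i in chosen)

# Four toy experts; only two of them run for any given input.
experts = [lambda x, k=k: (k + 1) * x for k in range(4)]
y = moe_forward(2.0, experts, gate_scores=[0.1, 2.0, 0.3, 1.5], top_k=2)
```

The key design point is that compute scales with `top_k`, not with the total number of experts, so capacity can grow without a proportional increase in per-token cost.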
Why it matters?
SAIL-VL2 is important because it achieves top-level performance on many different benchmarks, even surpassing models with more parameters. It is also open-source, so other researchers and developers can use and build upon it. This advances the field of multimodal AI and provides a strong foundation for building even more capable systems that understand both visual and textual information.
Abstract
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Three core innovations drive its effectiveness. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.