Kimi-VL Technical Report

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, Bowen Wang, Cheng Chen, Chenlin Zhang, Chenzhuang Du, Chu Wei, Congcong Wang, Dehao Zhang, Dikang Du, Dongliang Wang, Enming Yuan, Enzhe Lu, Fang Li, Flood Sung, Guangda Wei, Guokun Lai, Han Zhu, Hao Ding

2025-04-11

Summary

This paper introduces Kimi-VL, a new open-source AI model that can understand both images and text at a high level, similar to how people can reason about what they see and read. Kimi-VL stands out because it can handle complex tasks like solving college-level problems, reading text in images, doing math, and understanding long documents or videos, all while using less computing power than many of its competitors.

What's the problem?

The main problem is that most powerful vision-language models (which can understand both images and text) need a lot of computer resources to work well, especially for tasks that require understanding long documents, multiple images, or complex reasoning. This makes them hard to use for many people and organizations, and it limits their usefulness in real-world situations where efficiency and flexibility are important.

What's the solution?

To solve this, the authors built Kimi-VL on a Mixture-of-Experts (MoE) architecture, in which only a subset of the model's parameters is activated for each input, saving compute. They combined a native-resolution image encoder (called MoonViT) with a language decoder whose context window holds up to 128K tokens, so it can take in long documents and videos at once. They also created a variant called Kimi-VL-Thinking, trained with long chain-of-thought supervised fine-tuning and reinforcement learning so that it works through problems step by step. This makes Kimi-VL both capable and efficient, letting it compete with or beat larger models on many benchmarks.
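The idea of using only some parts of the model per input can be illustrated with a generic top-k routing sketch. This is not Kimi-VL's actual router: the number of experts, the value of `k`, the toy scalar "experts", and the function names `top_k_route` and `moe_forward` are all assumptions made for illustration, since this summary does not give those details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Hypothetical helper: returns a list of (expert_index, weight) pairs.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    routes = top_k_route(router_logits, k)
    output = 0.0
    for idx, weight in routes:
        output += weight * experts[idx](x)  # unselected experts are never called
    return output, routes


# Toy setup (assumed): four experts that just scale their input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
output, routes = moe_forward(1.0, experts, router_logits=[0.1, 2.0, -1.0, 2.0], k=2)
print(routes, output)
```

In this toy run only two of the four experts are evaluated, which is the source of the compute savings: the total parameter count can be large while the per-input cost depends only on the experts that are activated.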

Why does it matter?

Kimi-VL matters because it shows that you don't need a massive, expensive AI to get top-level results on hard tasks involving both images and language. It makes advanced AI more accessible by being open-source and efficient, and it sets a new standard for how well smaller, smarter models can perform. This could help more people and companies use AI for things like education, research, and automation without needing huge amounts of computing power.

Abstract

We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), this model exhibits strong long-horizon reasoning capabilities. It achieves scores of 61.7 on MMMU, 36.8 on MathVision, and 71.3 on MathVista while maintaining the compact 2.8B activated LLM parameters, setting a new standard for efficient multimodal thinking models. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.
