Qwen3-VL Technical Report
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin
2025-12-04
Summary
This paper introduces Qwen3-VL, a new vision-language model, meaning it can understand and connect information from both images/videos and text. It is the most capable model of its kind the Qwen team has released so far, performing well across many benchmarks that measure how well AI handles visual and textual information together.
What's the problem?
Existing vision-language models often struggle in a few key areas: they aren't always strong at understanding text on its own, they have trouble keeping track of information in very long content (like lengthy documents or videos), and they can have difficulty reasoning accurately about what they 'see' in images or videos. In short, they need to handle complex information across different formats, and over extended spans, more reliably.
What's the solution?
The researchers developed Qwen3-VL with several improvements. First, they made the model better at understanding text on its own. Second, they expanded the amount of information the model can process at once to 256,000 tokens, roughly the length of a long novel, which lets it remember and connect details across long videos and documents. Finally, they improved how the model reasons about images and videos, making it more accurate on tasks like solving visual math problems. These gains come from upgrades to how the model encodes spatial and temporal information, better integration of visual features into the language model, and more precise alignment of text with specific moments in videos.
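The "alignment of text with specific moments in videos" mentioned above can be pictured as pairing each sampled video frame with an explicit textual timestamp in the input sequence. The sketch below is a hypothetical illustration of that idea; the token format, function name, and frame placeholders are assumptions for clarity, not the actual Qwen3-VL implementation.

```python
# Hypothetical sketch: interleave textual timestamps with video frame
# placeholders so the model can ground events in time. The "<... seconds>"
# format and "<frame_i>" placeholders are illustrative assumptions.

def build_video_sequence(num_frames: int, fps: float) -> list[str]:
    """Return a token sequence where each frame placeholder is preceded
    by an explicit textual timestamp (e.g. "<0.5 seconds>")."""
    tokens: list[str] = []
    for i in range(num_frames):
        seconds = i / fps
        tokens.append(f"<{seconds:.1f} seconds>")  # textual timestamp
        tokens.append(f"<frame_{i}>")              # stand-in for visual tokens
    return tokens

seq = build_video_sequence(num_frames=4, fps=2.0)
# e.g. ['<0.0 seconds>', '<frame_0>', '<0.5 seconds>', '<frame_1>', ...]
```

Because the timestamps are ordinary text, a question like "what happens at 1.5 seconds?" can be answered by attending to the matching timestamp token rather than relying only on positional encodings.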
Why it matters?
Qwen3-VL is important because it represents a significant step forward in AI's ability to understand the world like humans do – by combining what we see and what we read. This could be used to build more intelligent systems for things like image-based problem solving, creating AI assistants that can make decisions based on visual input, and even helping AI understand and generate code related to images and videos.
Abstract
We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.
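The "interleaved-MRoPE" upgrade in the abstract builds on multi-axis rotary positions, where each visual token carries a (temporal, height, width) index triple rather than a single 1-D position. The sketch below only illustrates that 3-D position bookkeeping under stated assumptions; the actual interleaving of rotary frequencies across axes in Qwen3-VL is not reproduced here.

```python
# Hypothetical sketch of MRoPE-style position ids for a video clip:
# every visual token gets a (t, h, w) triple, so attention can reason
# separately about time and 2-D layout. Grid sizes are illustrative.

def mrope_position_ids(frames: int, height: int, width: int) -> list[tuple[int, int, int]]:
    """Return one (temporal, height, width) position triple per visual token."""
    ids: list[tuple[int, int, int]] = []
    for t in range(frames):          # temporal axis: which frame
        for h in range(height):      # spatial axis: patch row
            for w in range(width):   # spatial axis: patch column
                ids.append((t, h, w))
    return ids

ids = mrope_position_ids(frames=2, height=2, width=2)
# 8 tokens; the first frame's tokens share t=0, the second frame's t=1
```

In a full model, each axis of the triple would index its own set of rotary frequencies; the per-axis indices shown here are the inputs to that step.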