
Kwai Keye-VL 1.5 Technical Report

Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua

2025-09-03


Summary

This paper introduces Keye-VL-1.5, an improved multimodal model for understanding videos. It builds on recent advances in Large Language Models (LLMs), which can now handle data beyond text, such as images and video.

What's the problem?

Understanding videos is really hard for computers. Videos are dynamic and dense with information: a lot happens, and it changes quickly. Current models struggle to balance looking at the details in each frame (spatial resolution) with tracking changes over time (temporal coverage). They either miss important details or can't follow the story of the video.

What's the solution?

The researchers tackled this problem in three main ways. First, they created a "Slow-Fast" encoding system that decides how much detail to spend on each part of the video: frames with significant visual changes get processed closely at high resolution (the Slow pathway), while mostly static stretches are skimmed at lower resolution (the Fast pathway). Second, they trained the model in four progressive stages to handle longer and longer inputs, extending its context window from 8K to 128K tokens so it can process long videos. Finally, they refined the model's reasoning skills and aligned its answers with human preferences, using a careful process of chain-of-thought data construction, reinforcement learning, and fine-tuning of the model's responses.
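To make the Slow-Fast idea concrete, here is a toy sketch (my own illustration, not the paper's implementation) that routes each frame to a high-resolution or low-resolution pathway based on its similarity to the last kept key frame. The cosine-similarity measure and the 0.9 threshold are assumptions for illustration:

```python
import numpy as np

def route_frames(frames, sim_threshold=0.9):
    """Toy Slow-Fast router: a frame that differs enough from the last
    key frame goes to the 'slow' (high-resolution) pathway; near-static
    frames go to the 'fast' (low-resolution) pathway.

    The similarity measure and threshold here are assumptions, not the
    paper's exact method.
    """
    slow, fast = [], []
    prev = None  # normalized feature of the last key frame
    for i, frame in enumerate(frames):
        v = frame.astype(np.float32).ravel()
        v /= np.linalg.norm(v) + 1e-8  # normalize for cosine similarity
        if prev is None or float(v @ prev) < sim_threshold:
            slow.append(i)  # big visual change: process at high resolution
            prev = v
        else:
            fast.append(i)  # mostly static: cheap low-resolution pass
    return slow, fast

# Example: three identical frames, then an abrupt scene change
frames = [np.ones((4, 4))] * 3 + [np.eye(4)] * 3
slow, fast = route_frames(frames)
```

In this toy clip only frames 0 and 3 (the first frame and the scene change) land on the slow pathway; the repeats are handled cheaply, which is the resource trade-off the paper's encoder exploits.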

Why it matters?

This work is important because it significantly improves how well computers can understand videos. This has a lot of potential applications, like better video search, automatic video summarization, and even robots that can understand and react to the world around them. Keye-VL-1.5 performs better than other models at understanding what’s happening in videos while still being good at other tasks involving images and text.

Abstract

In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
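The progressive context extension from 8K to 128K tokens can be pictured as a simple stage schedule. The sketch below is purely illustrative: the report (as summarized here) states only the endpoints and the four-stage structure, so the geometric intermediate lengths are my assumption:

```python
def context_schedule(start=8_192, end=131_072, stages=4):
    """Illustrative geometric schedule of per-stage context lengths.

    Assumption: lengths grow by a constant ratio from `start` to `end`
    across `stages` stages. The actual per-stage lengths used for
    Keye-VL-1.5 are not specified here.
    """
    ratio = (end / start) ** (1 / (stages - 1))
    return [round(start * ratio ** i) for i in range(stages)]

schedule = context_schedule()
```

Each stage trains on longer sequences than the last, so the model adapts gradually instead of jumping straight to 128K-token inputs.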