Temporal Gains, Spatial Costs: Revisiting Video Fine-Tuning in Multimodal Large Language Models

Linghao Zhang, Jungang Li, Yonghua Hei, Sicheng Tao, Song Dai, Yibo Yan, Zihao Dongfang, Weiting Liu, Chenxi Qin, Hanqian Li, Xin Zou, Jiahao Zhang, Shuhang Xun, Haiyun Jiang, Xuming Hu

2026-03-19

Summary

This paper investigates how training AI models on videos affects their ability to understand both videos and still images, focusing on the trade-off between grasping what happens over time (temporal understanding) and where things are located in a scene (spatial understanding).

What's the problem?

Large AI models are getting better at understanding videos through a training step called video supervised fine-tuning (Video-SFT), but it has been unclear whether this improvement comes at the cost of their ability to understand still images. Specifically, researchers didn't know if focusing on the changing aspects of video (like motion) would make a model worse at recognizing objects and scenes in static pictures. In practice, improving video understanding often *hurts* performance on image tasks, and the amount of video information used (how many frames are sampled) plays a big role in this trade-off.

What's the solution?

The researchers systematically tested how Video-SFT affects visual skills across different model architectures, parameter scales, and amounts of sampled video frames. They found that using more frames generally helps on video tasks but doesn't necessarily improve, and can even worsen, performance on image tasks. To address this, they studied an instruction-aware 'Hybrid-Frame' strategy that adaptively chooses how many frames to sample based on the task, aiming to balance video and image understanding.
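The core idea of adaptive frame allocation can be illustrated with a minimal sketch. Note that the paper does not specify its Hybrid-Frame implementation here; the keyword heuristic, the function names, and the frame budgets below are all illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of an instruction-aware frame-allocation policy.
# The cue words and budget values are illustrative assumptions, not the
# paper's actual Hybrid-Frame strategy.

TEMPORAL_CUES = {"happen", "before", "after", "motion", "order", "while", "sequence"}

def allocate_frames(instruction: str,
                    low_budget: int = 4,
                    high_budget: int = 32) -> int:
    """Return a frame count: a large budget for temporally-focused questions,
    a small, image-like budget for spatially-focused ones."""
    words = set(instruction.lower().split())
    if words & TEMPORAL_CUES:
        return high_budget  # temporal reasoning benefits from more frames
    return low_budget       # spatial questions keep a small frame budget

def sample_frame_indices(num_video_frames: int, budget: int) -> list[int]:
    """Uniformly sample `budget` frame indices across the video."""
    budget = min(budget, num_video_frames)
    step = num_video_frames / budget
    return [int(i * step) for i in range(budget)]
```

For example, a question like "What happens after the dog jumps?" would get the large budget, while "What color is the car?" would get the small one, so static-image-style queries are processed with roughly image-like input.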

Why it matters?

This research shows that simply training an AI model on videos doesn't automatically make it better at *all* visual tasks. It highlights the challenge of building models that understand static images and dynamic videos equally well, and suggests we need smarter training methods that keep a model from sacrificing image understanding while it learns about videos. This matters because well-rounded visual understanding is crucial for many real-world applications, such as robotics and self-driving cars.

Abstract

Multimodal large language models (MLLMs) are typically trained in multiple stages, with video-based supervised fine-tuning (Video-SFT) serving as a key step for improving visual understanding. Yet its effect on the fine-grained evolution of visual capabilities, particularly the balance between spatial and temporal understanding, remains poorly understood. In this paper, we systematically study how Video-SFT reshapes visual capabilities in MLLMs. Across architectures, parameter scales, and frame sampling settings, we observe a consistent pattern: Video-SFT reliably improves video performance, but often yields limited gains or even degradation on static image benchmarks. We further show that this trade-off is closely tied to temporal budget: increasing the number of sampled frames generally improves video performance, but does not reliably improve static image performance. Motivated by this finding, we study an instruction-aware Hybrid-Frame strategy that adaptively allocates frame counts and partially mitigates the image-video trade-off. Our results indicate that Video-SFT is not a free lunch for MLLMs, and preserving spatial understanding remains a central challenge in joint image-video training.