Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song
2025-10-07

Summary
This paper is a comprehensive overview of how to improve video-understanding AI models, specifically those called Video-Large Multimodal Models (Video-LMMs), *after* they’ve already been initially trained.
What's the problem?
Video-LMMs are good at basic perception of video and language, but they need extra training to become truly skilled at *reasoning* about what’s happening in a video. The methods for this extra training are scattered across the literature and not well organized, making it hard for researchers to know what works best and how different techniques relate to each other. Specifically, these models struggle with understanding events over time, pinpointing exactly *where* and *when* things happen in a video, handling very long videos efficiently, and combining evidence from both the video and any accompanying text.
What's the solution?
The paper breaks down the post-training process into three main areas: first, supervised fine-tuning on detailed, step-by-step (chain-of-thought) reasoning examples; second, reinforcement learning, where the model gets 'rewards' for verifiable, correct answers; and third, test-time scaling, which improves how the model processes information *during* use, without changing the model itself. It organizes these techniques, explains how they connect, and highlights the adaptations needed specifically for video data. The authors also analyze existing methods, identify key design principles, and suggest how to evaluate these models effectively.
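To make the second pillar concrete, here is a minimal sketch of what a "verifiable reward" for reinforcement learning might look like. It assumes the model is prompted to wrap its reasoning in `<think>` tags and its final answer in `<answer>` tags, and that each task has a single known ground-truth answer (e.g., a multiple-choice letter or a timestamp); the tag format and the `verifiable_reward` function are illustrative assumptions, not an implementation from the paper.

```python
import re

# Rule-based, checkable reward: a small format bonus plus exact-match accuracy.
# The <think>/<answer> tag convention below is an assumed prompt format.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Return a scalar reward for one sampled completion."""
    reward = 0.0
    # Bonus for following the expected output format.
    if THINK_RE.search(completion) and ANSWER_RE.search(completion):
        reward += 0.1
    # Main reward: does the extracted answer match the ground truth?
    match = ANSWER_RE.search(completion)
    if match and match.group(1).strip().lower() == ground_truth.strip().lower():
        reward += 1.0
    return reward

# Example: a video QA completion that reasons over frames, then commits
# to a single checkable answer.
completion = (
    "<think>The person picks up the cup around 00:12 and drinks at 00:15.</think>"
    "<answer>B</answer>"
)
print(verifiable_reward(completion, "B"))  # 1.1
```

Because the reward is computed by a simple rule rather than a learned judge, it is cheap to evaluate at scale and hard for the model to game, which is why verifiable objectives are attractive for post-training.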
Why it matters?
This work is important because it provides a unified guide for researchers and developers working with Video-LMMs. By clearly outlining the best practices and open challenges in post-training, it helps accelerate progress in video understanding AI, leading to more capable and reliable systems for tasks like video analysis, automated content understanding, and more.
Abstract
Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training
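As a companion to the third pillar (test-time scaling), the sketch below shows one common inference-time strategy, self-consistency via majority voting: sample several reasoning paths and keep the most frequent final answer. The `generate(prompt, temperature)` callable is a placeholder for a Video-LMM sampling interface, not an API defined in the survey.

```python
from collections import Counter
import random

def self_consistent_answer(generate, prompt: str, n_samples: int = 8) -> str:
    """Sample several answers at nonzero temperature and majority-vote."""
    answers = [generate(prompt, temperature=0.8).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Usage with a stub generator standing in for the actual model.
if __name__ == "__main__":
    stub = lambda prompt, temperature: random.choice(["A", "A", "A", "B"])
    print(self_consistent_answer(stub, "Which action happens first in the video?"))
```

This trades extra inference compute for reliability without changing the model's weights, which is the defining property of test-time scaling methods.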