Video-CoE: Reinforcing Video Event Prediction via Chain of Events
Qile Su, Jing Tang, Rui Chen, Lei Sun, Xiangxiang Chu
2026-03-19
Summary
This paper focuses on a challenging problem for artificial intelligence: predicting what will happen next in a video. It explores how well current AI models, specifically those that understand both images and language, can actually anticipate future events in videos.
What's the problem?
Existing AI models struggle with video event prediction because it requires understanding not just *what* is happening in a video, but also *why* things are happening and what logically follows. They often miss subtle visual cues and have trouble reasoning about the sequence of events to accurately guess what comes next. The paper shows these models aren't very good at this task and identifies that they lack the ability to connect visual information with logical reasoning about future possibilities.
What's the solution?
The researchers introduced a new approach called 'Chain of Events.' This method helps the AI model focus on the important visual details in the video and understand the connections between events over time. It essentially forces the model to think through a series of steps, building a 'chain' of what has happened so far and reasoning from that chain about what is likely to happen next, to improve its predictions. They also combined several training protocols to encourage the model to reason more effectively.
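To make the idea concrete, here is a minimal sketch of how chain-of-events prompting might look in practice. All function names and prompt wording are illustrative assumptions, not the paper's actual implementation: the prompt asks the model to first enumerate the observed events as an ordered chain before committing to a prediction, and a small parser recovers that chain from the output.

```python
# Hypothetical sketch of Chain-of-Events prompting (names and prompt text are
# illustrative; the paper's exact prompts and pipeline are not shown here).

def build_coe_prompt(question: str, options: list[str]) -> str:
    """Compose a prompt that asks for an event chain before the answer."""
    option_lines = "\n".join(
        f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options)
    )
    return (
        "Watch the video and list the key events in order as a chain:\n"
        "Event 1 -> Event 2 -> ... -> Event N\n"
        "Then, reasoning from this chain, predict the most likely next event.\n\n"
        f"Question: {question}\nOptions:\n{option_lines}\n"
        "Answer with the option letter in parentheses."
    )

def parse_event_chain(model_output: str) -> list[str]:
    """Split an 'A -> B -> C' style chain in the model's output into events."""
    chain_line = next(
        (ln for ln in model_output.splitlines() if "->" in ln), ""
    )
    return [e.strip() for e in chain_line.split("->") if e.strip()]
```

The key design point is that the chain is produced before the answer, so the model's final prediction is conditioned on its own explicit account of the video's event sequence rather than on the raw question alone.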
Why it matters?
Improving video event prediction is important because it's a key step towards creating AI systems that can truly understand and interact with the world around them. This technology could be used in things like self-driving cars, security systems, or even helping people with disabilities, allowing machines to anticipate needs and respond appropriately. This research sets a new standard for performance in this area, showing what's possible with a more thoughtful approach to AI video understanding.
Abstract
Despite advances in the application of MLLMs to various video tasks, video event prediction (VEP) remains relatively underexplored. VEP requires the model to perform fine-grained temporal modeling of videos and to establish logical relationships between videos and future events, which current MLLMs still struggle with. In this work, we first present a comprehensive evaluation of current leading MLLMs on the VEP task, revealing the reasons behind their inaccurate predictions, including a lack of logical reasoning ability for future event prediction and insufficient utilization of visual information. To address these challenges, we propose the Chain of Events (CoE) paradigm, which constructs temporal event chains to implicitly enforce the MLLM to focus on the visual content and the logical connections between videos and future events, incentivizing the model's reasoning capability through multiple training protocols. Experimental results on public benchmarks demonstrate that our method outperforms both leading open-source and commercial MLLMs, establishing a new state of the art on the VEP task. Code and models will be released soon.
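The abstract mentions reinforcing the model's reasoning via multiple training protocols but does not detail them. A common choice for this kind of setup is a rule-based reward over sampled completions; the sketch below is an assumed example of such a reward (the function name, format bonus, and scoring are my illustration, not the paper's specification): full credit for the correct option letter, plus a small bonus for actually emitting an event chain.

```python
# Assumed rule-based reward for reinforcing CoE-style outputs (illustrative;
# not the paper's actual reward design).
import re

def coe_reward(model_output: str, gold_option: str) -> float:
    """Score a completion: 1.0 for the correct option, +0.2 if it
    contains an 'A -> B -> C' style event chain."""
    has_chain = "->" in model_output                  # emitted an event chain
    match = re.search(r"\(([A-D])\)", model_output)   # extract chosen option
    answer = match.group(1) if match else None
    correct = 1.0 if answer == gold_option else 0.0
    return correct + (0.2 if has_chain else 0.0)
```

Rewards of this shape are easy to verify automatically, which is what makes them attractive for reinforcement-style training on multiple-choice prediction benchmarks.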