Fostering Video Reasoning via Next-Event Prediction

Haonan Wang, Hongfu Liu, Xiangyan Liu, Chao Du, Kenji Kawaguchi, Ye Wang, Tianyu Pang

2025-05-29

Summary

This paper introduces a new way to help AI models understand and reason about videos: teaching them to predict what will happen next in a video.

What's the problem?

The problem is that most AI models struggle to work out the order of events or track how things change over time in videos, which makes it hard for them to answer questions or make predictions about what is going on.

What's the solution?

To solve this, the researchers introduced a learning task called next-event prediction, where the AI tries to guess the next part of a video based on what it has already seen. By practicing this, the model gets better at understanding the flow of events and can reason more accurately about videos, all without needing humans to label the data.
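The key idea is that the training signal comes for free: any video clip can be split into an "observed" part and a "future" part, and the future part serves as the target. A minimal sketch of building such self-supervised training pairs is below; the function name, the frame list, and the `min_context` parameter are illustrative assumptions, not the paper's actual code:

```python
def next_event_pairs(frames, min_context=2):
    """Split a clip into (observed, future) training pairs.

    The model sees `past` and must predict `future` -- the future
    segment itself is the supervision, so no human labels are needed.
    """
    pairs = []
    for split in range(min_context, len(frames)):
        past, future = frames[:split], frames[split:]
        pairs.append((past, future))
    return pairs

# Example: a 4-frame clip yields pairs with growing context.
video = ["frame1", "frame2", "frame3", "frame4"]
pairs = next_event_pairs(video)
```

In practice the "frames" would be video segments fed to a multimodal model, and the prediction would be scored against the real future segment, but the labeling-free structure of the task is the same.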

Why does it matter?

This is important because it helps AI become smarter at understanding videos, which can be useful for things like video search, safety monitoring, and creating better digital assistants that can watch and interpret video content.

Abstract

Next-event prediction (NEP) is proposed as a learning task to enable multimodal large language models (MLLMs) to reason temporally over video inputs, using future video segments as a self-supervised signal.