TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action

Jen-Hao Cheng, Vivian Wang, Huayu Wang, Huapeng Zhou, Yi-Hao Peng, Hou-I Liu, Hsiang-Wei Huang, Kuang-Ming Chen, Cheng-Yen Yang, Wenhao Chai, Yi-Ling Chen, Vibhav Vineet, Qin Cai, Jenq-Neng Hwang

2025-05-06

Summary

This paper introduces TEMPURA, a new AI system that gets better at understanding videos by learning to fill in missing events, explain why things happen, and describe scenes in detail.

What's the problem?

AI models often struggle to make sense of videos: they have trouble figuring out the order of events, explaining why things happen, and describing everything accurately, which limits how useful they are for video analysis.

What's the solution?

The researchers trained TEMPURA in two stages: first, it learns to predict and reconstruct events that are hidden or masked in videos; then it builds on this skill to give clear causal explanations, break videos into meaningful segments, and write detailed captions, all trained on a huge video dataset.
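To make the first stage concrete, here is a minimal sketch of the masked event prediction idea: hide one event in a video's timeline and ask the model to reconstruct it from the surrounding events. This is an illustration only; the function name, the `[MASK]` placeholder, and the text-only event representation are assumptions, not the paper's actual interface.

```python
def make_masked_example(events, mask_index):
    """Hide one event in an ordered timeline of event descriptions.

    Returns (masked_sequence, target): the sequence with a placeholder
    at mask_index, and the hidden event text as the prediction target
    the model would be trained to reconstruct.
    """
    if not 0 <= mask_index < len(events):
        raise IndexError("mask_index out of range")
    masked = list(events)          # copy so the original timeline is untouched
    target = masked[mask_index]    # the event the model must recover
    masked[mask_index] = "[MASK]"  # illustrative placeholder token
    return masked, target


# Example timeline from a cooking video (invented for illustration)
events = [
    "a person opens the fridge",
    "they take out a carton of eggs",
    "they crack eggs into a pan",
]
masked, target = make_masked_example(events, 1)
# masked == ["a person opens the fridge", "[MASK]", "they crack eggs into a pan"]
# target == "they take out a carton of eggs"
```

Training on many such examples forces the model to reason about what must have happened between the visible events, which is the temporal and causal skill the second stage then reuses.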

Why does it matter?

This matters because it helps AI understand videos more like people do, making it more useful for things like security, education, entertainment, and helping people find important moments in long videos.

Abstract

TEMPURA, a two-stage training framework, enhances video temporal understanding by reconstructing missing events, generating causal explanations, and performing dense video segmentation and captioning using a large-scale dataset.