Online Generic Event Boundary Detection
Hyungrok Jung, Daneul Kim, Seunggyun Lim, Jeany Son, Jonghyun Choi
2025-10-09
Summary
This paper introduces a new way to automatically detect when one event ends and another begins in a video, focusing on doing this in real time as the video plays, much like how humans understand events as they happen.
What's the problem?
Current methods for detecting event boundaries in videos need to see the entire video before making a decision, which isn't how we naturally perceive things. We process information as it comes, not all at once. The challenge is to identify subtle changes between events in a video stream *without* knowing what comes next, and to do so quickly.
What's the solution?
The researchers created a system called 'Estimator' that mimics how our brains segment ongoing activity into events. It predicts what the next frame of the video will look like based only on the frames it has already seen, then compares that prediction to the actual next frame. If the prediction is far off, the system signals that a new event has likely started. Rather than using a fixed cutoff, it adaptively adjusts how large an error must be to count as a boundary based on the errors it has seen so far, making it adaptable to different types of videos and events.
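To make the predict-and-compare idea concrete, here is a minimal sketch rather than the paper's actual model: it uses a trivial stand-in predictor that assumes the next frame will look like the current one, and flags a boundary whenever the prediction error is unusually large compared with the errors on recent frames. The function name `detect_boundaries`, the window size, and the z-score cutoff are illustrative assumptions, not details from the paper.

```python
import numpy as np

def frame_error(predicted, actual):
    """Mean absolute pixel difference between a predicted and an observed frame."""
    return float(np.mean(np.abs(predicted.astype(np.float32) - actual.astype(np.float32))))

def detect_boundaries(frames, window=16, z_thresh=2.0):
    """Flag frame t as a boundary when its prediction error is unusually large
    relative to the errors seen on recent frames (a simple adaptive threshold)."""
    boundaries, errors = [], []
    prev = None
    for t, frame in enumerate(frames):
        if prev is not None:
            # Trivial stand-in predictor: assume the ongoing event continues
            # unchanged, i.e. predict frame t to look like frame t-1. The
            # paper's CEA would instead generate a learned prediction.
            err = frame_error(prev, frame)
            recent = errors[-window:]
            if len(recent) >= 2:
                mu, sigma = np.mean(recent), np.std(recent) + 1e-6
                if (err - mu) / sigma > z_thresh:  # error well outside the recent range
                    boundaries.append(t)
            errors.append(err)
        prev = frame
    return boundaries

# Toy stream: slowly flickering dark frames that abruptly become bright at t=20.
frames = [np.full((32, 32), (10 if t < 20 else 200) + t % 3, dtype=np.uint8) for t in range(40)]
print(detect_boundaries(frames))  # expected to report a boundary near t=20
```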
Why it matters?
This work is important because it brings computer vision closer to how humans understand video. Being able to process videos in real time opens up possibilities for applications like automatically editing videos, creating summaries, or even helping robots understand their surroundings as they experience them.
Abstract
Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans, who process data online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), aiming to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real time, without access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline GEBD methods on the Kinetics-GEBD and TAPOS datasets.
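The abstract does not specify which statistical test the OBD applies to past errors, so the following sketch is only one plausible reading: keep a sliding window of recent prediction errors, model them as roughly Gaussian, and declare a boundary when a new error is statistically unlikely under that distribution. The class name `OnlineBoundaryDiscriminator` used here, the window size, the significance level, and the one-sided Gaussian test are all assumptions for illustration, not the authors' implementation.

```python
import math
from collections import deque

class OnlineBoundaryDiscriminator:
    """Sketch of an OBD-style decision rule: treat recent prediction errors as
    samples from a Normal distribution and flag a boundary when a new error is
    statistically unlikely under that distribution (one-sided test)."""

    def __init__(self, window: int = 32, alpha: float = 0.01):
        self.past_errors = deque(maxlen=window)  # sliding window of recent errors
        self.alpha = alpha                       # significance level of the test

    def step(self, error: float) -> bool:
        is_boundary = False
        if len(self.past_errors) >= 8:           # need enough samples to fit the Normal
            mu = sum(self.past_errors) / len(self.past_errors)
            var = sum((e - mu) ** 2 for e in self.past_errors) / len(self.past_errors)
            sigma = math.sqrt(var) + 1e-8
            # One-sided tail probability P(E >= error) under N(mu, sigma^2).
            p_value = 0.5 * math.erfc((error - mu) / (sigma * math.sqrt(2.0)))
            is_boundary = p_value < self.alpha
        if not is_boundary:
            # Only "within-event" errors update the reference statistics, so a
            # detected boundary does not inflate the threshold for later frames.
            self.past_errors.append(error)
        return is_boundary

# Usage with synthetic prediction errors: small within-event noise, one large spike.
obd = OnlineBoundaryDiscriminator()
errors = [0.1, 0.12, 0.09, 0.11, 0.1, 0.13, 0.08, 0.1, 0.11, 0.09, 0.95, 0.1]
print([t for t, e in enumerate(errors) if obd.step(e)])  # expect the spike at index 10
```

One design choice worth noting in this sketch: errors from frames flagged as boundaries are kept out of the sliding window, so a single large spike does not loosen the threshold for the frames that follow.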