Find the Leak, Fix the Split: Cluster-Based Method to Prevent Leakage in Video-Derived Datasets

Noam Glazner, Noam Tsfaty, Sharon Shalev, Avishai Weizman

2025-12-01

Summary

This paper addresses how datasets of video frames are built for training computer vision models. Its goal is to stop a model from 'cheating' by being evaluated on images that are nearly identical to ones it saw during training.

What's the problem?

When you create a dataset from a video, frames that are close together in time usually look almost the same. If you randomly split these frames into training, validation, and test sets, the test set can end up containing frames that are nearly identical to frames the model already saw during training. The test score then no longer measures how well the model generalizes to new, unseen video content, and it can look misleadingly optimistic. It's like letting a student see the answers before a test.
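To make the problem concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's experiment: each frame is reduced to a single made-up feature value, frames from the same scene differ only by small noise, and the helper names (`make_frames`, `random_split`, `leaked_fraction`) and the distance threshold are assumptions chosen for the toy.

```python
import random


def make_frames(num_scenes=5, frames_per_scene=40, seed=0):
    """Toy 1-D 'features': frames in one scene differ only by small noise."""
    rng = random.Random(seed)
    frames = []
    for scene in range(num_scenes):
        base = scene * 10.0  # scenes are far apart in feature space
        for _ in range(frames_per_scene):
            frames.append(base + rng.uniform(-0.5, 0.5))
    return frames


def random_split(frames, test_ratio=0.2, seed=0):
    """Frame-level random split, ignoring which scene a frame came from."""
    rng = random.Random(seed)
    shuffled = frames[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def leaked_fraction(train, test, threshold=1.0):
    """Fraction of test frames that have a near-duplicate in the training set."""
    if not test:
        return 0.0
    leaked = sum(
        1 for x in test if any(abs(x - y) <= threshold for y in train)
    )
    return leaked / len(test)


frames = make_frames()
train, test = random_split(frames)
print(f"test frames with a near-duplicate in train: {leaked_fraction(train, test):.0%}")
```

In this toy setup nearly every test frame has a close match in the training set, because the random split scatters each scene's frames across both sets.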

What's the solution?

The researchers came up with a method that first groups frames that look alike, creating 'clusters' of visually similar images. The split into training, validation, and test sets is then made at the level of these clusters rather than individual frames, so near-identical frames stay together instead of being scattered across sets. As a result, each set covers a more diverse range of visual content, and the test set isn't just slightly different versions of images the model has already learned from.
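The idea can be sketched in a few lines of plain Python. This is a hedged illustration under stated assumptions, not the authors' implementation: the summary does not say which features or clustering algorithm the paper uses, so the sketch swaps in made-up 1-D features and a simple greedy threshold clustering, and the function names, ratios, and threshold are all hypothetical.

```python
import random


def greedy_cluster(features, threshold=1.0):
    """Put each frame in the first cluster whose representative is within
    `threshold`; otherwise start a new cluster. (Stand-in for a real
    clustering method; not taken from the paper.)"""
    representatives = []  # one representative feature per cluster
    labels = []
    for x in features:
        for cluster_id, rep in enumerate(representatives):
            if abs(x - rep) <= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(x)
            labels.append(len(representatives) - 1)
    return labels


def split_by_cluster(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign every cluster, as a whole, to train, val, or test."""
    rng = random.Random(seed)
    members = {}
    for frame_idx, cluster_id in enumerate(labels):
        members.setdefault(cluster_id, []).append(frame_idx)
    cluster_ids = list(members)
    rng.shuffle(cluster_ids)

    names = ("train", "val", "test")
    targets = [r * len(labels) for r in ratios]  # desired frames per split
    splits = {name: [] for name in names}
    for cluster_id in cluster_ids:
        # Give the cluster to the split that is furthest below its target.
        deficits = [targets[i] - len(splits[names[i]]) for i in range(3)]
        best = deficits.index(max(deficits))
        splits[names[best]].extend(members[cluster_id])
    return splits


# Toy data: 5 scenes of 40 frames, each scene tightly grouped in feature space.
rng = random.Random(0)
features = [s * 10.0 + rng.uniform(-0.5, 0.5) for s in range(5) for _ in range(40)]

labels = greedy_cluster(features)
splits = split_by_cluster(labels)
for name, frame_ids in splits.items():
    clusters = {labels[i] for i in frame_ids}
    print(name, len(frame_ids), "frames from", len(clusters), "cluster(s)")
```

Because each cluster lands in exactly one set, no test frame shares a cluster with a training frame in this toy data. With real image features, frames near a cluster boundary can still resemble frames in a neighbouring cluster, so the choice of features and clustering method matters.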

Why does it matter?

Splitting this way leads to more reliable and trustworthy evaluations of video analysis models. If the test data is genuinely unseen, a good score gives more confidence that the model will work in real-world applications like self-driving cars or video surveillance.

Abstract

We propose a cluster-based frame selection strategy to mitigate information leakage in video-derived frame datasets. By grouping visually similar frames before splitting into training, validation, and test sets, the method produces more representative, balanced, and reliable dataset partitions.
