
SIGMA: Sinkhorn-Guided Masked Video Modeling

Mohammadreza Salehi, Michael Dorkenwald, Fida Mohammad Thoker, Efstratios Gavves, Cees G. M. Snoek, Yuki M. Asano

2024-07-24


Summary

This paper introduces SIGMA, a new method for improving how video models learn from data during pretraining. It aims to capture both fine-grained details and higher-level semantics, helping models learn high-quality video representations.

What's the problem?

Many current video modeling techniques struggle to understand complex aspects of videos because they are trained to reconstruct predefined low-level targets, such as raw pixels. This keeps them from capturing the higher-level meanings and relationships in video content, which is essential for tasks that require a deeper understanding of the visuals.

What's the solution?

To solve this problem, SIGMA changes what the model is asked to predict. First, instead of reconstructing fixed low-level targets, it uses a projection network that is trained jointly with the video model to produce the target feature space itself. Second, because optimizing both networks together with a plain reconstruction loss could collapse into trivial solutions, SIGMA spreads the features of space-time tubes evenly across a limited number of learnable clusters, framing this as an optimal transport problem solved with the Sinkhorn algorithm. The resulting cluster assignments become targets for a symmetric prediction task in which the video model predicts the assignments produced by the projection network and vice versa, pushing the learned features to carry semantic and temporal meaning.
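To make the clustering step concrete, below is a minimal sketch (not the authors' code) of Sinkhorn-Knopp cluster assignment of the kind SIGMA relies on: tube features are scored against a small set of learnable prototypes, and a few Sinkhorn iterations balance the soft assignments so they spread evenly across the batch. The function name sinkhorn_assignments, the shapes, and the hyperparameter values are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of Sinkhorn-Knopp soft cluster assignment (not the official SIGMA code).
import torch
import torch.nn.functional as F

def sinkhorn_assignments(scores, n_iters=3, epsilon=0.05):
    """Turn feature-prototype similarity scores (B x K) into balanced soft
    assignments via Sinkhorn-Knopp row/column normalization."""
    Q = torch.exp(scores / epsilon).T          # K x B
    Q /= Q.sum()                               # normalize total mass to 1
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True)        # rows: each prototype receives equal mass
        Q /= K
        Q /= Q.sum(dim=0, keepdim=True)        # columns: each sample's assignment sums to 1/B
        Q /= B
    return (Q * B).T                           # B x K, each row is a distribution over clusters

# Toy usage: 64 space-time tube features of dim 128, 16 learnable prototypes.
features = F.normalize(torch.randn(64, 128), dim=1)
prototypes = F.normalize(torch.randn(16, 128), dim=1)
scores = features @ prototypes.T               # cosine similarities, B x K
assignments = sinkhorn_assignments(scores)     # pseudo-labels used as prediction targets
print(assignments.shape, assignments.sum(dim=1)[:3])
```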

Why it matters?

This research is important because it enhances how AI systems process and understand videos, leading to better performance in various applications like video analysis, surveillance, and content generation. By improving video representation learning, SIGMA can help create more advanced AI systems capable of interpreting and generating videos with greater accuracy and detail.

Abstract

Video-based pretraining offers immense potential for learning strong visual representations on an unprecedented scale. Recently, masked video modeling methods have shown promising scalability, yet fall short in capturing higher-level semantics due to reconstructing predefined low-level targets such as pixels. To tackle this, we present Sinkhorn-guided Masked Video Modelling (SIGMA), a novel video pretraining method that jointly learns the video model in addition to a target feature space using a projection network. However, this simple modification means that the regular L2 reconstruction loss will lead to trivial solutions as both networks are jointly optimized. As a solution, we distribute features of space-time tubes evenly across a limited number of learnable clusters. By posing this as an optimal transport problem, we enforce high entropy in the generated features across the batch, infusing semantic and temporal meaning into the feature space. The resulting cluster assignments are used as targets for a symmetric prediction task where the video model predicts cluster assignment of the projection network and vice versa. Experimental results on ten datasets across three benchmarks validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations improving upon state-of-the-art methods. Our project website with code is available at: https://quva-lab.github.io/SIGMA.
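For readers who want a feel for the symmetric prediction task described in the abstract, the sketch below shows one way such a swapped-prediction objective could look: each branch (video model and projection network) is trained to predict the Sinkhorn assignments computed from the other branch. This is an assumption-laden illustration in the style of swapped-assignment losses, not the paper's implementation; the function name, temperature, and shapes are hypothetical.

```python
# Hedged sketch of a symmetric (swapped) cluster-prediction loss, not SIGMA's official code.
import torch
import torch.nn.functional as F

def swapped_prediction_loss(scores_video, scores_proj, targets_video, targets_proj, temperature=0.1):
    """scores_*: B x K prototype logits from each branch.
    targets_*: B x K Sinkhorn assignments computed from that same branch's scores
    (e.g. with a routine like sinkhorn_assignments above); no gradient flows through them."""
    log_p_video = F.log_softmax(scores_video / temperature, dim=1)
    log_p_proj = F.log_softmax(scores_proj / temperature, dim=1)
    # Each branch predicts the other branch's cluster assignments.
    loss_video = -(targets_proj.detach() * log_p_video).sum(dim=1).mean()
    loss_proj = -(targets_video.detach() * log_p_proj).sum(dim=1).mean()
    return 0.5 * (loss_video + loss_proj)
```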