TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models

Hyeongmin Lee, Jin-Young Kim, Kyungjune Baek, Jihwan Kim, Hyojun Go, Seongsu Ha, Seokjin Han, Jiho Jang, Raehyuk Jung, Daewoo Kim, GeunOh Kim, JongMok Kim, Jongseok Kim, Junwan Kim, Soonwoo Kwon, Jangwon Lee, Seungjoon Park, Minjoon Seo, Jay Suh, Jaehyuk Yi, Aiden Lee

2024-08-22

Summary

This paper discusses how to evaluate video foundation models fairly and robustly, and introduces TWLV-I, a new model for understanding and processing video content.

What's the problem?

Evaluating video models is challenging because different models are tested under varying conditions, such as the sampling rate, the number of frames, or how many pretraining steps were used. This inconsistency makes it hard to compare their performance fairly.

What's the solution?

The authors propose an evaluation framework that focuses on two core abilities: understanding what things look like (appearance) and how they move (motion). They find that existing models tend to fall short on at least one of these abilities, and introduce a new model called TWLV-I that produces robust visual representations for both appearance-centric and motion-centric videos. Their evaluations show that TWLV-I outperforms comparable, and even larger, models on several benchmarks.

Why it matters?

This research is significant because it helps establish a standard way to evaluate video models, which can lead to better performance in tasks like action recognition. Improving how we assess these models can enhance applications in areas such as entertainment, security, and autonomous driving.

Abstract

In this work, we discuss evaluating video foundation models in a fair and robust manner. Unlike language or image foundation models, many video foundation models are evaluated with differing parameters (such as sampling rate, number of frames, pretraining steps, etc.), making fair and robust comparisons challenging. Therefore, we present a carefully designed evaluation framework for measuring two core capabilities of video comprehension: appearance and motion understanding. Our findings reveal that existing video foundation models, whether text-supervised like UMT or InternVideo2, or self-supervised like V-JEPA, exhibit limitations in at least one of these capabilities. As an alternative, we introduce TWLV-I, a new video foundation model that constructs robust visual representations for both motion- and appearance-based videos. Based on the average top-1 accuracy of linear probing on five action recognition benchmarks, our model, pretrained only on publicly accessible datasets, shows a 4.6%p improvement compared to V-JEPA (ViT-L) and a 7.7%p improvement compared to UMT (ViT-L). Even when compared to much larger models, our model demonstrates a 7.2%p improvement compared to DFN (ViT-H), a 2.7%p improvement compared to V-JEPA (ViT-H), and a 2.8%p improvement compared to InternVideo2 (ViT-g). We provide embedding vectors obtained by TWLV-I from videos of several commonly used video benchmarks, along with evaluation source code that can directly utilize these embeddings. The code is available at https://github.com/twelvelabs-io/video-embeddings-evaluation-framework.
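The comparisons above rest on linear probing: the pretrained backbone stays frozen and only a linear classifier is trained on its embeddings, so the top-1 accuracy reflects the quality of the representations themselves rather than any fine-tuning. Below is a minimal sketch of that protocol using scikit-learn, assuming hypothetical .npy files of precomputed clip embeddings and labels; it is illustrative only and is not the authors' released evaluation code.

# Minimal sketch of linear probing on frozen video embeddings.
# The file names below are placeholders for embeddings exported by a video
# foundation model such as TWLV-I; this is not the official evaluation code.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Precomputed clip-level embeddings (N x D) and integer class labels (N,)
X_train = np.load("train_embeddings.npy")
y_train = np.load("train_labels.npy")
X_test = np.load("test_embeddings.npy")
y_test = np.load("test_labels.npy")

# Linear probe: the backbone is never updated; only this linear classifier
# is fit on top of the frozen embeddings.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Top-1 accuracy of the probe, the metric reported in the abstract.
top1 = accuracy_score(y_test, probe.predict(X_test))
print(f"Linear-probe top-1 accuracy: {top1:.3f}")

Running such a probe on each action recognition benchmark and averaging the top-1 accuracies gives the kind of aggregate score the abstract uses to compare TWLV-I against V-JEPA, UMT, DFN, and InternVideo2.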