Video Action Differencing

James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy

2025-03-12

Summary

This paper introduces Video Action Differencing (VidDiff), a new AI method that spots subtle differences between videos of people performing the same action, much like a coach comparing two athletes’ jumps or dance moves.

What's the problem?

Current AI models can’t accurately find small differences in videos of the same action, like spotting if someone’s squat is deeper or their guitar strum is faster, because they struggle to focus on the right moments and compare details frame by frame.

What's the solution?

VidDiff uses a three-step method: first, an AI suggests possible differences (like ‘higher jump’), then it finds the exact frames where those differences happen, and finally, another AI checks those frames to confirm which video shows the difference better.
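The three stages above can be sketched as a simple pipeline. This is an illustrative outline only: the function names, stub logic, and data structures are assumptions for demonstration, not the paper's actual implementation (which delegates each stage to specialized foundation models).

```python
# Hypothetical sketch of the VidDiff three-stage workflow.
# Videos are represented as lists of frames; each stage is stubbed
# with toy logic where the paper would call a foundation model.

def propose_differences(action):
    # Stage 1: a language model proposes candidate differences
    # for the given action. Stubbed with a fixed list.
    return ["deeper squat", "faster descent"]

def localize_keyframe(video, difference):
    # Stage 2: a localization model finds the frame index where
    # the difference is observable. Stubbed: the middle frame.
    return len(video) // 2

def compare_frames(frame_a, frame_b, difference):
    # Stage 3: a vision-language model judges which video's frame
    # better exhibits the difference. Stubbed with a numeric check.
    return "A" if frame_a > frame_b else "B"

def viddiff(video_a, video_b, action):
    # Run all three stages for each proposed difference.
    results = {}
    for diff in propose_differences(action):
        frame_a = video_a[localize_keyframe(video_a, diff)]
        frame_b = video_b[localize_keyframe(video_b, diff)]
        results[diff] = compare_frames(frame_a, frame_b, diff)
    return results
```

Decomposing the task this way lets each stage use a model suited to it (proposal, temporal localization, fine-grained comparison) rather than asking one model to do everything at once.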

Why does it matter?

This helps coaches, doctors, or musicians give better feedback by automatically analyzing performance videos, making skill learning faster and more precise without needing expert eye-tracking.

Abstract

How do two individuals differ when performing the same action? In this work, we introduce Video Action Differencing (VidDiff), the novel task of identifying subtle differences between videos of the same action, which has many applications, such as coaching and skill learning. To enable development on this new task, we first create VidDiffBench, a benchmark dataset containing 549 video pairs, with human annotations of 4,469 fine-grained action differences and 2,075 localization timestamps indicating where these differences occur. Our experiments demonstrate that VidDiffBench poses a significant challenge for state-of-the-art large multimodal models (LMMs), such as GPT-4o and Qwen2-VL. By analyzing failure cases of LMMs on VidDiffBench, we highlight two key challenges for this task: localizing relevant sub-actions over two videos and fine-grained frame comparison. To overcome these, we propose the VidDiff method, an agentic workflow that breaks the task into three stages: action difference proposal, keyframe localization, and frame differencing, each stage utilizing specialized foundation models. To encourage future research in this new task, we release the benchmark at https://huggingface.co/datasets/jmhb/VidDiffBench and code at http://jmhb0.github.io/viddiff.