SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady

2026-02-16

Summary

This paper tackles the challenge of finding videos based on the *actions* happening in them, that is, the motion involved, rather than just what things *look* like or the surrounding environment.

What's the problem?

Current video analysis systems are really good at recognizing objects and scenes, but they often overlook the actual movement within a video. Because they are trained on data that emphasizes appearance, they get confused when motion is the key factor. Traditional motion-focused methods, such as optical flow (tracking how pixels change between frames), don't understand *what* the motion means: is it someone walking, running, or waving? The paper highlights this issue with new benchmarks, called SimMotion, that specifically challenge systems to focus on motion and separate it from appearance.

What's the solution?

The researchers developed a new technique called SemanticMoments. It's surprisingly simple: it takes pre-existing video analysis models (already trained to understand what's happening in a video) and looks at how their features change over time. Instead of just looking at the features themselves, it calculates statistical summaries of them, specifically higher-order moments such as the third moment in the method's name. This captures the *pattern* of motion without any additional training. It's like looking at the shape of the movement, not just individual snapshots. A rough sketch of the idea follows.
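To make this concrete, here is a minimal sketch of the idea, not the authors' code. It assumes per-frame features from some pre-trained semantic encoder are already available as a `(T, D)` array; the hypothetical `temporal_moment_descriptor` below summarizes each feature dimension with its first three temporal central moments and normalizes the result so descriptors can be compared with cosine similarity. The exact moments and normalization used in the paper may differ.

```python
# Minimal sketch of the SemanticMoments idea (not the authors' implementation).
# Assumption: `frame_features` is a (T, D) array of per-frame features from any
# pre-trained semantic encoder.
import numpy as np

def temporal_moment_descriptor(frame_features: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Summarize how each feature dimension evolves over time using its
    first three temporal central moments (mean, variance, third moment)."""
    x = np.asarray(frame_features, dtype=np.float64)    # shape (T, D)
    mean = x.mean(axis=0)                                # 1st moment
    centered = x - mean
    var = (centered ** 2).mean(axis=0)                   # 2nd central moment
    third = (centered ** 3).mean(axis=0)                 # 3rd central moment
    # Standardize the third moment (skewness-like) so feature scale does not dominate.
    skew = third / (var + eps) ** 1.5
    desc = np.concatenate([mean, var, skew])
    return desc / (np.linalg.norm(desc) + eps)           # unit norm -> cosine-ready

# Toy example with random stand-in features: 32 frames, 768-dim encoder output.
descriptor = temporal_moment_descriptor(np.random.randn(32, 768))
print(descriptor.shape)  # (2304,) = 3 moments per feature dimension
```

Because the descriptor is just a statistic of existing features, it needs no extra training data or fine-tuning; the only design choices are which moments to keep and how to normalize them.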

Why it matters?

This work is important because it shows that understanding motion in videos doesn't require complex new models or massive amounts of training data. By focusing on the temporal statistics of semantic features, SemanticMoments provides a more effective and understandable way to analyze video motion, which could improve things like video search, action recognition, and robotics.

Abstract

Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
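Given such descriptors, motion-based retrieval reduces to nearest-neighbor search. The hypothetical sketch below reuses the `temporal_moment_descriptor` helper from the earlier example and ranks a gallery of clips by cosine similarity to a query clip's descriptor; actual feature extraction and the paper's benchmark evaluation are omitted.

```python
# Hypothetical retrieval sketch: rank gallery videos by motion similarity to a query.
# Reuses temporal_moment_descriptor() from the sketch above.
import numpy as np

def rank_by_motion_similarity(query_feats: np.ndarray,
                              gallery_feats: list[np.ndarray]) -> np.ndarray:
    """Return gallery indices sorted from most to least motion-similar."""
    q = temporal_moment_descriptor(query_feats)
    g = np.stack([temporal_moment_descriptor(f) for f in gallery_feats])
    scores = g @ q                  # descriptors are unit-normalized, so dot = cosine
    return np.argsort(-scores)

# Toy example: one query clip and three gallery clips with random stand-in features.
query = np.random.randn(32, 768)
gallery = [np.random.randn(24, 768), np.random.randn(40, 768), np.random.randn(32, 768)]
print(rank_by_motion_similarity(query, gallery))
```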