SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Edoardo Bianchi, Antonio Liotta

2025-05-14


Summary

This paper introduces SkillFormer, a parameter-efficient AI system that watches videos of a task recorded from different camera angles and estimates how skilled the person performing it is.

What's the problem?

It is hard for computers to judge fairly and accurately how well someone performs a task, especially when the videos come from different viewpoints, such as the person's own (egocentric) perspective and an outside (exocentric) camera.

What's the solution?

The researchers created SkillFormer, which uses the TimeSformer video transformer as its backbone and adds a dedicated module called CrossViewFusion. This setup lets the model interpret and merge information from multiple video angles, so it can make better judgments about a person's skill level.
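
To make the idea of merging views concrete, here is a minimal sketch of one common way to fuse per-view features: an attention-weighted average. It is not the paper's CrossViewFusion module, whose internals this summary does not describe. The function names (softmax, fuse_views), the query vector, and the toy feature values are all assumptions made for illustration, and in a real system the per-view features would come from a video backbone such as TimeSformer.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views(view_features, query):
    """Hypothetical fusion step: attention-weighted average of view features.

    view_features: list of equal-length feature vectors, one per camera view.
    query: vector of the same length used to score each view (an assumption;
           in a trained model this would be a learned parameter).
    Returns the fused feature vector and the per-view weights.
    """
    dim = len(query)
    # Scaled dot-product score between each view's features and the query.
    scores = [
        sum(f * q for f, q in zip(features, query)) / math.sqrt(dim)
        for features in view_features
    ]
    weights = softmax(scores)
    # Weighted sum over views, computed one feature dimension at a time.
    fused = [
        sum(w * features[d] for w, features in zip(weights, view_features))
        for d in range(dim)
    ]
    return fused, weights


# Toy features standing in for backbone outputs (made-up values).
ego_features = [1.0, 0.0]   # egocentric (first-person) view
exo_features = [0.0, 1.0]   # exocentric (outside camera) view
fused, weights = fuse_views([ego_features, exo_features], query=[1.0, 1.0])
print(fused, weights)
```

In this sketch a view whose features align more closely with the query receives a larger weight, so the fused vector leans toward the more informative camera. A classifier over proficiency levels would then operate on the fused vector.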

Why does it matter?

This matters because it could help in areas like sports coaching, job training, or even medical procedures, where it's useful to have an unbiased and accurate way to measure how well someone is performing a skill.

Abstract

SkillFormer, a parameter-efficient architecture, uses the TimeSformer backbone with a CrossViewFusion module to achieve state-of-the-art accuracy in multi-view skill assessment from egocentric and exocentric videos.
