V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec

2025-06-18

Summary

This paper introduces V-JEPA 2, an AI model that learns to understand video by watching a huge amount of internet video, then uses that knowledge to anticipate actions, answer questions about videos, and even help robots plan movements.

What's the problem?

The problem is that teaching AI to really understand videos and predict what will happen next usually requires a lot of specially labeled data and task-specific training, which is expensive and slow.

What's the solution?

The researchers used a method called self-supervised learning, where the AI learns to predict missing parts of videos without needing labels. They trained V-JEPA 2 on over a million hours of internet video and then fine-tuned it on a small amount of robot video so it could learn how actions relate to movements. This training lets the model understand motion, anticipate human actions, answer questions about videos, and plan robot movements without needing special training for each task.
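
To make "predicting missing parts without labels" concrete, here is a toy sketch in Python. It is not the authors' code: the function names, the random linear "encoder", the array sizes, and the choice to compare features rather than raw pixels are all assumptions made for illustration. The point it shows is that the training target comes from the hidden part of the video itself, so no human labels are needed.

```python
import numpy as np

def masked_prediction_loss(patches, mask, encode, predict):
    """Mean absolute error between predicted and actual features of hidden patches.

    All names here are hypothetical; this is an illustration, not V-JEPA 2 itself.
    """
    visible = patches[~mask]                  # what the model is allowed to see
    hidden = patches[mask]                    # what it has to guess
    context = encode(visible).mean(axis=0)    # crude summary of the visible part
    targets = encode(hidden)                  # targets come from the data, not from labels
    preds = predict(context, int(mask.sum()))
    return float(np.abs(preds - targets).mean())

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))            # 16 toy "video patches", 8 values each
mask = np.zeros(16, dtype=bool)
mask[rng.choice(16, size=6, replace=False)] = True   # hide 6 of the 16 patches

W = rng.normal(size=(8, 4))                   # stand-in for a learned encoder
encode = lambda x: x @ W
predict = lambda context, n: np.tile(context, (n, 1))  # naive guess: repeat the context

loss = masked_prediction_loss(patches, mask, encode, predict)
print(f"toy masked-prediction loss: {loss:.3f}")
```

In a real system the encoder and predictor would be large neural networks, and training would adjust them to drive this kind of loss down across many videos.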

Why does it matter?

This matters because it shows AI can learn from watching lots of videos on its own and use that knowledge in many helpful ways, like making robots better at doing tasks or improving video understanding systems, without needing tons of costly labeled examples.

Abstract

A self-supervised approach combining internet video data and minimal robot interaction achieves strong performance in motion understanding, action anticipation, video question-answering, and robotic planning without task-specific training or reward.