VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
2025-02-05

Summary
This paper introduces VideoJAM, a new way to make AI-generated videos look more realistic by improving how objects and people move. It's like teaching a computer to understand not just what things look like, but also how they should move naturally.
What's the problem?
Current AI models that create videos are really good at making things look realistic, but they often mess up when it comes to movement. For example, they might show a person walking in a weird way or objects moving in ways that don't make sense in the real world. This happens because these models focus too much on making each frame look good individually, without considering how things should move from one frame to the next.
What's the solution?
The researchers created VideoJAM, which does two main things. First, during training it teaches the AI to learn how things look and how they move from a single shared representation. Second, while generating a video it uses something called Inner-Guidance, which lets the model use its own motion predictions to steer itself toward more natural movement, kind of like having a built-in coach. The cool thing is that VideoJAM can be added to existing video-generation models with only minor changes, without extra training data or a bigger model.
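To make the first idea a bit more concrete, here is a minimal, PyTorch-style sketch of a training loss in which one shared model produces both an appearance prediction and a motion prediction. The model interface, the optical-flow target, and the loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): a diffusion-style
# training loss where one shared backbone yields two predictions,
# appearance (the usual noise target) and motion (e.g. optical flow).
import torch
import torch.nn.functional as F


def joint_appearance_motion_loss(model, video, flow_target, motion_weight=1.0):
    """`model` is a hypothetical network returning (noise_pred, motion_pred)
    from a single internal representation of the noisy video; `video` and
    `flow_target` are (B, C, T, H, W) tensors."""
    noise = torch.randn_like(video)
    # Per-sample noise level in (0, 1); a real noise scheduler would supply this.
    alpha = torch.rand(video.shape[0], 1, 1, 1, 1, device=video.device)
    noisy_video = alpha.sqrt() * video + (1 - alpha).sqrt() * noise

    noise_pred, motion_pred = model(noisy_video, alpha)

    appearance_loss = F.mse_loss(noise_pred, noise)      # standard reconstruction term
    motion_loss = F.mse_loss(motion_pred, flow_target)   # added motion term
    return appearance_loss + motion_weight * motion_loss
```

In words: the usual denoising loss is kept, and a second term asks the same internal representation to also explain the motion, so appearance and motion are not learned in isolation.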
Why does it matter?
This matters because it could make AI-generated videos look much more realistic and natural, which is important for things like special effects in movies, educational videos, or helping designers visualize their ideas. By making motion more lifelike, VideoJAM could open up new creative and practical uses for AI video and change how videos are made in the future.
Abstract
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
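For intuition, the sketch below shows one way an Inner-Guidance-style sampling step could be wired up: the model's own motion prediction from the current step is fed back as an additional condition, and the differently conditioned predictions are combined with guidance weights, loosely following the classifier-free guidance pattern. The `model(x, t, text=..., motion=...)` interface and the weights are assumptions; the paper derives its own guidance formulation for this setting.

```python
# Hedged sketch of an Inner-Guidance-style denoising step (assumed interface,
# not the paper's exact derivation): the model's own motion prediction is fed
# back as an extra condition, and the predictions are combined CFG-style.
import torch


@torch.no_grad()
def inner_guidance_step(model, x_t, t, text_emb, w_text=7.5, w_motion=2.0):
    # Unconditional pass also yields the model's current motion prediction.
    eps_uncond, motion_pred = model(x_t, t, text=None, motion=None)
    # Text-conditioned pass, as in standard classifier-free guidance.
    eps_text, _ = model(x_t, t, text=text_emb, motion=None)
    # Feed the model's own evolving motion prediction back as a condition.
    eps_joint, _ = model(x_t, t, text=text_emb, motion=motion_pred)

    # Weighted composition of the denoising directions (weights are illustrative).
    return eps_uncond + w_text * (eps_text - eps_uncond) + w_motion * (eps_joint - eps_text)
```

Because the motion condition comes from the model itself rather than from an external signal, the guidance can adapt at every denoising step as the motion prediction evolves.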