Dynamic Concepts Personalization from Single Videos

Rameen Abdal, Or Patashnik, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Sergey Tulyakov, Daniel Cohen-Or, Kfir Aberman

2025-02-21

Summary

This paper introduces a new way to make AI-generated videos more personalized and realistic by teaching the AI to understand both how things look and how they move. It's like teaching a computer not just to recognize a person, but also to capture their unique way of walking or dancing.

What's the problem?

Current AI systems are good at creating images of specific people or things, but they struggle with videos. That's because videos aren't just about how something looks, but also how it moves over time. It's like the difference between a photo and a movie - the movie needs to get the movement right too.

What's the solution?

The researchers created a system called Set-and-Sequence that works in two steps. First, it learns what something looks like from an unordered set of still frames taken from the video. Then, with that appearance knowledge frozen in place, it learns how the thing moves by training on the whole video sequence. Separating appearance from motion this way makes it much easier to edit the videos later.
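The two-stage idea can be sketched in miniature. The code below is a hypothetical toy stand-in, not the paper's implementation: a single linear weight plays the role of the video model, random targets stand in for the appearance and motion training signals, `B @ A` is the "identity LoRA basis" learned on the frame set, and the per-direction coefficients `c` tuned in stage two loosely mimic the motion residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 8, 2, 32                  # feature dim, LoRA rank, toy sample count
lr1, lr2, steps = 0.05, 0.02, 400

W = rng.normal(size=(d, d)) * 0.1   # frozen "pretrained" weight (stand-in)

# Hypothetical stand-ins for the two training signals:
frames = rng.normal(size=(n, d))    # unordered set of frames (appearance stage)
seq    = rng.normal(size=(n, d))    # full video sequence (motion stage)
Y_app  = frames @ (W + 0.3 * rng.normal(size=(d, d)))
Y_mot  = seq    @ (W + 0.3 * rng.normal(size=(d, d)))

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

# --- Stage 1 ("Set"): learn an identity LoRA basis (B, A) on the frame set ---
B = 0.05 * rng.normal(size=(d, r))
A = 0.05 * rng.normal(size=(r, d))
for _ in range(steps):
    pred = frames @ (W + B @ A)
    G = (2.0 / n) * frames.T @ (pred - Y_app)    # gradient w.r.t. the update B @ A
    B, A = B - lr1 * G @ A.T, A - lr1 * B.T @ G  # simultaneous gradient step
stage1_loss = mse(frames @ (W + B @ A), Y_app)

# --- Stage 2 ("Sequence"): freeze (B, A); tune per-direction coefficients c ---
c = np.ones(r)                                   # start from the identity basis
for _ in range(steps):
    M = B @ np.diag(c) @ A                       # re-weighted identity basis
    pred = seq @ (W + M)
    E = (2.0 / n) * seq.T @ (pred - Y_mot)
    grad_c = np.array([B[:, i] @ E @ A[i, :] for i in range(r)])
    c = c - lr2 * grad_c
stage2_loss = mse(seq @ (W + B @ np.diag(c) @ A), Y_mot)
```

The point of the sketch is the ordering: appearance is fit first from an order-free set of frames, and only a small set of coefficients is then adapted on the ordered sequence, so the motion signal cannot overwrite the learned identity.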

Why it matters?

This matters because it could make AI-generated videos much more realistic and customizable. Imagine being able to change someone's clothes in a video while keeping their exact dance moves, or putting a real person into a completely new video scene. This technology could be huge for movies, video games, and even how we interact with social media or virtual reality in the future.

Abstract

Personalizing generative text-to-image models has seen remarkable progress, but extending this personalization to text-to-video models presents unique challenges. Unlike static concepts, personalizing text-to-video models has the potential to capture dynamic concepts, i.e., entities defined not only by their appearance but also by their motion. In this paper, we introduce Set-and-Sequence, a novel framework for personalizing Diffusion Transformers (DiTs)-based generative video models with dynamic concepts. Our approach imposes a spatio-temporal weight space within an architecture that does not explicitly separate spatial and temporal features. This is achieved in two key stages. First, we fine-tune Low-Rank Adaptation (LoRA) layers using an unordered set of frames from the video to learn an identity LoRA basis that represents the appearance, free from temporal interference. In the second stage, with the identity LoRAs frozen, we augment their coefficients with Motion Residuals and fine-tune them on the full video sequence, capturing motion dynamics. Our Set-and-Sequence framework results in a spatio-temporal weight space that effectively embeds dynamic concepts into the video model's output domain, enabling unprecedented editability and compositionality while setting a new benchmark for personalizing dynamic concepts.