RepVideo: Rethinking Cross-Layer Representation for Video Generation
Chenyang Si, Weichen Fan, Zhengyao Lv, Ziqi Huang, Yu Qiao, Ziwei Liu
2025-01-16

Summary
This paper introduces RepVideo, a new way to make AI-generated videos look better and more consistent. It's like teaching a computer to draw moving pictures that make more sense and flow better.
What's the problem?
Current AI systems that generate videos are getting very good, but they have a problem. Inside these models, the way each layer understands and represents the video can vary a lot, and those differences pile up, so neighboring frames end up less connected than they should be. It's like drawing a cartoon but forgetting what the characters looked like from one frame to the next. This makes the videos look choppy or strange, especially when things are moving.
What's the solution?
The researchers created RepVideo, which is like giving the AI a better memory. Instead of relying on what a single layer of the model produces, RepVideo combines features from several neighboring layers into one richer representation and feeds that into the attention mechanism. This helps the model keep a stable sense of what things should look like and how they should move, like giving the AI artist a sketchbook to keep track of what it's drawing. The result is smoother, more coherent videos, with objects and people moving in ways that make more sense.
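To make the idea concrete, here is a minimal sketch, not the authors' actual implementation, of how features from a few neighboring transformer layers could be averaged and blended back into the current layer's features before attention. The module name, window size, and blend weight are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class CrossLayerAggregator(nn.Module):
    """Sketch: average the hidden states of the last few layers and blend
    the result with the current layer's features before attention.
    Window size and blend weight are assumptions, not from the paper."""

    def __init__(self, hidden_dim: int, window: int = 4, blend: float = 0.5):
        super().__init__()
        self.window = window
        self.blend = blend
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, layer_features: list[torch.Tensor]) -> torch.Tensor:
        # layer_features: hidden states from earlier layers, each of shape
        # (batch, tokens, hidden_dim); the last entry is the current layer.
        recent = layer_features[-self.window:]
        aggregated = torch.stack(recent, dim=0).mean(dim=0)  # more stable semantics
        current = layer_features[-1]
        # Blend the aggregated representation with the current features,
        # then normalize before handing the result to the attention block.
        return self.norm(self.blend * aggregated + (1.0 - self.blend) * current)


# Usage sketch with random tensors standing in for per-layer hidden states.
if __name__ == "__main__":
    agg = CrossLayerAggregator(hidden_dim=64, window=3)
    feats = [torch.randn(2, 16, 64) for _ in range(6)]
    enriched = agg(feats)
    print(enriched.shape)  # torch.Size([2, 16, 64])
```

In a real diffusion transformer block, the enriched output would replace the raw hidden state as the input to self-attention, which is the role the paper assigns to its accumulated representations.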
Why does it matter?
This matters because as we use AI more and more to create videos for things like movies, video games, or even educational content, we want those videos to look as good and natural as possible. RepVideo could help make AI-generated videos that are more enjoyable to watch and easier to understand. It could also help AI better understand and work with real videos, which could be useful for things like video editing or creating special effects. In the bigger picture, this research helps us understand how to make AI think more consistently, which could be useful in many other areas beyond just making videos.
Abstract
Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insights into the direct impact of representations on the video generation process. In this paper, we initially investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, thereby improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
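One rough way to reproduce the kind of analysis described in the abstract is to measure how similar the features of adjacent frames are at each layer. The sketch below, with made-up tensor shapes standing in for real model activations, computes per-layer cosine similarity between neighboring frames, which is the quantity the paper argues degrades as layer-to-layer variation accumulates.

```python
import torch
import torch.nn.functional as F

def adjacent_frame_similarity(features: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between features of adjacent frames.

    features: (layers, frames, tokens, dim) -- hypothetical per-layer,
    per-frame hidden states collected from a video diffusion model.
    Returns a (layers,) tensor; lower values indicate weaker temporal
    coherence at that layer.
    """
    a = features[:, :-1]                      # frames t
    b = features[:, 1:]                       # frames t + 1
    sim = F.cosine_similarity(a, b, dim=-1)   # (layers, frames - 1, tokens)
    return sim.mean(dim=(1, 2))

# Example with random tensors in place of real activations.
feats = torch.randn(12, 8, 256, 64)  # 12 layers, 8 frames, 256 tokens, dim 64
print(adjacent_frame_similarity(feats))
```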