Enhance-A-Video: Better Generated Video for Free
Yang Luo, Xuanlei Zhao, Mengzhao Chen, Kaipeng Zhang, Wenqi Shao, Kai Wang, Zhangyang Wang, Yang You
2025-02-12
Summary
This paper introduces Enhance-A-Video, a new method that improves AI-generated videos without retraining the underlying models. It focuses on making videos look better and more consistent from frame to frame.
What's the problem?
AI models that generate videos have become very capable, but they sometimes struggle to produce videos that look smooth and consistent throughout. Consecutive frames may not match up well, which can make the video look choppy or unrealistic. On top of that, improving these models usually requires a lot of time and computing power to retrain them.
What's the solution?
The researchers developed Enhance-A-Video, which works by adjusting how the model's attention relates different frames of the video to one another over time. It requires no retraining, so it is quick and easy to apply. The method strengthens the connections between each frame and the frames before and after it, producing a more coherent and visually appealing result.
Why it matters?
This matters because it could make AI-generated videos look much better without the expense and time of retraining AI models. It could be plugged into various AI video creation tools, helping them produce smoother, more realistic videos for uses like special effects, virtual reality, or educational content. By making it easier to enhance AI-generated videos, this research could encourage wider use of AI in video production across many industries.
Abstract
DiT-based video generation has achieved remarkable results, but research into enhancing existing models remains relatively unexplored. In this work, we introduce a training-free approach to enhance the coherence and quality of DiT-based generated videos, named Enhance-A-Video. The core idea is enhancing the cross-frame correlations based on non-diagonal temporal attention distributions. Thanks to its simple design, our approach can be easily applied to most DiT-based video generation frameworks without any retraining or fine-tuning. Across various DiT-based video generation models, our approach demonstrates promising improvements in both temporal consistency and visual quality. We hope this research can inspire future explorations in video generation enhancement.
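To make the core idea from the abstract more concrete, here is a minimal, hypothetical sketch of what "enhancing cross-frame correlations based on non-diagonal temporal attention distributions" could look like inside a temporal attention layer. The function name enhanced_temporal_attention and the enhance_temperature parameter are illustrative assumptions, not the paper's actual API; the real implementation, normalization, and hyperparameters may differ.

import torch

def enhanced_temporal_attention(q, k, v, enhance_temperature=1.0):
    """Sketch of the Enhance-A-Video idea (assumed details).

    q, k, v: (batch, heads, frames, dim) tensors from a temporal
    attention layer of a DiT-based video model.
    """
    scale = q.shape[-1] ** -0.5
    # Temporal attention map across frames: (batch, heads, frames, frames).
    attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)

    # Cross-frame intensity: average of the non-diagonal attention weights,
    # i.e. how strongly each frame attends to the *other* frames.
    num_frames = attn.shape[-1]
    off_diag_mask = ~torch.eye(num_frames, dtype=torch.bool, device=attn.device)
    cross_frame_intensity = attn[..., off_diag_mask].mean()

    # Hypothetical enhancement factor that boosts the temporal attention
    # output when cross-frame correlations are present; training-free,
    # since it only rescales activations at inference time.
    enhance = 1.0 + enhance_temperature * cross_frame_intensity

    return enhance * (attn @ v)

Because this only rescales the temporal attention output at inference time, it can in principle be dropped into an existing DiT pipeline without touching model weights, which is the training-free property the abstract emphasizes.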