StyleMaster: Stylize Your Video with Artistic Generation and Translation
Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, Wenhan Luo
2024-12-12

Summary
This paper introduces StyleMaster, a new method for stylizing videos by applying the artistic style of a reference image while keeping the original content intact.
What's the problem?
Current methods for stylizing videos often fail to reflect the intended artistic style accurately, producing videos that look noticeably different from the reference. They can also leak content from the style reference image into the generated video, which undermines the overall effect. Additionally, many existing techniques focus only on the overall (global) style and ignore the local textures that characterize it.
What's the solution?
To address these issues, the authors of StyleMaster built a system that extracts both a global style representation and local texture details from the reference image. Content-related patches of the reference are filtered out based on their similarity to the text prompt, so style features are kept without leaking reference content (see the sketch below). They also trained a lightweight motion adapter on static videos, which bridges the gap between image training and video generation. Together, these components let StyleMaster produce videos that match the reference style while maintaining smooth, coherent motion.
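As a rough illustration of the patch-filtering idea, the sketch below shows one way content-related patches could be dropped based on prompt-patch similarity. It assumes CLIP-style patch embeddings and a content-prompt embedding; the function name, shapes, and keep ratio are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of prompt-patch similarity filtering (not the authors' code).
# Assumes CLIP-style patch embeddings for the reference image and a text embedding
# of the content prompt; names and the keep ratio are hypothetical.
import torch
import torch.nn.functional as F

def filter_style_patches(patch_emb: torch.Tensor,
                         prompt_emb: torch.Tensor,
                         keep_ratio: float = 0.5) -> torch.Tensor:
    """patch_emb: (N, D) reference-image patch features; prompt_emb: (D,) prompt feature.

    Patches whose embeddings align most strongly with the content prompt are treated
    as content-related and discarded; the rest are kept as local style/texture cues.
    """
    sim = F.cosine_similarity(patch_emb, prompt_emb.unsqueeze(0), dim=-1)  # (N,)
    num_keep = max(1, int(keep_ratio * patch_emb.shape[0]))
    keep_idx = torch.argsort(sim)[:num_keep]  # lowest prompt similarity = most style-like
    return patch_emb[keep_idx]                # (num_keep, D) style tokens for injection
```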
Why it matters?
This research is important because it significantly enhances the quality of stylized video generation. By improving how styles are applied to videos, StyleMaster enables creators to produce high-quality content that closely matches their artistic vision. This can be especially useful for filmmakers, artists, and social media creators who want to make their videos more visually appealing and unique.
Abstract
Style control has been popular in video generation models. Existing methods often generate videos far from the given style, cause content leakage, and struggle to transfer one video to the desired style. Our first observation is that the style extraction stage matters, whereas existing methods emphasize global style but ignore local textures. In order to bring texture features while preventing content leakage, we filter content-related patches while retaining style ones based on prompt-patch similarity; for global style extraction, we generate a paired style dataset through model illusion to facilitate contrastive learning, which greatly enhances the absolute style consistency. Moreover, to fill in the image-to-video gap, we train a lightweight motion adapter on still videos, which implicitly enhances stylization extent, and enables our image-trained model to be seamlessly applied to videos. Benefiting from these efforts, our approach, StyleMaster, not only achieves significant improvement in both style resemblance and temporal coherence, but also can easily generalize to video style transfer with a gray tile ControlNet. Extensive experiments and visualizations demonstrate that StyleMaster significantly outperforms competitors, effectively generating high-quality stylized videos that align with textual content and closely resemble the style of reference images. Our project page is at https://zixuan-ye.github.io/stylemaster
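To make the contrastive global-style training mentioned in the abstract more concrete, here is a minimal sketch of an InfoNCE-style objective over paired style images. This is an assumption about how such a contrastive loss could look; the function name, temperature, and batch layout are hypothetical and not taken from the paper.

```python
# Minimal sketch of a contrastive objective for global style features, assuming each
# image pair sharing a style forms a positive and all other in-batch pairs are negatives
# (InfoNCE-style; names and hyperparameters are illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def style_contrastive_loss(feat_a: torch.Tensor,
                           feat_b: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """feat_a, feat_b: (B, D) global style features of the two views of each style pair."""
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    logits = a @ b.t() / temperature              # (B, B) pairwise style similarities
    targets = torch.arange(a.shape[0], device=a.device)
    # Symmetric cross-entropy: each style feature should match its own paired counterpart.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```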