PickStyle: Video-to-Video Style Transfer with Context-Style Adapters
Soroush Mehraban, Vida Adeli, Jacob Rommann, Babak Taati, Kyryl Truskovskyi
2025-10-10
Summary
This paper tackles changing the visual style of videos, for example making footage look like a painting or match the look of another film, while keeping the original content intact. It uses diffusion models, a type of AI that excels at generating realistic images and videos.
What's the problem?
The biggest challenge is that training these AI models normally requires many example videos showing the same content in both the 'before' and 'after' styles, and such paired video datasets are very hard to find or create. The model needs to learn how to change a video's style without losing the original content, which is difficult to teach without direct examples of how the result *should* look.
What's the solution?
The researchers developed a system called PickStyle. Instead of paired videos, it trains on pairs of still images showing the same content in a source and a target style. They insert small, low-rank adapter modules into the model's attention layers so it can specialize in style changes while keeping the video's original motion and content intact. They also turn the still-image pairs into training clips by applying shared augmentations that simulate camera movement, giving the model a sense of how video changes over time. Finally, they improve how the model follows style instructions by guiding it along separate style and content directions, so applying a style does not alter the video's core content.
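The camera-motion trick above can be sketched in a few lines: slide a crop window across a still image to produce a short pseudo-video. This is a minimal illustration, not the paper's implementation; the function name `image_to_clip`, the linear-pan trajectory, and all parameter values are assumptions. The key property is that applying the *same* trajectory (same seed) to both images of a source/target pair keeps the simulated motion aligned across the pair.

```python
import numpy as np

def image_to_clip(image, num_frames=8, crop_size=(224, 224), max_shift=16, seed=0):
    """Turn a still image into a pseudo-video clip by sliding a crop
    window across it, mimicking a smooth camera pan (illustrative sketch).

    `image` is an (H, W, C) array. Using the same seed for the source-style
    and target-style image of a pair gives two temporally aligned clips.
    """
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    ch, cw = crop_size
    # Random start position, leaving room for the full pan.
    y0 = int(rng.integers(0, h - ch - max_shift + 1))
    x0 = int(rng.integers(0, w - cw - max_shift + 1))
    # Linear pan: evenly spaced diagonal offsets over the clip.
    offsets = np.linspace(0, max_shift, num_frames).astype(int)
    frames = [image[y0 + d : y0 + d + ch, x0 + d : x0 + d + cw] for d in offsets]
    return np.stack(frames)  # (num_frames, ch, cw, C)

src = np.zeros((512, 512, 3), dtype=np.uint8)
clip = image_to_clip(src, num_frames=8)
print(clip.shape)  # (8, 224, 224, 3)
```

Richer trajectories (zoom, rotation, jitter) would follow the same pattern: sample one trajectory, apply it identically to both images of the pair.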
Why it matters?
This work is important because it makes video style transfer much more practical. By reducing the need for huge paired video datasets, it opens the door to creating more tools for video editing, special effects, and artistic expression. It allows anyone to easily change the look of videos without needing extensive resources or technical expertise, and the results are high-quality and realistic.
Abstract
We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
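The factorized guidance described in the abstract can be sketched as follows. This is a plausible reading of CS-CFG, not the paper's exact formulation: it mirrors the dual-direction guidance used in instruction-based image editing, with one weighted direction for the video context and another for the text style. The function name `cs_cfg`, the stand-in `denoise` callable, and the weight values are assumptions.

```python
import numpy as np

def cs_cfg(denoise, x_t, t, context, style, w_ctx=1.5, w_style=7.5):
    """Context-Style Classifier-Free Guidance (illustrative sketch).

    Standard CFG uses a single guidance direction between conditional and
    unconditional predictions. CS-CFG factorizes guidance into a video
    (context) direction and a text (style) direction, each with its own
    weight, so context preservation and style strength can be tuned
    independently. `denoise` stands in for the diffusion model's noise
    predictor, taking optional context (video) and style (text) conditions.
    """
    eps_uncond = denoise(x_t, t, context=None, style=None)      # fully unconditional
    eps_ctx = denoise(x_t, t, context=context, style=None)      # video context only
    eps_full = denoise(x_t, t, context=context, style=style)    # context + style
    return (eps_uncond
            + w_ctx * (eps_ctx - eps_uncond)     # pull toward the input video's content
            + w_style * (eps_full - eps_ctx))    # pull toward the text-specified style
```

With both weights set to 1 the expression telescopes back to the plain conditional prediction; raising `w_ctx` pushes the sample toward the input video's content, while raising `w_style` strengthens the text-specified style.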