DreamStyle: A Unified Framework for Video Stylization

Mengtian Li, Jinshu Chen, Songtao Zhao, Wanquan Feng, Pengqi Tu, Qian He

2026-01-07

Summary

This paper introduces DreamStyle, a new system for changing the style of videos. It lets you make a video look like it was painted in a style described in text, match the look of a reference style image, or carry the style of a stylized first frame through the rest of the video.

What's the problem?

Currently, methods for stylizing videos are limited because they usually only work with one type of instruction – either a text description of the style, a style image, or just the first frame of the video. Using only one type of instruction isn't ideal because each has its strengths and weaknesses. Also, there aren't many good datasets available for training these systems, which leads to videos that look inconsistent or 'flicker' with unwanted changes in style over time.

What's the solution?

The researchers created DreamStyle, which can handle all three types of style instructions: text, a style image, and a stylized first frame. It's built on an existing image-to-video model, fine-tuned with Low-Rank Adaptation (LoRA) using token-specific up matrices, so the model doesn't confuse the different kinds of instructions. They also designed a data curation pipeline to build a new, high-quality dataset of paired videos for training, which improves style consistency and overall video quality.
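To make the LoRA idea concrete, here is a minimal sketch of a linear layer with token-specific up matrices: a single low-rank down projection is shared, while each condition type (text, style image, first frame) gets its own up matrix, so the adapter updates for one condition type don't interfere with the others. All names, shapes, and the NumPy formulation are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class TokenSpecificLoRA:
    """Sketch of a LoRA linear layer with per-condition-type up matrices.

    The frozen base weight W and the shared down projection A are common
    to all tokens; the up projection B_k is selected per token according
    to its condition type (e.g. 0=text, 1=style image, 2=first frame).
    """

    def __init__(self, d_model=64, rank=4, num_condition_types=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (d_model, d_model))  # frozen base weight
        self.A = rng.normal(0, 0.02, (rank, d_model))     # shared down projection
        # One up matrix per condition type, zero-initialized as in standard LoRA,
        # so the adapted layer starts out identical to the base layer.
        self.B = np.zeros((num_condition_types, d_model, rank))

    def forward(self, x, token_types):
        """x: (seq, d_model) token embeddings; token_types: (seq,) int array
        selecting which up matrix each token is routed through."""
        base = x @ self.W.T
        low_rank = x @ self.A.T  # (seq, rank), shared across condition types
        # Route each token through the up matrix of its condition type.
        delta = np.einsum('sr,sdr->sd', low_rank, self.B[token_types])
        return base + delta
```

Because the up matrices are zero-initialized, the layer's output equals the base model's output before any fine-tuning; training then moves each `B_k` independently, which is the mechanism the authors credit with reducing confusion among the different condition tokens.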

Why it matters?

DreamStyle matters because it's a more versatile and effective way to stylize videos. By supporting all three style conditions in one model and producing videos that are more stylistically consistent and higher quality than prior methods, it opens up more creative control in video editing and generation.

Abstract

Video stylization, an important downstream task of video generation models, has not yet been thoroughly explored. Its input style conditions typically include text, style image, and stylized first frame. Each condition has a characteristic advantage: text is more flexible, style image provides a more accurate visual anchor, and stylized first frame makes long-video stylization feasible. However, existing methods are largely confined to a single type of style condition, which limits their scope of application. Additionally, their lack of high-quality datasets leads to style inconsistency and temporal flicker. To address these limitations, we introduce DreamStyle, a unified framework for video stylization, supporting (1) text-guided, (2) style-image-guided, and (3) first-frame-guided video stylization, accompanied by a well-designed data curation pipeline to acquire high-quality paired video data. DreamStyle is built on a vanilla Image-to-Video (I2V) model and trained using a Low-Rank Adaptation (LoRA) with token-specific up matrices that reduces the confusion among different condition tokens. Both qualitative and quantitative evaluations demonstrate that DreamStyle is competent in all three video stylization tasks, and outperforms the competitors in style consistency and video quality.