CineScale: Free Lunch in High-Resolution Cinematic Visual Generation
Haonan Qiu, Ning Yu, Ziqi Huang, Paul Debevec, Ziwei Liu
2025-08-27
Summary
This paper introduces CineScale, a new method for creating higher-resolution images and videos using existing AI models without needing to retrain them extensively.
What's the problem?
Current AI models for generating images and videos struggle to create high-quality content at high resolutions like 4k or 8k. This is because they're usually trained on smaller images and videos, and when they try to create something bigger, they start to repeat patterns and lose detail. Essentially, the more detail you ask for, the more errors accumulate, leading to a blurry or repetitive final product.
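The "repetitive patterns" failure mode has a simple spectral signature: when content is effectively copied to fill a canvas larger than the training resolution, its frequency spectrum collapses onto a few harmonics instead of spreading naturally. This toy 1-D NumPy sketch is not from the paper; it just illustrates, under that simplified framing, what tiling does to a signal's spectrum:

```python
import numpy as np

rng = np.random.default_rng(0)
patch = rng.standard_normal(64)   # stand-in for content at the training resolution
tiled = np.tile(patch, 4)         # naive 4x extension -> a repetitive pattern

spec = np.abs(np.fft.rfft(tiled))
idx = np.arange(len(spec))
# a signal repeated k times has spectral energy only at multiples of k
harmonic = spec[idx % 4 == 0].sum()
nonharmonic = spec[idx % 4 != 0].sum()
print(harmonic > 1e6 * nonharmonic)  # -> True: energy collapses onto harmonics
```

A natural high-resolution image would instead have energy spread across many frequencies; methods like CineScale aim to let the model synthesize that genuinely new high-frequency detail rather than fall back on repetition.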
What's the solution?
CineScale is a new way of *using* these existing models, rather than changing the models themselves. It's an inference-time technique designed to handle the extra detail needed for high-resolution images and videos while preventing those repetitive patterns. The researchers created dedicated variants of CineScale tailored to the two main types of video generation architectures, and it can also create videos from images (I2V) and videos from other videos (V2V). Retraining the models from scratch wasn't necessary: images up to 8k need no extra training at all, and 4k videos need only a small amount of lightweight fine-tuning called LoRA.
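LoRA, the lightweight fine-tuning mentioned above, keeps the pretrained weights frozen and trains only a small low-rank update. This is a generic NumPy sketch of the LoRA idea, not CineScale's actual implementation; the layer sizes and rank are hypothetical:

```python
import numpy as np

d_out, d_in, r = 512, 512, 8             # hypothetical layer size and LoRA rank
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))   # frozen pretrained weight

A = rng.standard_normal((r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))                   # B starts at zero, so the adapter is a no-op
alpha = 16.0                               # standard LoRA scaling hyperparameter

def lora_forward(x):
    # base path plus scaled low-rank update; only A and B would be trained
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)      # identical to the base model at init
ratio = (A.size + B.size) / W.size
print(f"trainable fraction: {ratio:.4f}")        # ~3% of a full fine-tune
```

Because only `A` and `B` are updated, adapting a large video model to 4k output touches a few percent of the parameters, which is why the paper can describe the extra training as minimal.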
Why it matters?
This work is important because it allows us to generate incredibly detailed images and videos – up to 8k resolution for images and 4k for videos – without the huge cost of retraining complex AI models. This opens up possibilities for creating realistic visual content more easily and efficiently, potentially impacting fields like filmmaking, gaming, and design.
Abstract
Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions. Recent efforts have explored tuning-free strategies to unlock the untapped potential of pre-trained models for higher-resolution visual generation. However, these methods are still prone to producing low-quality visual content with repetitive patterns. The key obstacle lies in the inevitable increase in high-frequency information when the model generates visual content exceeding its training resolution, leading to undesirable repetitive patterns that derive from the accumulated errors. In this work, we propose CineScale, a novel inference paradigm to enable higher-resolution visual generation. To tackle the various issues introduced by the two types of video generation architectures, we propose dedicated variants tailored to each. Unlike existing baseline methods that are confined to high-resolution T2I and T2V generation, CineScale broadens the scope by enabling high-resolution I2V and V2V synthesis, built atop state-of-the-art open-source video generation frameworks. Extensive experiments validate the superiority of our paradigm in extending the capabilities of higher-resolution visual generation for both image and video models. Remarkably, our approach enables 8k image generation without any fine-tuning, and achieves 4k video generation with only minimal LoRA fine-tuning. Generated video samples are available at our website: https://eyeline-labs.github.io/CineScale/.