UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Shian Du, Menghan Xia, Chang Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji
2025-10-10
Summary
This paper introduces UniMMVSR, a system that improves video quality by increasing resolution while following guidance from several kinds of input at once: text, images, and other videos.
What's the problem?
Generating high-resolution videos directly with large AI models is computationally demanding, so a common workaround is to generate a low-resolution video first and then upscale it. However, existing upscaling methods rely only on text prompts to guide the process, ignoring other useful inputs such as reference images or videos, which are important for keeping the result realistic and faithful when a video is generated from multiple kinds of guidance.
What's the solution?
The researchers developed UniMMVSR, a system built on a latent video diffusion model that can take guidance from text, images, and videos at the same time. They systematically explored how to inject each type of condition into the upscaling process, how to construct the training data, and how to train the model so that it actually makes use of all the available information. In short, they built an upscaler that understands and responds to multiple kinds of creative direction.
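To make the idea concrete, here is a minimal sketch (in PyTorch, not the authors' code) of how hybrid-modal conditions might be injected into a single denoiser block of a latent video super-resolution model: the low-resolution video latent is fused by channel concatenation, while text and reference image/video tokens enter through cross-attention. All class names, shapes, and the specific fusion choices are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the paper's implementation) of hybrid-modal
# condition injection for a latent video super-resolution denoiser block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HybridCondDenoiserBlock(nn.Module):
    """Toy denoiser block: the noisy high-res latent is fused with the upsampled
    low-res video latent by channel concatenation, while text and reference
    image/video tokens are injected through cross-attention."""

    def __init__(self, latent_ch=4, width=128, text_dim=768, ref_dim=768, heads=4):
        super().__init__()
        # Channel concat: noisy latent + upsampled low-res latent.
        self.in_proj = nn.Conv3d(latent_ch * 2, width, kernel_size=1)
        self.norm = nn.LayerNorm(width)
        # Cross-attention over the concatenated text / reference tokens.
        self.cross_attn = nn.MultiheadAttention(width, heads, batch_first=True)
        self.text_proj = nn.Linear(text_dim, width)
        self.ref_proj = nn.Linear(ref_dim, width)
        self.out_proj = nn.Conv3d(width, latent_ch, kernel_size=1)

    def forward(self, noisy_latent, lowres_latent, text_tokens, ref_tokens):
        # noisy_latent:  (B, C, T, H, W)   high-res latent being denoised
        # lowres_latent: (B, C, T, h, w)   latent of the low-res video condition
        # text_tokens:   (B, Lt, text_dim) prompt embeddings
        # ref_tokens:    (B, Lr, ref_dim)  tokens from reference images/videos
        B, C, T, H, W = noisy_latent.shape
        lowres_up = F.interpolate(lowres_latent, size=(T, H, W), mode="trilinear")
        x = self.in_proj(torch.cat([noisy_latent, lowres_up], dim=1))

        # Flatten spatio-temporal positions into a token sequence for attention.
        tokens = x.flatten(2).transpose(1, 2)                      # (B, T*H*W, width)
        context = torch.cat([self.text_proj(text_tokens),
                             self.ref_proj(ref_tokens)], dim=1)    # (B, Lt+Lr, width)
        attn_out, _ = self.cross_attn(self.norm(tokens), context, context)
        tokens = tokens + attn_out

        x = tokens.transpose(1, 2).reshape(B, -1, T, H, W)
        return self.out_proj(x)                                    # predicted noise


# Smoke test with tiny tensors.
block = HybridCondDenoiserBlock()
eps = block(torch.randn(1, 4, 4, 16, 16),   # noisy high-res latent
            torch.randn(1, 4, 4, 8, 8),     # low-res video latent
            torch.randn(1, 10, 768),        # text tokens
            torch.randn(1, 6, 768))         # reference image/video tokens
print(eps.shape)  # torch.Size([1, 4, 4, 16, 16])
```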
Why it matters?
This work is significant because it allows for the creation of much higher quality videos (even up to 4K resolution) that are more faithful to the original intent, guided by a wider range of inputs than previously possible. It opens the door to more controlled and detailed video generation, moving beyond simply typing a text prompt to create a video.
Abstract
Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
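For context, the cascaded setup the abstract refers to can be pictured as a three-stage pipeline: a base model generates a low-resolution video from the multi-modal conditions, the super-resolution model upscales it while reusing the same conditions, and a decoder maps the result back to pixels. The sketch below is a hypothetical outline under that assumption, not the paper's code; base_model, vsr_model, vae_decoder, and their methods are assumed interfaces.

```python
# Hypothetical outline of the cascaded multi-modal generation pipeline.
# BaseMultiModalModel / MultiModalVSR interfaces are assumptions for illustration.
def generate_4k_video(prompt, ref_images, ref_videos,
                      base_model, vsr_model, vae_decoder):
    # Stage 1: cheap low-resolution generation with the large foundation model.
    lowres_latent = base_model.sample(text=prompt,
                                      images=ref_images,
                                      videos=ref_videos)       # low-res latent video

    # Stage 2: generative super-resolution conditioned on the low-res result
    # *and* the original text/image/video conditions, preserving fidelity to
    # the multi-modal guidance while adding detail.
    highres_latent = vsr_model.sample(lowres_video=lowres_latent,
                                      text=prompt,
                                      images=ref_images,
                                      videos=ref_videos)       # high-res (e.g. 4K) latent

    # Stage 3: decode the latent back to pixel space.
    return vae_decoder.decode(highres_latent)
```

The key design point the abstract highlights is that the second stage sees the same text, image, and video conditions as the first, rather than text alone, which is what allows the upscaled result to stay consistent with all of the original guidance.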