
UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

Houyuan Chen, Hong Li, Xianghao Kong, Tianrui Zhu, Shaocong Xu, Weiqing Xiao, Yuwei Guo, Chongjie Ye, Lvmin Zhang, Hao Zhao, Anyi Rao

2026-05-04


Summary

This paper introduces UniVidX, a new system for creating and manipulating videos using artificial intelligence. It's designed to handle different types of video-related tasks in a flexible way, going beyond previous methods that needed a separate AI model for each specific job.

What's the problem?

Existing AI models for video creation are usually built for one specific task, like changing the color of an object or adding special effects. This means that if you want to do several things with a video, you need several separate models. Because each model is trained in isolation, these systems also struggle to capture how different properties of the same video relate to one another, such as how an object's underlying color interacts with the lighting to produce what you actually see. On top of that, each specialized model typically needs a lot of training data.

What's the solution?

UniVidX addresses this with a single, unified framework built on top of an existing video diffusion model, so one system can handle many different video tasks. It relies on three key ideas. First, during training it randomly hides some of the video's modalities and asks the model to generate them from the rest, so the system learns to generate in any direction instead of following one fixed mapping. Second, it adds small, per-modality adjustments (lightweight adapter modules) that switch on only when a modality is being generated, which keeps the core model's abilities intact. Third, it lets the different modalities 'talk' to each other while the video is being created, so everything stays consistent. The authors demonstrate the framework in two versions: one that pairs realistic videos with information about lighting, materials, and surface orientation, and another that separates a blended video into its individual transparent layers. A rough code sketch of the random-masking idea follows below.
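To make the random-masking idea concrete, here is a minimal Python sketch of how a training step could split modalities into clean conditions and noisy targets, in the spirit of the paper's Stochastic Condition Masking. The function names, the noise model, and the sampling rule are illustrative assumptions, not the authors' actual implementation.

```python
import random
import torch

def add_noise(latent, sigma=1.0):
    # Illustrative forward-diffusion step: mix the latent with Gaussian noise.
    return latent + sigma * torch.randn_like(latent)

def stochastic_condition_mask(modalities, p_condition=0.5):
    # modalities: dict mapping a name (e.g. 'rgb', 'albedo', 'normal') to its
    # latent video tensor. Each modality is independently kept clean as a
    # condition with probability p_condition, otherwise it is noised and
    # treated as a generation target.
    conditions, targets = {}, {}
    for name, latent in modalities.items():
        if random.random() < p_condition:
            conditions[name] = latent            # clean: serves as a condition
        else:
            targets[name] = add_noise(latent)    # noised: model must denoise it
    # Keep at least one target so every training step still trains the denoiser.
    if not targets and conditions:
        name, latent = conditions.popitem()
        targets[name] = add_noise(latent)
    return conditions, targets
```

Over many training steps this exposes the model to every possible condition/target split, which is what lets one model handle any direction of conditioning between the modalities at inference time.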

Why it matters?

This research matters because it makes video editing and generation much more versatile and efficient. Instead of needing a separate AI model for every task, a single system can cover many of them. It is also data-efficient: the authors report strong results even when training on fewer than 1,000 videos, which makes the approach more accessible. This could lead to new tools for filmmakers, artists, and anyone who wants to create or modify videos.

Abstract

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem setting, which fixes the input-output mapping and limits the modeling of correlations across modalities. We present UniVidX, a unified multimodal framework that leverages VDM priors for versatile video generation. UniVidX formulates pixel-aligned tasks as conditional generation in a shared multimodal space, adapts to modality-specific distributions while preserving the backbone's native priors, and promotes cross-modal consistency during synthesis. It is built on three key designs. Stochastic Condition Masking (SCM) randomly partitions modalities into clean conditions and noisy targets during training, enabling omni-directional conditional generation instead of fixed mappings. Decoupled Gated LoRA (DGL) introduces per-modality LoRAs that are activated when a modality serves as the generation target, preserving the strong priors of the VDM. Cross-Modal Self-Attention (CMSA) shares keys and values across modalities while keeping modality-specific queries, facilitating information exchange and inter-modal alignment. We instantiate UniVidX in two domains: UniVid-Intrinsic, for RGB videos and intrinsic maps including albedo, irradiance, and normal; and UniVid-Alpha, for blended RGB videos and their constituent RGBA layers. Experiments show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1,000 videos. Project page: https://houyuanchen111.github.io/UniVidX.github.io/
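As a rough illustration of the Cross-Modal Self-Attention design named in the abstract (keys and values shared across modalities, queries kept modality-specific), below is a simplified single-head PyTorch sketch. The tensor shapes, shared projection matrices, and function names are assumptions made for illustration; the attention layers inside the actual video diffusion backbone will differ in detail.

```python
import torch
import torch.nn.functional as F

def cross_modal_self_attention(tokens, wq, wk, wv):
    # tokens: dict mapping modality name -> (batch, seq_len, dim) tensor.
    # Keys and values are pooled across every modality, while each modality
    # keeps its own queries, so information can flow between modalities.
    all_tokens = torch.cat(list(tokens.values()), dim=1)   # (batch, total_len, dim)
    keys = all_tokens @ wk
    values = all_tokens @ wv
    scale = keys.shape[-1] ** 0.5

    outputs = {}
    for name, x in tokens.items():
        queries = x @ wq                                    # modality-specific queries
        attn = F.softmax(queries @ keys.transpose(-1, -2) / scale, dim=-1)
        outputs[name] = attn @ values                       # attend over all modalities
    return outputs

# Tiny usage example: three modalities, batch of 2, 16 tokens each, 64-dim features.
dim = 64
wq, wk, wv = (torch.randn(dim, dim) for _ in range(3))
tokens = {m: torch.randn(2, 16, dim) for m in ("rgb", "albedo", "normal")}
out = cross_modal_self_attention(tokens, wq, wk, wv)        # same shapes as the inputs
```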