Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Luozheng Qin, Jia Gong, Qian Qiao, Tianjiao Li, Li Xu, Haoyu Pan, Chao Qu, Zhiyu Tan, Hao Li
2026-04-14
Summary
This paper introduces Uni-ViGU, a new approach to building AI models that can both understand and generate videos and text. Instead of making understanding-focused models also generate content, it builds a generation-focused model and then adds understanding capabilities to it.
What's the problem?
Current AI models that handle both images/videos and text face a basic imbalance: generating videos is far more computationally expensive than understanding them. Most existing systems start from an understanding-focused model and then try to bolt generation on top, which makes it hard to do both tasks equally well. It's like trying to add a powerful engine to a car that wasn't designed for one.
What's the solution?
Uni-ViGU flips this around. It starts with a strong video *generation* model and then equips it with the ability to understand text and video. It uses a technique called 'unified flow' to bridge the continuous nature of video and the discrete nature of text within a single generation process. It also adds lightweight, modality-specific layers (a Mixture-of-Experts design) so the model can generate text without degrading its video-generation skills. Finally, it trains the model in two stages: 'Knowledge Recall' has the model reconstruct its input prompts so it retains the text-video correspondences it already learned, and 'Capability Refinement' fine-tunes on detailed captions so the model picks up fine-grained details.
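To make the 'unified flow' idea concrete, here is a minimal sketch of how continuous and discrete flow matching can share one corruption process. This is an illustrative toy, not the paper's implementation: the linear interpolant for video latents and the mask-based corruption for text tokens are common flow-matching choices assumed here, and all names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def continuous_flow_sample(x1, t, rng):
    # Continuous flow matching for video latents:
    # x_t = (1 - t) * x0 + t * x1 with Gaussian noise x0;
    # the regression target is the velocity x1 - x0.
    x0 = rng.standard_normal(x1.shape)
    xt = (1 - t) * x0 + t * x1
    return xt, x1 - x0

def discrete_flow_sample(tokens, t, mask_id, rng):
    # Discrete flow matching for text: each token is independently
    # replaced by a mask symbol with probability 1 - t; the model is
    # trained to recover the original tokens at masked positions.
    keep = rng.random(tokens.shape) < t
    return np.where(keep, tokens, mask_id), keep

# One joint training step draws a single shared time t and corrupts
# both modalities, so video and text denoise within the same process.
t = rng.uniform()
video_latents = rng.standard_normal((4, 8))   # toy latent grid
text_tokens = rng.integers(0, 100, size=6)    # toy token ids
xt, v_target = continuous_flow_sample(video_latents, t, rng)
noisy_text, kept = discrete_flow_sample(text_tokens, t, 100, rng)
```

A real model would predict `v_target` for the video stream and the masked-out tokens for the text stream from the same Transformer forward pass.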
Why it matters?
This work is important because it shows a more efficient way to build powerful AI models that can work with both videos and text. By starting with generation, it avoids the computational bottlenecks of traditional approaches and opens the door to creating more scalable and capable multimodal AI systems. It suggests that building from a generation base is a promising path forward for AI that can truly understand and create content.
Abstract
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
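The modality-driven MoE idea from the abstract, routing text tokens through lightweight added layers while video tokens keep the pretrained generator's path, can be sketched as follows. This is an assumption-laden illustration, not the paper's architecture: the class name, shapes, and the choice of a low-rank ReLU expert for text are all hypothetical.

```python
import numpy as np

class ModalityRoutedFFN:
    """Toy sketch of a modality-routed feed-forward block: video
    tokens pass through the (notionally frozen) generator weights,
    while text tokens pass through a small added expert, so learning
    to generate text does not overwrite the generative prior."""

    def __init__(self, dim, text_hidden, rng):
        self.w_video = 0.02 * rng.standard_normal((dim, dim))          # pretrained path
        self.w_text_in = 0.02 * rng.standard_normal((dim, text_hidden))  # lightweight
        self.w_text_out = 0.02 * rng.standard_normal((text_hidden, dim))  # text expert

    def __call__(self, x, is_text):
        out = np.zeros_like(x)
        vid = ~is_text
        out[vid] = x[vid] @ self.w_video
        out[is_text] = np.maximum(x[is_text] @ self.w_text_in, 0.0) @ self.w_text_out
        return x + out  # residual keeps both modalities in one block

rng = np.random.default_rng(0)
block = ModalityRoutedFFN(dim=16, text_hidden=4, rng=rng)
x = rng.standard_normal((10, 16))             # 10 tokens in one sequence
is_text = np.array([False] * 6 + [True] * 4)  # last 4 tokens are text
y = block(x, is_text)
```

The routing here is hard (by modality flag) rather than learned, which matches the "modality-driven" framing: the expert choice is determined by the token's modality, not by a gating network.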