UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei

2025-11-14

Summary

This paper introduces UniVA, a new system designed to handle complex video tasks by combining different AI capabilities into a single, flexible framework.

What's the problem?

Current AI models tend to be very good at just *one* video task, like generating clips or understanding what's happening in them. But real-world video projects usually chain several of these tasks in a specific order: understanding, editing, adding elements, and so on. Getting these specialized tools to work together smoothly and automatically is difficult, and existing approaches are clunky, with little support for the back-and-forth changes that real editing involves.

What's the solution?

The researchers created UniVA, which uses a 'Plan-and-Act' design. Think of it as a project manager (the planner agent) directing a team of workers (the executor agents): the planner figures out what the user wants, breaks it into steps, and hands each step to an executor, which carries it out with the appropriate AI tool. UniVA also has a memory system that tracks what has been done and what the user prefers, keeping everything consistent throughout the process. The result is complex, interactive video editing where you can give instructions, make changes, and the system remembers the full history.
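To make the Plan-and-Act idea concrete, here is a minimal sketch of how such a loop could be wired up. Everything here is illustrative: the class names (PlannerAgent, ExecutorAgent, Memory), the tool names, and the hard-coded plan are assumptions made for the sketch, not UniVA's actual API.

```python
# Minimal, hypothetical Plan-and-Act loop. All names are illustrative
# placeholders, not UniVA's real interface.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Hierarchical memory: global knowledge, task context, user preferences."""
    global_knowledge: dict = field(default_factory=dict)
    task_context: list = field(default_factory=list)   # step-by-step history
    user_preferences: dict = field(default_factory=dict)


class PlannerAgent:
    """Interprets the user request and decomposes it into ordered steps."""
    def plan(self, request: str, memory: Memory) -> list[dict]:
        # In a real system this would be an LLM call that can also consult
        # memory; here we hard-code one plausible plan for the sketch.
        return [
            {"tool": "understand", "args": {"query": request}},
            {"tool": "segment", "args": {"target": "main object"}},
            {"tool": "edit", "args": {"instruction": request}},
        ]


class ExecutorAgent:
    """Dispatches each planned step to a modular tool server."""
    def __init__(self, tool_servers: dict):
        self.tool_servers = tool_servers

    def act(self, step: dict, memory: Memory):
        result = self.tool_servers[step["tool"]](**step["args"])
        # Logging every step and result is what makes the session traceable.
        memory.task_context.append({"step": step, "result": result})
        return result


def run_workflow(request: str, planner, executor, memory: Memory):
    """Plan once, then execute each step; memory keeps the session consistent."""
    for step in planner.plan(request, memory):
        executor.act(step, memory)
    return memory.task_context


# Stub tool servers standing in for real analysis/segmentation/editing models.
tools = {
    "understand": lambda query: f"scene description for: {query}",
    "segment": lambda target: f"mask for: {target}",
    "edit": lambda instruction: f"edited video per: {instruction}",
}

memory = Memory(user_preferences={"style": "cinematic"})
trace = run_workflow("replace the car with a bicycle",
                     PlannerAgent(), ExecutorAgent(tools), memory)
for entry in trace:
    print(entry["step"]["tool"], "->", entry["result"])
```

Because every step and its result land in the task context, later instructions ("actually, make the bicycle red") can be planned against the full history rather than starting from scratch.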

Why it matters?

UniVA is important because it moves us closer to AI that can truly *work* with video in a way that's similar to how humans do. It's not just about generating a video from scratch, but about being able to edit, refine, and build upon existing videos in a smart and interactive way. The researchers also released UniVA and a set of tests (UniVA-Bench) publicly, so other researchers can build on this work and create even more powerful video AI systems.

Abstract

While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
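To see why the abstract calls this chain cumbersome for single-purpose models, here is a hypothetical sketch of the any-conditioned workflow as plain data flow. The function names (generate, edit, segment, compose) are placeholders for whatever tool servers would actually back each stage, not UniVA's interface.

```python
# Hypothetical data flow for: conditioned generation -> multi-round editing
# -> object segmentation -> compositional synthesis. All stubs.

def generate(condition: dict) -> str:
    """Text/image/video-conditioned generation (stubbed)."""
    return f"video from {condition['type']} condition: {condition['prompt']}"

def edit(video: str, instruction: str) -> str:
    """One editing round; called repeatedly for multi-round refinement."""
    return f"{video} | edited: {instruction}"

def segment(video: str, target: str) -> str:
    """Object segmentation on the current video."""
    return f"mask of '{target}' in ({video})"

def compose(video: str, mask: str, asset: str) -> str:
    """Compositional synthesis: insert an asset using the mask."""
    return f"({video}) with {asset} composited at {mask}"

# The chain a single-purpose model cannot run end-to-end:
video = generate({"type": "text", "prompt": "a dog on a beach"})
for instruction in ["make it sunset", "add gentle waves"]:  # multi-round edits
    video = edit(video, instruction)
mask = segment(video, "dog")
print(compose(video, mask, "a red frisbee"))
```

Each stage consumes the previous stage's output, so an agent that plans the chain and carries state between stages (as UniVA's planner and memory do) replaces what would otherwise be manual hand-offs between separate models.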