Mavors: Multi-granularity Video Representation for Multimodal Large Language Model
Yang Shi, Jiaheng Liu, Yushuo Guan, Zhenhua Wu, Yuanxing Zhang, Zihao Wang, Weihong Lin, Jingyun Hua, Zekun Wang, Xinlong Chen, Bohan Zeng, Wentao Zhang, Fuzheng Zhang, Wenjing Yang, Di Zhang
2025-04-15
Summary
This paper introduces Mavors, a framework that helps multimodal AI models understand long, complex videos by representing them at multiple levels of detail, keeping track of both what appears in each scene and how things change over time.
What's the problem?
Most AI models struggle to make sense of long videos: keeping every visual detail and the order of events in memory at once is hard, especially when a lot is happening both within frames and across time. As a result, these models answer questions about long videos inaccurately and describe them unreliably.
What's the solution?
The researchers created Mavors, which represents a video at multiple granularities, from fine-grained details up to whole scenes. Within each short chunk of video, 3D convolutions capture motion and spatial layout while Vision Transformers encode what each frame contains; across chunks, transformer-based dependency modeling tracks how everything connects over the full video. This lets the model keep both the details and the overall story straight.
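To make the within-chunk half of this concrete, here is a minimal PyTorch sketch of the idea, not the authors' implementation: a 3D convolution turns a short chunk of frames into spatio-temporal patch tokens, and a small ViT-style transformer then encodes them. All names and layer sizes here (IntraChunkEncoder, dim=512, the 2x16x16 tubelet) are hypothetical choices for illustration.

```python
# Illustrative sketch (not the paper's code): encode one short video chunk.
import torch
import torch.nn as nn

class IntraChunkEncoder(nn.Module):
    def __init__(self, dim=512, depth=4, heads=8):
        super().__init__()
        # 3D conv "tubelet" embedding: each token spans 2 frames x 16x16 pixels,
        # so local motion and spatial layout are baked into every token.
        self.tubelet_embed = nn.Conv3d(
            in_channels=3, out_channels=dim,
            kernel_size=(2, 16, 16), stride=(2, 16, 16),
        )
        # ViT-style transformer over the resulting patch tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.vit = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, chunk):
        # chunk: (batch, 3, frames, height, width)
        x = self.tubelet_embed(chunk)      # (B, dim, T', H', W')
        x = x.flatten(2).transpose(1, 2)   # (B, tokens, dim)
        return self.vit(x)                 # per-chunk token features

encoder = IntraChunkEncoder()
chunk = torch.randn(1, 3, 8, 224, 224)     # one 8-frame chunk
tokens = encoder(chunk)
print(tokens.shape)                        # torch.Size([1, 784, 512])
```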
Why it matters?
This work matters because it makes it possible for AI to better understand and analyze long videos, which could help with things like video search, summarizing movies, or even helping people with visual impairments know what's happening in a video. It pushes AI closer to understanding videos as well as humans do.
Abstract
Mavors is a novel framework for long-context video understanding. Its multi-granularity video representation preserves spatial fidelity within video chunks via 3D convolutions and Vision Transformers, and maintains temporal continuity across chunks via transformer-based dependency modeling.
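As a companion to the sketch above, here is an equally hypothetical sketch of the across-chunk half: each chunk's tokens are pooled into one summary vector, tagged with its position in the video, and a transformer then models dependencies across chunks. The mean pooling and the learned positional embedding are simplifications standing in for whatever the actual model uses.

```python
# Illustrative sketch (not the paper's code): mix information across chunks.
import torch
import torch.nn as nn

class InterChunkAggregator(nn.Module):
    def __init__(self, dim=512, depth=2, heads=8, max_chunks=256):
        super().__init__()
        # Learned chunk-order embedding (a stand-in positional scheme).
        self.pos = nn.Embedding(max_chunks, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, chunk_tokens):
        # chunk_tokens: (batch, num_chunks, tokens_per_chunk, dim),
        # e.g. the per-chunk features from the intra-chunk encoder above.
        summaries = chunk_tokens.mean(dim=2)                        # (B, C, dim)
        idx = torch.arange(summaries.size(1), device=summaries.device)
        summaries = summaries + self.pos(idx)                       # mark chunk order
        return self.mixer(summaries)                                # (B, C, dim)

agg = InterChunkAggregator()
video = torch.randn(1, 12, 784, 512)   # 12 chunks of intra-chunk features
context = agg(video)
print(context.shape)                   # torch.Size([1, 12, 512])
```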