Vidi: Large Multimodal Models for Video Understanding and Editing
Vidi Team, Celong Liu, Chia-Wen Kuo, Dawei Du, Fan Chen, Guang Chen, Jiamin Yuan, Lingxi Zhang, Lu Guo, Lusha Li, Longyin Wen, Qingyu Chen, Rachel Deng, Sijie Zhu, Stuart Siew, Tong Jin, Wei Lu, Wen Zhong, Xiaohui Shen, Xin Gu, Xing Mei, Xueqiong Qu
2025-04-23
Summary
This paper introduces Vidi, a family of large multimodal AI models that can understand and edit videos by analyzing both the visuals and the audio over long durations, making them especially good at finding and working with specific moments in videos.
What's the problem?
The problem is that most current AI models struggle to handle long videos that mix different types of information, like visuals and audio, all at once. This makes it hard for them to accurately find or edit specific parts of a video, a gap that is especially visible when comparing open tools against more advanced proprietary models.
What's the solution?
The researchers created Vidi, which processes and understands long videos by combining information from multiple modalities, such as visuals and audio, and then uses that understanding for editing tasks. Its core capability is temporal retrieval: finding the time ranges in a video that match a text query, which is a key step when searching for moments or cutting clips. Vidi was tested on VUE-TR, a benchmark built to evaluate temporal retrieval for video editing, and shown to outperform proprietary models.
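To make temporal retrieval concrete, here is a minimal sketch of how a predicted time range can be scored against an annotated ground-truth range using temporal Intersection-over-Union (IoU), a standard overlap metric for this task. The function and example values are illustrative assumptions, not code from the paper, and the exact scoring protocol used by VUE-TR may differ.

    def temporal_iou(pred, gt):
        # Overlap ratio between two (start, end) time ranges, in seconds.
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
        return inter / union if union > 0 else 0.0

    # Hypothetical query: "when does the speaker show the product demo?"
    # The model predicts 42.0s-48.5s; the annotation says 41.0s-47.0s.
    pred_range = (42.0, 48.5)
    gt_range = (41.0, 47.0)
    print(f"IoU = {temporal_iou(pred_range, gt_range):.2f}")  # prints: IoU = 0.67

A higher IoU means the predicted range lines up more closely with the moment a human editor marked, so a model that scores well on this metric is directly useful for pulling clips out of long footage.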
Why does it matter?
This matters because it makes video editing and search far more accurate and efficient, helping creators, editors, and everyday users find and work with the exact moments they need in their footage.
Abstract
Vidi, a family of Large Multimodal Models for video understanding and editing, excels at temporal retrieval by processing long, multimodal video content, and it outperforms proprietary models on the VUE-TR benchmark.