MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention
Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H. Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, Lili Qiu
2025-04-29
Summary
This paper introduces MMInference, a method that makes Vision Language Models (VLMs)—AI systems that understand both images and text—much faster when they process very long videos or large amounts of mixed visual and textual input at once.
What's the problem?
When these models handle a huge amount of video and text together, the attention mechanism compares every token (every frame patch and word) with every other token. This cost grows quadratically with input length, so the initial processing of a long input becomes extremely slow and compute-hungry.
What's the solution?
The researchers use a technique called dynamic sparse attention, which lets the model compute attention only between the parts of the video and text that actually matter, instead of all pairs at once. The method also exploits the sparse patterns specific to video input and handles the boundaries between modalities (where video tokens end and text tokens begin), so no effort is wasted on unimportant cross-modal comparisons. This accelerates the first stage of processing, called pre-filling, by up to 8.3x at one million tokens.
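The general idea of "attend only to the important block pairs" can be illustrated with a toy sketch. This is not the paper's kernel or API—`block_sparse_attention`, the mean-pooled block scoring, and the `boundary` handling are all hypothetical simplifications—but it shows how a cheap block-level estimate can select which regions of the attention matrix to compute:

```python
import numpy as np

def block_sparse_attention(q, k, v, block=4, keep_ratio=0.5, boundary=None):
    """Toy block-sparse attention (hypothetical, not MMInference itself):
    score (query-block, key-block) pairs cheaply with mean-pooled vectors,
    keep only the strongest pairs plus the diagonal, then run full attention
    restricted to the kept blocks."""
    n, d = q.shape
    nb = n // block
    # Cheap estimate: mean-pool each block, score block pairs.
    qb = q.reshape(nb, block, d).mean(axis=1)
    kb = k.reshape(nb, block, d).mean(axis=1)
    est = qb @ kb.T                                  # (nb, nb) block scores
    # Always keep the diagonal (local context) ...
    keep = np.eye(nb, dtype=bool)
    # ... plus the top-scoring key blocks per query block.
    k_top = max(1, int(keep_ratio * nb))
    top = np.argsort(-est, axis=1)[:, :k_top]
    keep[np.arange(nb)[:, None], top] = True
    # Crude stand-in for modality-boundary handling: queries at or past the
    # boundary (e.g. text after video) may attend to every block.
    if boundary is not None:
        keep[boundary // block:, :] = True
    # Expand the block mask to token level and run masked attention.
    mask = np.repeat(np.repeat(keep, block, axis=0), block, axis=1)
    scores = q @ k.T / np.sqrt(d)
    scores[~mask] = -np.inf
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ v
```

With `keep_ratio=1.0` every block is kept and the result matches dense attention; shrinking the ratio trades a small approximation error for skipping most of the attention matrix, which is where the pre-filling speedup comes from in real kernels.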
Why it matters?
This matters because it lets AI systems quickly and efficiently understand and answer questions about long videos or large multimodal inputs, making them far more practical for tasks like analyzing movies, sports games, or security footage.
Abstract
MMInference, a dynamic sparse attention method, accelerates the pre-filling stage of Vision Language Models by leveraging unique sparse patterns in video input and modality boundary handling, achieving up to 8.3x speedup at 1M tokens.