AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

2024-12-04

Summary

This paper introduces AIM, a training-free method that improves the efficiency of multi-modal large language models (LLMs) by merging and pruning visual tokens, reducing computational demands while maintaining performance.

What's the problem?

Multi-modal LLMs, which can understand both text and visual data such as images and videos, often require substantial computational power because of the large number of visual tokens they process. This high demand makes these models hard to deploy when resources are limited or when handling long videos, resulting in high cost and slow inference.

What's the solution?

The researchers propose a training-free adaptive inference method called AIM. This method involves two main steps: first, it merges similar visual tokens before they enter the LLM to reduce their number without losing important information. Second, it prunes (removes) less important tokens within the LLM itself based on their relevance to the task at hand. This approach allows the model to operate more efficiently by cutting down on unnecessary computation while still delivering accurate results.
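The two steps above (similarity-based merging before the LLM, then importance-based pruning inside it) can be sketched in simplified form. This is a minimal illustration, not the paper's implementation: the greedy adjacent-pair merging, the averaging rule, and the importance scores here are hypothetical stand-ins for AIM's iterative merging and multi-modal importance criterion.

```python
import numpy as np

def merge_similar_tokens(tokens, keep_ratio=0.5):
    """Greedily merge the most similar adjacent token pair until a target
    count is reached (a simplified stand-in for AIM's iterative merging)."""
    tokens = tokens.astype(np.float64)
    target = max(1, int(len(tokens) * keep_ratio))
    while len(tokens) > target:
        # Cosine similarity between each pair of consecutive tokens.
        norms = np.linalg.norm(tokens, axis=1)
        sims = (tokens[:-1] * tokens[1:]).sum(axis=1) / (norms[:-1] * norms[1:] + 1e-8)
        i = int(np.argmax(sims))                # most redundant adjacent pair
        merged = (tokens[i] + tokens[i + 1]) / 2  # average the pair into one token
        tokens = np.concatenate([tokens[:i], merged[None], tokens[i + 2:]])
    return tokens

def prune_by_importance(tokens, importance, keep_ratio=0.5):
    """Keep only the top-k tokens ranked by an importance score
    (a stand-in for AIM's multi-modal importance within LLM layers)."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.sort(np.argsort(importance)[-k:])  # preserve original token order
    return tokens[keep]

# Toy example: 16 visual tokens of dimension 8.
tokens = np.random.RandomState(0).randn(16, 8)
merged = merge_similar_tokens(tokens, keep_ratio=0.5)   # 16 -> 8 tokens
importance = np.abs(merged).mean(axis=1)                # hypothetical score
pruned = prune_by_importance(merged, importance, keep_ratio=0.5)  # 8 -> 4 tokens
```

In the actual method, merging happens once on the visual encoder's output before tokens enter the LLM, while pruning is applied progressively across LLM layers, so the token count keeps shrinking as computation proceeds.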

Why it matters?

This research is important because it makes advanced AI models more accessible and practical for real-world applications where computational resources may be limited, such as on mobile devices or in low-power environments. By improving how these models handle visual data, AIM can enhance tasks like video understanding and image analysis, making AI technology more effective and user-friendly.

Abstract

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders, leading to high computational demands, which limits their applicability in resource-constrained environments and for long-context tasks. In this work, we propose a training-free adaptive inference method for multi-modal LLMs that can accommodate a broad range of efficiency requirements with minimal performance drop. Our method consists of a) iterative token merging based on embedding similarity before LLMs, and b) progressive token pruning within LLM layers based on multi-modal importance. With a minimalist design, our method can be applied to both video and image LLMs. Extensive experiments on diverse video and image benchmarks demonstrate that our method substantially reduces computation load (e.g., a 7-fold reduction in FLOPs) while preserving the performance of video and image LLMs. Further, under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding (e.g., +4.6 on MLVU). Additionally, our in-depth analysis provides insights into token redundancy and LLM layer behaviors, offering guidance for future research in designing efficient multi-modal LLMs. Our code will be available at https://github.com/LaVi-Lab/AIM.