
OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models

Keda Tao, Kele Shao, Bohan Yu, Weiqiang Wang, Jian Liu, Huan Wang

2025-11-19


Summary

This paper introduces OmniZip, a new method that lets computers process audio and video together much faster and more efficiently, without needing any extra training.

What's the problem?

Currently, when computers try to understand audio and video at the same time, they must process a huge number of audio and video 'tokens' (think of these as small pieces of the data), which demands a lot of computing power. Existing methods for compressing this data weren't designed to handle both modalities together, creating a performance bottleneck.

What's the solution?

OmniZip solves this by figuring out which parts of the audio are most important, then using that information to decide which parts of the video can be removed without losing crucial details. It's like highlighting the key sounds in a video and then focusing only on the visual parts that relate to those sounds. This process, called 'token compression', requires no additional training, so it can be applied right away.
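To make the idea concrete, here is a minimal sketch of what audio-guided video token pruning could look like in PyTorch. The tensor shapes, the norm-based saliency score, and the function name audio_guided_prune are all illustrative assumptions made for this explainer, not the authors' actual implementation.

```python
import torch

def audio_guided_prune(audio_tokens, video_tokens, base_keep=0.25):
    """Toy sketch of audio-guided video token pruning (not the paper's code).

    audio_tokens: (T, Na, D) - per-time-window audio token embeddings
    video_tokens: (T, Nv, D) - per-time-window video token embeddings
    base_keep:    baseline fraction of video tokens to retain
    Returns a list of retained video tokens, one tensor per time window.
    """
    T, Nv, D = video_tokens.shape
    kept = []
    for t in range(T):
        a, v = audio_tokens[t], video_tokens[t]

        # 1) Audio saliency: how strongly each audio token stands out in
        #    its window (norm-based here, purely for illustration).
        saliency = a.norm(dim=-1)
        retention = saliency.mean() / (saliency.mean() + 1.0)  # in [0, 1)

        # 2) Windows with denser audio information keep more video tokens.
        k = max(1, int(Nv * base_keep * (0.5 + retention)))

        # 3) Rank video tokens by cross-modal similarity to the most
        #    salient audio tokens ("audio anchors") and keep the top-k.
        anchors = a[saliency.topk(min(4, a.shape[0])).indices]
        sim = torch.matmul(v, anchors.T).max(dim=-1).values
        kept.append(v[sim.topk(k).indices])
    return kept

# Example: 8 time windows, 16 audio and 256 video tokens each, dim 64
# kept = audio_guided_prune(torch.randn(8, 16, 64), torch.randn(8, 256, 64))
```

The key design choice this sketch tries to capture is that the audio sets the budget: time windows with denser audio information keep more of their video tokens.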

Why it matters?

This is important because it allows for quicker and more efficient processing of audio and video, which is crucial for things like self-driving cars, video analysis, and any application that needs to understand the world through both sight and sound. By speeding up processing and reducing memory usage, OmniZip makes these technologies more practical and accessible.

Abstract

Omnimodal large language models (OmniLLMs) have recently attracted increasing research attention for unified audio-video understanding; however, processing audio-video token sequences creates a significant computational bottleneck. Existing token compression methods have yet to accommodate this emerging need to jointly compress multimodal tokens. To bridge this gap, we present OmniZip, a training-free, audio-guided audio-visual token-compression framework that optimizes multimodal token representation and accelerates inference. Specifically, OmniZip first identifies salient audio tokens, then computes an audio retention score for each time group to capture information density, thereby dynamically guiding video token pruning while preserving cues from audio anchors enhanced by cross-modal similarity. For each time window, OmniZip compresses the video tokens using an interleaved spatio-temporal scheme. Extensive empirical results demonstrate the merits of OmniZip: it achieves a 3.42x inference speedup and 1.4x memory reduction over other top-performing counterparts while maintaining performance with no training.
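As a rough illustration of the "interleaved spatio-temporal scheme" mentioned in the abstract, the sketch below alternates a spatial merge with a temporal merge over a grid of video tokens. The 2x2 average pooling, the pairwise frame merge, and the function name are assumptions for this explainer; the paper's actual compression operator may differ.

```python
import torch

def interleaved_spatiotemporal_compress(video_tokens):
    """Toy sketch of interleaved spatio-temporal token compression.

    video_tokens: (F, H, W, D) - frame-major grid of token embeddings,
    with F, H, W assumed even for simplicity.
    """
    F, H, W, D = video_tokens.shape
    x = video_tokens

    # Spatial step: merge each 2x2 patch of tokens inside a frame.
    x = x.reshape(F, H // 2, 2, W // 2, 2, D).mean(dim=(2, 4))

    # Temporal step: merge tokens from adjacent frame pairs.
    x = x.reshape(F // 2, 2, H // 2, W // 2, D).mean(dim=1)

    # 4x fewer tokens spatially, 2x fewer temporally: 1/8 overall.
    return x.reshape(F // 2, -1, D)

# Example: 16 frames of a 14x14 token grid, dim 64
# compressed = interleaved_spatiotemporal_compress(torch.randn(16, 14, 14, 64))
```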