
Efficient Multi-modal Large Language Models via Progressive Consistency Distillation

Zichen Wen, Shaobo Wang, Yufa Zhou, Junyuan Zhang, Qintong Zhang, Yifeng Gao, Zhaorun Chen, Bin Wang, Weijia Li, Conghui He, Linfeng Zhang

2025-10-06


Summary

This paper focuses on making multi-modal large language models (MLLMs), which process both text and images, more efficient. These models currently require a lot of computing power, and much of it goes into processing the visual tokens that represent images.

What's the problem?

When researchers try to make these models more efficient by compressing the visual tokens they process, the models become much harder to train: the compression drastically changes the features the model sees, and its parameters struggle to adapt. Imagine trying to learn something when the information you're given keeps changing drastically; it's confusing. Existing methods largely ignore this added learning difficulty caused by compressing the image data.

What's the solution?

The researchers developed a new training framework called EPIC, which stands for Efficient MLLMs via Progressive Consistency Distillation. They break down *how* token compression perturbs the model's features, looking at it both across individual image pieces (token-wise) and across the different layers of the model (layer-wise). Then they use a 'teacher' model to guide the learning process and gradually increase how aggressively the image tokens are compressed as training progresses, so the model adapts step by step rather than facing the full change all at once. This makes the learning process smoother and more effective. A rough code sketch of this idea follows.
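To make the idea concrete, here is a minimal PyTorch-style sketch of what progressive, teacher-guided consistency training could look like. This is an illustration only, not the authors' implementation: the toy model, the `compress_tokens` function, and the `keep_ratio_schedule` are hypothetical stand-ins, and the paper's actual token-wise and layer-wise distillation losses are more elaborate.

```python
# Minimal sketch of progressive consistency distillation (assumptions only,
# not the paper's code): a student sees compressed visual tokens, a teacher
# sees the full tokens, and the compression is ramped up over training.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMLLM(nn.Module):
    """Toy stand-in for an MLLM: fuses visual and text tokens, predicts logits."""

    def __init__(self, dim=64, vocab=100):
        super().__init__()
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, visual_tokens, text_tokens):
        x = torch.cat([visual_tokens, text_tokens], dim=1)
        # Return logits only for the text positions.
        return self.head(self.fuse(x))[:, visual_tokens.size(1):]


def compress_tokens(visual_tokens, keep_ratio):
    """Keep a fraction of visual tokens (random here; real methods use
    importance scores, merging, etc.)."""
    n = visual_tokens.size(1)
    k = max(1, int(n * keep_ratio))
    idx = torch.randperm(n)[:k]
    return visual_tokens[:, idx]


def keep_ratio_schedule(step, total_steps, start=1.0, end=0.25):
    """Progressively move from light to heavy compression during training."""
    t = step / max(1, total_steps)
    return start + t * (end - start)


student = TinyMLLM()
teacher = TinyMLLM()
teacher.load_state_dict(student.state_dict())
teacher.eval()  # teacher provides soft targets from the uncompressed tokens

opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
total_steps = 1000

for step in range(total_steps):
    visual = torch.randn(2, 32, 64)          # dummy visual tokens
    text = torch.randn(2, 8, 64)             # dummy text tokens
    labels = torch.randint(0, 100, (2, 8))   # dummy next-token labels

    ratio = keep_ratio_schedule(step, total_steps)
    student_logits = student(compress_tokens(visual, ratio), text)
    with torch.no_grad():
        teacher_logits = teacher(visual, text)

    # Task loss on compressed inputs plus a consistency (distillation) loss
    # pulling the student toward the teacher's full-token predictions.
    task_loss = F.cross_entropy(student_logits.reshape(-1, 100), labels.reshape(-1))
    consist_loss = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
    loss = task_loss + consist_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The key design choice sketched here is the schedule: starting with light compression and ending with heavy compression keeps each training step close to the previous one, which is the "progressive" part of the framework.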

Why it matters?

This work is important because it provides a way to build more efficient multi-modal models without sacrificing their performance. More efficient models mean they can be used on less powerful hardware, making them more accessible and practical for a wider range of applications, and potentially reducing the cost of running them.

Abstract

Visual tokens consume substantial computational resources in multi-modal large language models (MLLMs), significantly compromising their efficiency. Recent works have attempted to improve efficiency by compressing visual tokens during training, either through modifications to model components or by introducing additional parameters. However, they often overlook the increased learning difficulty caused by such compression, as the model's parameter space struggles to quickly adapt to the substantial perturbations in the feature space induced by token compression. In this work, we propose to develop Efficient MLLMs via Progressive Consistency Distillation (EPIC), a progressive learning framework. Specifically, by decomposing the feature space perturbations introduced by token compression along the token-wise and layer-wise dimensions, we introduce token consistency distillation and layer consistency distillation, respectively, aiming to reduce the training difficulty by leveraging guidance from a teacher model and following a progressive learning trajectory. Extensive experiments demonstrate the superior effectiveness, robustness, and generalization capabilities of our proposed framework.