
Understanding and Harnessing Sparsity in Unified Multimodal Models

Shwai He, Chaorui Deng, Ang Li, Shen Yan

2025-12-03


Summary

This paper investigates large AI models that can both understand and create content from different types of data, like text and images. It focuses on making these powerful, all-in-one models more efficient.

What's the problem?

These unified models, while versatile, can be wasteful: they use their full capacity even when a specific task doesn't need it. To find out which parts of these models are essential, the researchers pruned components without any retraining, both removing whole layers (depth pruning) and shrinking individual layers (width reduction), and measured how performance changed. They found that the *understanding* components tolerate substantial trimming, but the *generation* components are far more sensitive: even moderate compression sharply degraded the model's ability to create content.
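The depth-pruning probe described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual code: the toy residual "blocks" and the keep-every-other-layer rule are assumptions chosen only to show the idea of training-free depth pruning.

```python
import numpy as np

def make_block(rng, dim):
    """One toy residual block: x + x @ W (stands in for a transformer layer)."""
    W = rng.normal(scale=0.02, size=(dim, dim))
    return lambda x: x + x @ W

def forward(blocks, x):
    for block in blocks:
        x = block(x)
    return x

def depth_prune(blocks, keep_every=2):
    """Training-free depth pruning: keep every `keep_every`-th block, no retraining."""
    return blocks[::keep_every]

rng = np.random.default_rng(0)
dim = 8
blocks = [make_block(rng, dim) for _ in range(12)]
x = rng.normal(size=(dim,))

full_out = forward(blocks, x)                  # all 12 blocks active
pruned_out = forward(depth_prune(blocks), x)   # only 6 blocks active
```

A probe like this just compares task metrics of `pruned_out` against `full_out` at several compression ratios; the paper's finding is that this gap stays small for understanding components but grows quickly for generation components.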

What's the solution?

To fix this, the researchers took inspiration from how our brains work, activating only the parts needed for a given task. They split the content-generating part of the model into 'experts' and, for each input, activated only the experts best suited to it. This 'Mixture-of-Experts' (MoE) approach let them sharply reduce the number of active parameters without losing generation quality. They first fine-tuned the model with the experts frozen, which confirmed that sparse activation alone restores quality, and then made the experts fully trainable, which improved results further.
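The sparse-activation idea can be sketched as a top-k routed feed-forward layer. This is a generic MoE illustration under assumed shapes and a standard softmax router, not the paper's architecture: the dense FFN weights are split column-wise into experts, and only the two best-scoring experts run per token.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, n_experts, top_k = 16, 64, 8, 2

# Hypothetical dense FFN, partitioned into experts (illustrative, not the paper's weights).
W_in = rng.normal(scale=0.1, size=(d_model, d_hidden))
W_out = rng.normal(scale=0.1, size=(d_hidden, d_model))
experts_in = np.split(W_in, n_experts, axis=1)    # each: (d_model, d_hidden // n_experts)
experts_out = np.split(W_out, n_experts, axis=0)  # each: (d_hidden // n_experts, d_model)
W_gate = rng.normal(scale=0.1, size=(d_model, n_experts))  # learned router

def moe_ffn(x):
    """Run only the top-k experts for this token; the rest stay inactive."""
    logits = x @ W_gate
    chosen = np.argsort(logits)[-top_k:]                       # best-scoring experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                                       # renormalized softmax weights
    y = np.zeros(d_model)
    for g, e in zip(gates, chosen):
        h = np.maximum(x @ experts_in[e], 0.0)                 # this expert's FFN slice
        y += g * (h @ experts_out[e])
    return y

token = rng.normal(size=d_model)
out = moe_ffn(token)
```

With `top_k = 2` of 8 experts, only a quarter of the FFN parameters are touched per token; "expert-frozen tuning" would train only `W_gate` while keeping the expert slices fixed.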

Why it matters?

This work is important because it makes large multimodal AI models more practical. By activating only about half of the model's parameters, the adapted model matches the performance of the full model, meaning less computing power is needed at inference time. This makes these advanced AI systems more accessible and environmentally friendly, paving the way for wider use in various applications.

Abstract

Large multimodal models have achieved remarkable progress in both understanding and generation. Recent efforts pursue unified multimodal models that integrate heterogeneous components to support both capabilities within a single framework. However, such unification introduces inference inefficiencies, e.g., specific tasks or samples may not require the full knowledge or capacity of the unified model. Yet, a systematic understanding of how these inefficiencies manifest across different components remains limited. In this work, we first conduct a systematic analysis of unified multimodal model components using training-free pruning as a probing methodology, considering both depth pruning and width reduction. Our study reveals that the understanding component exhibits notable compressibility in both understanding and generation tasks, which is more pronounced in the latter. In contrast, the generation components are highly sensitive to compression, with performance deteriorating sharply even under moderate compression ratios. To address this limitation, we propose the Mixture-of-Experts (MoE) Adaptation, inspired by the dynamic activation patterns observed across different samples. This approach partitions the generation module into multiple experts and enables sparse activation to restore generation quality. We validate the effectiveness of sparse activation through expert-frozen tuning and further demonstrate that a fully trainable adaptation delivers additional gains. As a result, the adapted BAGEL model achieves performance comparable to the full model while activating only about half of its parameters. The code is released at https://github.com/Shwai-He/SparseUnifiedModel.