Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
Kaichen Zhang, Yifei Shen, Bo Li, Ziwei Liu
2024-11-25

Summary
This paper presents a way to better understand the internal workings of large multimodal models (LMMs), which can process different types of data such as text, images, and audio, by using Sparse Autoencoders (SAEs) to decompose their internal representations.
What's the problem?
While LMMs are powerful and can handle various types of information, it can be difficult for humans to understand how these models represent and process that information internally. Current methods do not effectively explain how these models work or why they make certain decisions, leading to a lack of transparency in their reasoning abilities.
What's the solution?
The authors propose a framework that uses Sparse Autoencoders (SAEs) to break down the complex representations inside LMMs into simpler, more understandable features. They then build an automatic interpretation pipeline in which a larger LMM labels these features, and they show that activating individual features can steer the model's behavior. By applying this framework to LLaVA-NeXT-8B (with LLaVA-OV-72B as the interpreter), they demonstrate that it is possible to identify which features guide the model's responses and decisions.
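As a rough illustration of the first step, the sketch below shows a minimal sparse autoencoder applied to a batch of LMM hidden states: a linear encoder with a ReLU produces mostly-zero feature activations, and a linear decoder reconstructs the original activations, with an L1 penalty encouraging sparsity. The layer choice, dictionary size, and loss coefficients here are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a sparse autoencoder (SAE) over LMM hidden states.
# Illustrative only: the hidden size, dictionary size, and L1 coefficient
# are assumptions, not the authors' reported settings.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # maps activations to many candidate features
        self.decoder = nn.Linear(d_dict, d_model)   # reconstructs activations from those features

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))      # non-negative, mostly-zero feature activations
        reconstruction = self.decoder(features)
        return features, reconstruction


def sae_loss(x, features, reconstruction, l1_coeff=1e-3):
    # Reconstruction error keeps the features faithful to the original activations;
    # the L1 penalty drives most features to zero so each one is easier to interpret.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


# Example: decompose a batch of hidden states from one LMM layer (shapes assumed).
hidden_states = torch.randn(32, 4096)               # e.g. residual-stream activations
sae = SparseAutoencoder(d_model=4096, d_dict=65536)
features, recon = sae(hidden_states)
loss = sae_loss(hidden_states, features, recon)
loss.backward()
```

Once trained, each dictionary direction can be treated as a candidate "feature" whose meaning the interpreter model is asked to describe from the inputs that activate it most strongly.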
Why it matters?
This research is important because it provides insights into how LMMs function, which can help developers improve these models and make them more reliable. Understanding the internal mechanisms of LMMs not only enhances their performance but also helps in identifying and correcting errors, ultimately leading to safer and more effective AI applications in various fields.
Abstract
Recent advances in Large Multimodal Models (LMMs) have led to significant breakthroughs in both academia and industry. One question that arises is how we, as humans, can understand their internal neural representations. This paper takes an initial step towards addressing this question by presenting a versatile framework to identify and interpret the semantics within LMMs. Specifically, 1) we first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features. 2) We then present an automatic interpretation framework in which the open-semantic features learned by the SAE are interpreted by LMMs themselves. We employ this framework to analyze the LLaVA-NeXT-8B model using the LLaVA-OV-72B model, demonstrating that these features can effectively steer the model's behavior. Our results contribute to a deeper understanding of why LMMs excel in specific tasks, including EQ tests, and illuminate the nature of their mistakes along with potential strategies for their rectification. These findings offer new insights into the internal mechanisms of LMMs and suggest parallels with the cognitive processes of the human brain.
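The abstract's claim that features can "steer the model's behavior" typically amounts to adding a feature's decoder direction back into the model's hidden state at some layer with a chosen scale. The sketch below illustrates that kind of intervention under assumed tensor shapes and a hypothetical feature index; it is not necessarily the authors' exact procedure.

```python
# Minimal sketch of feature steering with a trained SAE, not necessarily the
# authors' exact intervention. The feature index, layer, and scale below are
# hypothetical placeholders chosen only for illustration.
import torch


def steer(hidden_state: torch.Tensor, decoder_weight: torch.Tensor,
          feature_idx: int, scale: float) -> torch.Tensor:
    # For an nn.Linear(d_dict, d_model) decoder, column `feature_idx` of its
    # weight matrix is the direction that feature writes into the model's
    # hidden state; adding a scaled copy nudges the model toward the concept
    # that feature represents.
    direction = decoder_weight[:, feature_idx]          # shape: (d_model,)
    return hidden_state + scale * direction


# Dummy shapes for illustration: one layer's activations for a 16-token prompt.
d_model, d_dict = 4096, 65536
hidden = torch.randn(16, d_model)                       # assumed hook output at one layer
decoder_weight = torch.randn(d_model, d_dict)           # stands in for sae.decoder.weight
steered = steer(hidden, decoder_weight, feature_idx=1234, scale=8.0)
```

In practice such an edit would be applied inside a forward hook at the layer where the SAE was trained, so that subsequent layers and the generated text reflect the amplified feature.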