Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
Woody Haosheng Gan, Deqing Fu, Julian Asilis, Ollie Liu, Dani Yogatama, Vatsal Sharan, Robin Jia, Willie Neiswanger
2025-05-27
Summary
This paper presents a way to help large language models that work with both text and images understand visual information better, by steering them with directions derived purely from text, called textual steering vectors.
What's the problem?
The problem is that even though multimodal language models can handle both words and pictures, they often struggle to accurately ground what they read in text in what they see in images. Improving this grounding usually means fine-tuning the model or collecting large amounts of extra data and compute, which isn't always practical.
What's the solution?
The authors found that by extracting steering vectors from text using techniques such as sparse autoencoders, mean shift, and linear probing, they could guide the model to connect text and images more accurately. This boosts the model's accuracy on visual understanding tasks without updating the model's weights or requiring much additional data or computation.
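To make the idea concrete, here is a minimal sketch of one of the three techniques, mean shift: the steering vector is the difference between the mean hidden activations of text examples that express a target concept and those that do not, and it is added to the model's hidden states at inference time via a forward hook. The model name, layer index, steering strength, and prompt lists below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of mean-shift textual steering (illustrative, not the authors' code).
# Assumptions: a small HuggingFace causal LM as a stand-in, a hand-picked layer, toy prompts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # placeholder; the paper targets multimodal LLMs
layer_idx = 6         # which transformer block to steer (assumed)
alpha = 4.0           # steering strength (assumed)

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

# Toy contrastive text sets: prompts that do vs. do not express the target concept
# (here, hypothetically, "counting objects").
pos_texts = ["There are exactly three apples on the table.",
             "Count the objects: one, two, three, four."]
neg_texts = ["The weather today is pleasant and mild.",
             "She enjoys reading novels in the evening."]

def mean_activation(texts):
    """Mean last-token hidden state at layer_idx over a list of texts."""
    vecs = []
    with torch.no_grad():
        for t in texts:
            ids = tok(t, return_tensors="pt")
            out = model(**ids)
            # hidden_states[layer_idx + 1] is the output of block layer_idx
            # (index 0 holds the embedding layer output).
            vecs.append(out.hidden_states[layer_idx + 1][0, -1])
    return torch.stack(vecs).mean(dim=0)

# Mean-shift steering vector: difference of the two class means.
steer = mean_activation(pos_texts) - mean_activation(neg_texts)

# Add the vector to the chosen layer's output during generation via a forward hook,
# leaving the model's parameters untouched.
def hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[layer_idx].register_forward_hook(hook)
ids = tok("How many apples are on the table?", return_tensors="pt")
out_ids = model.generate(**ids, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.decode(out_ids[0]))
handle.remove()
```

The sparse-autoencoder and linear-probing variants follow the same pattern; only the way the steering direction is obtained changes, while the inference-time injection stays the same.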
Why it matters?
This is important because it offers a simple and efficient way to make AI systems better at tasks that involve both language and images, which can help in areas like education, accessibility, and creative design.
Abstract
Text-derived steering via sparse autoencoders, mean shift, and linear probing enhances multimodal accuracy in large language models without requiring parameter modifications or significant additional data or computation.