Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
2025-05-14
Summary
This paper is about merging AI language models with models that understand both language and images, so the image-understanding models gain reasoning skills without needing extra training.
What's the problem?
The problem is that models designed to understand images (vision-language models) usually aren't as good at reasoning or thinking through problems step by step as pure language models, and teaching them these skills normally requires costly additional training.
What's the solution?
The researchers merged parts of language models with vision-language models, allowing the vision models to inherit reasoning abilities directly. They studied which layers of the models contribute most to perception and reasoning, showing that this merging works well without retraining the models.
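The core idea can be sketched as layer-wise weight interpolation. This is a minimal illustration, not the paper's exact procedure: it assumes the VLM's language backbone and the reasoning LLM share the same architecture, so matching parameters can be blended tensor by tensor, with a per-layer coefficient deciding how much each layer leans toward the reasoning model. The state dicts and the coefficient schedule below are hypothetical toy values.

```python
import numpy as np

def merge_state_dicts(vlm_params, llm_params, alpha_for):
    """Interpolate matching tensors: merged = (1 - a) * vlm + a * llm,
    where the coefficient a may differ per parameter (e.g. by layer)."""
    merged = {}
    for name, vlm_w in vlm_params.items():
        if name in llm_params:
            a = alpha_for(name)
            merged[name] = (1 - a) * vlm_w + a * llm_params[name]
        else:
            # Vision-only tensors have no counterpart in the LLM; keep them.
            merged[name] = vlm_w
    return merged

# Toy example with two shared "layers" plus one vision-only tensor.
# Hypothetical schedule: lean harder on the LLM in the later layer.
vlm = {"layers.0.w": np.zeros(3), "layers.1.w": np.zeros(3),
       "vision.w": np.ones(3)}
llm = {"layers.0.w": np.ones(3), "layers.1.w": np.ones(3)}
alpha = lambda name: 0.8 if name.startswith("layers.1") else 0.2

out = merge_state_dicts(vlm, llm, alpha)
```

Because no gradients are computed, the merge is a single pass over the parameters; the interesting question the paper studies is which layers benefit from a higher coefficient.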
Why it matters?
This matters because it helps create smarter AI systems that can both see and think better, making them more useful for tasks like understanding pictures, answering questions about images, and solving complex problems involving both vision and language.
Abstract
Merging models across modalities effectively transfers reasoning abilities from LLMs to VLMs without additional training, revealing layer-specific contributions to perception and reasoning.