Directional Reasoning Injection for Fine-Tuning MLLMs
Chao Huang, Zeliang Zhang, Jiang Liu, Ximeng Sun, Jialian Wu, Xiaodong Yu, Ze Wang, Chenliang Xu, Emad Barsoum, Zicheng Liu
2025-10-23
Summary
This paper investigates how to make image-understanding AI models, called Multimodal Large Language Models (MLLMs), better at reasoning, for example solving math problems posed in images. It proposes a new, efficient way to improve their reasoning skills without needing huge amounts of extra training data or complex training methods.
What's the problem?
While MLLMs are getting good at understanding both images and text, they often aren't as strong at complex reasoning tasks compared to AI models that *only* work with text. Existing ways to fix this, like training on tons of examples or using reinforcement learning, are really expensive and require a lot of computing power. Simply combining a reasoning-focused text model with an image model doesn't always work well, and can even make performance worse depending on the specific models used.
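The "simply combining" approach above refers to naive model merging: linearly interpolating the parameters of a reasoning-enhanced text model and a multimodal model. A minimal sketch of that idea, with toy weight vectors standing in for real model parameters (all names and values are illustrative, not from the paper):

```python
import numpy as np

# Toy stand-ins for model weights; real MLLMs have billions of parameters.
reasoning_llm = {"w": np.array([1.0, 2.0, 3.0])}   # reasoning-enhanced text model
multimodal_llm = {"w": np.array([0.0, 1.0, 5.0])}  # multimodal variant

def naive_merge(theta_a, theta_b, alpha=0.5):
    """Linearly interpolate matching parameter tensors: alpha*A + (1-alpha)*B."""
    return {k: alpha * theta_a[k] + (1 - alpha) * theta_b[k] for k in theta_a}

merged = naive_merge(reasoning_llm, multimodal_llm, alpha=0.5)
print(merged["w"])  # -> [0.5 1.5 4. ]
```

As the paper observes, whether this interpolation helps or hurts depends heavily on the model family being merged.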
What's the solution?
The researchers developed a technique called DRIFT, which stands for Directional Reasoning Injection for Fine-Tuning. Instead of directly merging models or retraining everything from scratch, DRIFT first computes the difference in internal parameters between a strong reasoning model and the image-understanding model. Then, during a normal training run on images and text, it subtly nudges the model's updates in a direction that encourages better reasoning, based on that precomputed difference. This is a lightweight way to transfer reasoning ability without disrupting the model's ability to understand images.
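The two steps above (precompute a parameter-space difference, then use it to bias gradients during fine-tuning) can be sketched as follows. This is a toy illustration under our own assumptions: the bias rule `grad - beta * delta` and the hyperparameter `beta` are hypothetical, and the paper's exact update may differ.

```python
import numpy as np

# Toy parameter vectors; real models have billions of parameters.
theta_reason = np.array([1.0, 2.0, 3.0])  # reasoning-enhanced LLM
theta_mm = np.array([0.0, 1.0, 5.0])      # multimodal variant to fine-tune

# Step 1: precompute the reasoning prior as a parameter-space difference
# (a one-time cost, done before training starts).
delta = theta_reason - theta_mm

def drift_step(theta, grad, lr=0.1, beta=0.05):
    """One SGD step whose gradient is biased toward the reasoning prior.
    Subtracting beta*delta makes the descent step move slightly along
    +delta, i.e. toward the reasoning model's parameters."""
    biased_grad = grad - beta * delta
    return theta - lr * biased_grad

# Step 2: during multimodal supervised fine-tuning, bias each update.
theta = theta_mm.copy()
grad = np.array([0.2, -0.1, 0.3])  # gradient from one SFT batch (made up)
theta = drift_step(theta, grad)
```

Because the prior is computed once and applied as a cheap per-step adjustment, the standard supervised fine-tuning pipeline stays essentially unchanged.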
Why does it matter?
This research matters because it offers a practical, cost-effective way to improve the reasoning abilities of MLLMs. By avoiding massive datasets and computationally expensive training, it makes it easier to build AI systems that can understand and reason about the world as presented in images and text.
Abstract
Multimodal large language models (MLLMs) are rapidly advancing, yet their reasoning ability often lags behind that of strong text-only counterparts. Existing methods to bridge this gap rely on supervised fine-tuning over large-scale multimodal reasoning data or reinforcement learning, both of which are resource-intensive. A promising alternative is model merging, which interpolates parameters between reasoning-enhanced LLMs and multimodal variants. However, our analysis shows that naive merging is not always a "free lunch": its effectiveness varies drastically across model families, with some (e.g., LLaVA, Idefics) benefiting while others (e.g., Qwen) suffer performance degradation. To address this, we propose Directional Reasoning Injection for Fine-Tuning (DRIFT) MLLMs, a lightweight method that transfers reasoning knowledge in the gradient space, without destabilizing multimodal alignment. DRIFT precomputes a reasoning prior as the parameter-space difference between reasoning and multimodal variants, then uses it to bias gradients during multimodal fine-tuning. This approach preserves the simplicity of standard supervised fine-tuning pipelines while enabling efficient reasoning transfer. Extensive experiments on multimodal reasoning benchmarks, including MathVista and MathVerse, demonstrate that DRIFT consistently improves reasoning performance over naive merging and supervised fine-tuning, while matching or surpassing training-heavy methods at a fraction of the cost.