Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
2025-06-30
Summary
This paper introduces SpatialReasoner-R1, a vision-language model designed to understand and reason about spatial relationships in images through detailed, logical steps.
What's the problem?
Current vision-language models struggle to capture fine-grained spatial details and to carry out multi-step reasoning over images, which limits their ability to answer spatial questions or interpret scenes correctly.
What's the solution?
The researchers developed SpatialReasoner-R1, which uses Multi-Model Monte Carlo Tree Search to generate diverse training examples containing detailed, step-by-step reasoning. They also introduced fine-grained Direct Preference Optimization, a training method that applies separate preference signals to descriptive grounding (what is where in the image) and to logical reasoning, improving the model's overall accuracy.
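To make the "separate preference signals" idea concrete, here is a minimal sketch of a fine-grained preference loss. It assumes segment-level log-probabilities (for descriptive vs. reasoning spans) are already available; the function names, weights, and beta values are illustrative assumptions, not the paper's actual implementation.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_term(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Standard DPO loss on one segment:
    -log sigmoid(beta * (policy log-ratio margin vs. reference))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(_sigmoid(beta * margin))

def fine_grained_dpo_loss(desc_logps, reason_logps,
                          beta_desc=0.1, beta_reason=0.3,
                          w_desc=0.5, w_reason=0.5):
    """Combine separate preference losses for descriptive grounding
    and logical-reasoning segments. Each *_logps is a tuple of
    (logp_winner, logp_loser, ref_logp_winner, ref_logp_loser).
    All hyperparameters here are hypothetical."""
    loss_desc = dpo_term(*desc_logps, beta=beta_desc)
    loss_reason = dpo_term(*reason_logps, beta=beta_reason)
    return w_desc * loss_desc + w_reason * loss_reason
```

The design point this sketch illustrates: because the two segment types get their own terms (and potentially their own betas and weights), the model can be pushed harder on logical reasoning without over-penalizing already-accurate descriptions, which a single sequence-level DPO loss cannot distinguish.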
Why it matters?
This is important because better spatial understanding in AI models can improve technologies such as robotics, augmented reality, and assistive tools for the visually impaired, making them more capable and reliable.
Abstract
SpatialReasoner-R1, a vision-language reasoning model, uses Multi-Model Monte Carlo Tree Search and fine-grained Direct Preference Optimization to improve spatial reasoning, setting a new state-of-the-art on SPATIALRGPT-Bench.