Geometrically-Constrained Agent for Spatial Reasoning
Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng
2025-12-01
Summary
This paper addresses a key weakness in Vision Language Models (VLMs): their difficulty with accurately understanding and reasoning about spatial relationships, even though they're good at understanding *what* things are. It introduces a new method called Geometrically-Constrained Agent (GCA) to improve this spatial reasoning ability.
What's the problem?
VLMs struggle to connect what they 'understand' about a scene to the precise geometric details. They can tell you something is 'to the left of' something else, but their internal understanding of 'left' isn't tied to actual measurements or a consistent frame of reference. Existing attempts to fix this either rely on teaching the model with potentially incorrect examples, or only constrain the final answer, not the *process* the model uses to get there, leading to flawed plans.
What's the solution?
The researchers propose GCA, which breaks down the problem into two steps without needing any additional training. First, the VLM acts like an analyst, taking a user's request and translating it into a very specific, verifiable set of rules defining the task and how space is organized. Second, the VLM then acts as a solver, using tools to complete the task, but *always* following those strict rules. This ensures the reasoning is geometrically sound because it's built on a solid, defined foundation.
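The two-stage flow above can be sketched in code. Everything here is illustrative: the paper does not publish this API, so the `TaskConstraint` structure, the hard-coded analyst rule, and the toy detections are assumptions standing in for real VLM prompting and geometric tool calls.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of GCA's two-stage pipeline; names and data
# structures are illustrative, not the authors' implementation.

@dataclass
class TaskConstraint:
    reference_frame: str           # e.g. "camera" or "object-centric"
    objective: str                 # formalized goal of the query
    check: Callable[[dict], bool]  # verifiable predicate on a candidate answer

def analyst_stage(query: str) -> TaskConstraint:
    """Stage 1: translate an ambiguous user query into a formal,
    verifiable constraint. A real system would prompt the VLM here;
    this sketch hard-codes a single 'leftmost object' case."""
    if "left" in query:
        return TaskConstraint(
            reference_frame="camera",
            objective="argmin over detected objects of x-coordinate",
            check=lambda ans: ans["x"] == min(o["x"] for o in ans["all"]),
        )
    raise ValueError("unhandled query in this toy sketch")

def solver_stage(constraint: TaskConstraint, detections: list[dict]) -> dict:
    """Stage 2: execute tool calls (here, stand-in 3D detector output)
    and only return an answer that satisfies the constraint's predicate."""
    candidate = min(detections, key=lambda o: o["x"])  # simulated tool call
    answer = {"name": candidate["name"], "x": candidate["x"], "all": detections}
    assert constraint.check(answer), "plan violated the task constraint"
    return answer

detections = [{"name": "chair", "x": -0.8}, {"name": "table", "x": 0.4}]
result = solver_stage(analyst_stage("Which object is on the left?"), detections)
print(result["name"])  # -> chair
```

The key design point this sketch tries to capture is that the solver cannot return an answer the constraint's predicate rejects, so the reasoning path, not just the final number, is bound by the analyst's formal specification.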
Why it matters?
This work is important because it significantly improves the ability of VLMs to handle tasks requiring precise spatial understanding, like robotics or navigation. By achieving state-of-the-art results on spatial reasoning tests, and doing so without needing to retrain the model, GCA offers a practical and effective way to make these models more reliable and useful in real-world applications.
Abstract
Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.