Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
Minghe Shen, Zhuo Zhi, Chonghan Liu, Shuo Xing, Zhengzhong Tu, Che Liu
2025-11-11
Summary
This paper investigates whether post-training AI models that understand both images and language (vision-language models) with reinforcement learning can genuinely improve their ability to solve visual problems, particularly spatial-reasoning tasks such as navigating mazes.
What's the problem?
Current AI models that combine vision and language are usually evaluated on tasks that lean on language skills, like solving math problems. It's unclear whether reinforcement learning can genuinely *teach* these models visual skills they didn't have before, especially on tasks the base model initially fails at, such as understanding where things are in relation to each other in order to find a path through a maze.
What's the solution?
The researchers created a virtual maze environment where the difficulty of each maze (for example, path length and number of turns) could be precisely controlled. They then trained the AI models in these mazes with reinforcement learning, rewarding them only for verifiably correct paths. The training followed a curriculum rather than random sampling: it started with easier mazes and gradually increased the difficulty. This approach significantly improved the models' ability to solve mazes, including ones they couldn't solve at all before training.
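The "rewards for finding the correct path" described above are verifiable: a maze solution can be checked mechanically, so the reward needs no learned judge. The sketch below illustrates this idea with a minimal grid maze and a binary reward; the maze encoding, move alphabet, and function name are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a verified reward for maze navigation (RLVR-style):
# the model's answer is a move string, and the reward is 1.0 only if the
# moves reach the goal without leaving the grid or hitting a wall ('#').

def solve_reward(maze, start, goal, moves):
    """Binary verified reward: 1.0 for a valid path from start to goal, else 0.0."""
    deltas = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    r, c = start
    for m in moves:
        dr, dc = deltas[m]
        r, c = r + dr, c + dc
        # Any step off the grid or into a wall invalidates the whole episode.
        if not (0 <= r < len(maze) and 0 <= c < len(maze[0])) or maze[r][c] == "#":
            return 0.0
    return 1.0 if (r, c) == goal else 0.0

maze = [
    ".#.",
    ".#.",
    "...",
]
# From top-left (0,0) to top-right (0,2): down around the wall, then back up.
print(solve_reward(maze, (0, 0), (0, 2), "DDRRUU"))  # -> 1.0
print(solve_reward(maze, (0, 0), (0, 2), "RR"))      # -> 0.0 (walks into a wall)
```

Because the reward is computed from the maze itself, difficulty can be dialed up (longer paths, more turns) while the verification stays exact.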
Why it matters?
This research shows that reinforcement learning can actually expand the capabilities of these AI models, allowing them to learn new visual skills. Importantly, the skills learned in the virtual mazes also transferred to real-world navigation tasks, like understanding maps of museums or subway systems, suggesting this method could be useful for building AI that can better interact with the physical world.
Abstract
While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework utilizing synthetic mazes for multi-step spatial reasoning where task difficulty (e.g., path length, turns) is precisely controlled. We leverage this controllable environment to train VLMs using Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, post-RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model's initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model's fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge our study is limited to the post-training phase, given the opaqueness of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.
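The abstract's "difficulty-aware curriculum" can be sketched as a simple scheduler that promotes the model to harder mazes once its recent success rate clears a threshold. The level values, threshold, window size, and class name below are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch of a difficulty-aware curriculum: track a sliding
# window of episode outcomes and advance to the next difficulty level
# (e.g. longer target path length) once the success rate is high enough.
from collections import deque

class Curriculum:
    def __init__(self, levels, threshold=0.8, window=50):
        self.levels = levels          # e.g. target path lengths [4, 8, 16]
        self.threshold = threshold    # success rate required to advance
        self.recent = deque(maxlen=window)
        self.stage = 0

    def current_difficulty(self):
        return self.levels[self.stage]

    def report(self, solved):
        """Record one episode; advance when the window is full and the
        success rate meets the threshold, then reset the window."""
        self.recent.append(1.0 if solved else 0.0)
        full = len(self.recent) == self.recent.maxlen
        if full and sum(self.recent) / len(self.recent) >= self.threshold:
            if self.stage < len(self.levels) - 1:
                self.stage += 1
                self.recent.clear()

cur = Curriculum(levels=[4, 8, 16], threshold=0.8, window=5)
print(cur.current_difficulty())   # -> 4
for _ in range(5):
    cur.report(solved=True)       # five consecutive successes fill the window
print(cur.current_difficulty())   # -> 8
```

Resetting the window after each promotion forces the model to demonstrate competence at the new difficulty before advancing again.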