AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO

Alan Dao, Dinh Bach Vu

2025-02-21

Summary

This paper introduces AlphaMaze, a system that improves AI models' ability to understand and solve visual problems, like navigating mazes, by teaching them to reason step-by-step and make better decisions.

What's the problem?

AI language models are good at understanding text but struggle with tasks that require visual reasoning, such as solving mazes. These tasks are challenging because they involve understanding spatial relationships and making decisions based on them, which traditional language models aren't designed to handle.

What's the solution?

The researchers created AlphaMaze, which uses a two-step training process. First, they trained the AI using supervised learning with specially designed maze data to teach it basic navigation skills. Then, they used a technique called Group Relative Policy Optimization (GRPO) to refine the AI's decision-making by rewarding it for effective strategies. This approach improved the AI's accuracy in solving mazes from 86% to 93%, while also teaching it to think through problems logically and correct its own mistakes.
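The core of GRPO is that it scores each sampled answer relative to a group of answers for the same prompt, rather than training a separate value network. A minimal sketch of that group-relative advantage computation (the reward values and group size here are illustrative, not taken from the paper):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages for G sampled completions of one prompt.

    Each completion's reward is compared against the group mean and
    normalised by the group's standard deviation, so no separate
    critic/value model is needed.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Example: four sampled maze solutions, rewarded 1.0 for reaching the goal.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Successful rollouts end up with positive advantages and are reinforced; failed ones are pushed down, which is what rewards the "effective strategies" described above.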

Why it matters?

This matters because it shows how AI can be trained to handle tasks that combine visual and logical reasoning, which are important for applications like robotics and autonomous navigation. By bridging the gap between language understanding and spatial reasoning, AlphaMaze could lead to smarter AI systems capable of solving real-world problems that require both types of intelligence.

Abstract

Large Language Models (LLMs) have demonstrated impressive capabilities in language processing, yet they often struggle with tasks requiring genuine visual spatial reasoning. In this paper, we introduce a novel two-stage training framework designed to equip standard LLMs with visual reasoning abilities for maze navigation. First, we leverage Supervised Fine Tuning (SFT) on a curated dataset of tokenized maze representations to teach the model to predict step-by-step movement commands. Next, we apply Group Relative Policy Optimization (GRPO), a technique used in DeepSeek-R1, with a carefully crafted reward function to refine the model's sequential decision-making and encourage emergent chain-of-thought behaviors. Experimental results on synthetically generated mazes show that while a baseline model fails to navigate the maze, the SFT-trained model achieves 86% accuracy, and further GRPO fine-tuning boosts accuracy to 93%. Qualitative analyses reveal that GRPO fosters more robust and self-corrective reasoning, highlighting the potential of our approach to bridge the gap between language models and visual spatial tasks. These findings offer promising implications for applications in robotics, autonomous navigation, and other domains that require integrated visual and sequential reasoning.
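The abstract does not spell out the exact maze encoding or command vocabulary, but the idea of a tokenized maze paired with step-by-step movement commands can be sketched as follows (the grid characters and move names here are hypothetical placeholders, not the paper's actual tokens):

```python
# Hypothetical text-tokenized maze: '#' wall, '.' free cell,
# 'S' start, 'G' goal.
MAZE = [
    "#####",
    "#S..#",
    "#.#.#",
    "#..G#",
    "#####",
]

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def execute(maze, commands):
    """Follow a sequence of movement commands; True iff the goal is reached.

    A checker like this could back a binary reward signal: 1.0 for a
    command sequence that reaches the goal without hitting a wall, 0.0
    otherwise.
    """
    r, c = next((i, j) for i, row in enumerate(maze)
                for j, ch in enumerate(row) if ch == "S")
    for cmd in commands:
        dr, dc = MOVES[cmd]
        nr, nc = r + dr, c + dc
        if maze[nr][nc] == "#":  # illegal move: ran into a wall
            return False
        r, c = nr, nc
    return maze[r][c] == "G"

print(execute(MAZE, ["right", "right", "down", "down"]))  # True
```

Representing the maze as plain text is what lets a standard LLM, which only consumes tokens, attempt a spatial task at all; the model is trained to emit the command sequence, and a verifier like the one above scores it.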