- The paper introduces a novel two-stage training framework, integrating SFT and GRPO to enhance LLM maze-solving precision.
- It tokenizes maze structures for step-by-step movement prediction, achieving up to 93% accuracy on the MazeBench benchmark.
- GRPO refines chain-of-thought reasoning in LLMs, demonstrating significant improvements in visual spatial reasoning.
The paper introduces a novel two-stage training framework, combining Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO), to enhance the visual spatial reasoning capabilities of LLMs for maze navigation. The method leverages a tokenized representation of mazes to train LLMs to predict step-by-step movement commands.
Introduction
The paper addresses the challenge of endowing standard LLMs with robust visual reasoning capabilities, particularly spatial understanding and sequential decision-making in visual environments. The authors hypothesize that, given a tokenized visual representation of a maze, an LLM can learn to predict the step-by-step movement commands needed to navigate from a given origin to a target. The core of their approach is a two-stage training framework combining SFT and GRPO, drawing inspiration from DeepSeek-R1 [Guo2025DeepSeekR1]. To evaluate maze-solving ability, the authors introduce MazeBench, a benchmark of maze-navigation challenges, and use it to measure both solution accuracy and the sophistication of the model's emergent reasoning behavior.
Related Work
The work builds upon Chain-of-Thought (CoT) prompting [Wei2022ChainofThought], adapting the idea of a step-by-step thought process in LLMs to visual spatial reasoning for maze navigation. SFT is used to equip the LLM with the ability to process tokenized visual maze inputs and predict movement tokens, with more sophisticated reasoning then built up through reinforcement learning. The authors draw on the reward-shaping strategies used in DeepSeek-R1 and adapt them to visual maze navigation, designing reward components that encourage accuracy, valid movement sequences, and proper output formatting. They likewise adapt DeepSeek-R1's GRPO optimization strategy to the maze task, hypothesizing that similar RL techniques can drive the emergence of visual spatial reasoning in standard LLMs. The approach differs from traditional maze solvers in that it leverages the reasoning capabilities of LLMs and adapts them to process and reason about visual spatial information.
Methodology
The authors designed a tokenized input format to enable the LLM to process maze information visually. Each cell in the maze is represented by a coordinate token <|row-col|>. Wall information is encoded using tokens such as <|no_wall|>, <|up_wall|>, etc. The origin and target locations are marked with <|origin|> and <|target|> tokens, respectively, and empty spaces are filled with <|blank|> tokens.
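As an illustration, here is a minimal Python sketch of how such a serialization might look for a 5x5 grid; the coordinate, wall, origin, target, and blank tokens follow the examples above, but the traversal order, the naming of multi-wall tokens, and the `walls_by_cell` data structure are assumptions rather than the paper's exact format.

```python
# Illustrative sketch of the tokenized maze representation (assumed details noted above).

def wall_token(walls: set) -> str:
    """Map a cell's wall set to a single token, e.g. {'up'} -> '<|up_wall|>'."""
    if not walls:
        return "<|no_wall|>"
    return "<|" + "_".join(sorted(walls)) + "_wall|>"  # multi-wall naming is an assumption

def tokenize_maze(walls_by_cell, origin, target, size=5):
    """Serialize a size x size maze row by row into a flat token string."""
    tokens = []
    for r in range(size):
        for c in range(size):
            tokens.append(f"<|{r}-{c}|>")                     # coordinate token
            tokens.append(wall_token(walls_by_cell[(r, c)]))  # wall token
            if (r, c) == origin:
                tokens.append("<|origin|>")
            elif (r, c) == target:
                tokens.append("<|target|>")
            else:
                tokens.append("<|blank|>")
    return " ".join(tokens)

# Example: an open 5x5 grid with the origin at the top-left and the target at the bottom-right.
maze_text = tokenize_maze({(r, c): set() for r in range(5) for c in range(5)}, (0, 0), (4, 4))
```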
To establish performance benchmarks, the authors employed three distinct baseline models, leveraging the DeepSeek-R1 Distill-Qwen family of LLMs: DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Qwen-1.5B, and a customized DeepSeek-R1-Distill-Qwen-1.5B with SFT that predicts the entire sequence of movement tokens to solve a maze in a single forward pass.
For the SFT stage, the authors curated a training dataset of synthetically generated mazes of a fixed size but varied complexity. Each maze was paired with an annotated step-by-step solution, represented as a sequence of movement tokens. The Qwen 1.5B SFT model was trained on this dataset to predict the next movement token at each step, conditioned on the maze input and the preceding movement tokens in the sequence.
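A minimal sketch of how an SFT example might be assembled under this setup is shown below; the prompt wording, the movement-token names, and the prompt/completion field layout are assumptions, with the loss taken only over the completion as in standard causal-LM fine-tuning.

```python
# Sketch of SFT example construction: the tokenized maze forms the prompt and the
# annotated solution becomes the completion the model learns to generate.
# Prompt wording, movement-token names, and field layout are assumptions.

MOVE_TOKENS = {"up": "<|up|>", "down": "<|down|>", "left": "<|left|>", "right": "<|right|>"}

def build_sft_example(maze_tokens: str, solution_moves: list) -> dict:
    prompt = f"Solve the maze:\n{maze_tokens}\n"
    completion = " ".join(MOVE_TOKENS[m] for m in solution_moves)
    # Standard causal-LM SFT: the loss is taken over the completion tokens only,
    # so each movement token is predicted conditioned on the maze and prior moves.
    return {"prompt": prompt, "completion": completion}

example = build_sft_example("<|0-0|> <|no_wall|> <|origin|> ...", ["right", "down"])
```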
Following SFT, GRPO was applied to further enhance the model's maze-solving capabilities and encourage more robust reasoning. The GRPO stage used a smaller dataset than the SFT stage. The reward function consisted of a correctness reward (+0.2 per solution step), an integrity reward (+0.5 for each valid movement token), and a thinking-tag reward (+0.25 for correctly using the <think> tag). The GRPO algorithm, as employed in DeepSeek-R1, estimates advantages from relative scores within a group of sampled completions rather than from a learned value function. The overall pipeline thus consists of two stages, SFT followed by GRPO, mirroring the multi-stage training approach of DeepSeek-R1. All experiments were conducted on NVIDIA A6000 GPUs with LoRA (Low-Rank Adaptation) [hu2021loralowrankadaptationlarge] for parameter-efficient fine-tuning.
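The sketch below illustrates how the three reward components and GRPO's group-relative advantage estimation could be implemented. The reward weights follow the figures quoted above, while the token parsing, the handling of partial or over-long move sequences, and the function names are assumptions.

```python
import re
import statistics

MOVE_RE = re.compile(r"<\|(?:up|down|left|right)\|>")

def shaped_reward(output: str, reference_moves: list) -> float:
    """Sketch of the three reward components; reference_moves is the annotated
    solution as movement tokens. Scoring of partial matches is an assumption."""
    reward = 0.0
    # Thinking-tag reward: +0.25 for correctly wrapping reasoning in <think> tags.
    if re.search(r"<think>.*?</think>", output, flags=re.DOTALL):
        reward += 0.25
    moves = MOVE_RE.findall(output)
    # Integrity reward: +0.5 for each valid movement token emitted.
    reward += 0.5 * len(moves)
    # Correctness reward: +0.2 per step matching the reference solution, in order.
    for predicted, expected in zip(moves, reference_moves):
        if predicted != expected:
            break
        reward += 0.2
    return reward

def group_relative_advantages(rewards: list) -> list:
    """GRPO-style advantage estimation: each sampled completion's reward is
    normalized by the mean and standard deviation of its sampling group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]
```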
Experiments and Results
The training data consists of synthetically generated mazes created with the maze-dataset framework [ivanitskiy2023configurablelibrarygeneratingmanipulating]. The mazes are fixed-size 5x5 grids generated with a randomized depth-first search algorithm, which guarantees the existence of a solution path between the designated origin and target points in every maze. The data is divided into three subsets: 500,000 mazes for SFT, 16,000 mazes for GRPO, and 30,000 mazes for evaluation.
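For intuition, the following library-agnostic sketch shows randomized depth-first-search maze carving; because the carved passages form a spanning tree over all cells, any origin/target pair is guaranteed to be connected. It does not use the maze-dataset API, whose exact interface is not described here.

```python
import random

def generate_maze_dfs(size=5, seed=None):
    """Carve a size x size maze with randomized depth-first search.
    Returns, for each cell, the set of open (wall-free) directions."""
    rng = random.Random(seed)
    deltas = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    opposite = {"up": "down", "down": "up", "left": "right", "right": "left"}
    open_dirs = {(r, c): set() for r in range(size) for c in range(size)}

    visited = {(0, 0)}
    stack = [(0, 0)]
    while stack:
        r, c = stack[-1]
        unvisited = [
            (d, (r + dr, c + dc))
            for d, (dr, dc) in deltas.items()
            if 0 <= r + dr < size and 0 <= c + dc < size and (r + dr, c + dc) not in visited
        ]
        if not unvisited:
            stack.pop()             # dead end: backtrack
            continue
        d, nxt = rng.choice(unvisited)
        open_dirs[(r, c)].add(d)    # knock down the wall between the two cells
        open_dirs[nxt].add(opposite[d])
        visited.add(nxt)
        stack.append(nxt)
    return open_dirs
```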
To rigorously evaluate the spatial reasoning and planning capabilities of LLMs, the authors introduce MazeBench, a curated benchmark of 100 maze-solving challenges structured into three difficulty levels: Easy (50 mazes, 1-4 steps), Medium (40 mazes, 5-8 steps), and Hard (10 mazes, 9-13 steps). During evaluation, the LLM's output is parsed to extract movement tokens, and a solution is counted as incorrect if the extracted sequence does not reach the target or leads to an invalid state. The evaluation metric is the success rate: the percentage of mazes solved correctly.
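A sketch of how this evaluation might be implemented is shown below: movement tokens are extracted from the model output, replayed on the maze, and the maze counts as solved only if every move is legal and the walk ends on the target. The maze representation matches the generation sketch above, and the exact parsing rules are assumptions.

```python
import re

DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def solved(output: str, open_dirs, origin, target, size=5) -> bool:
    """True iff the extracted movement sequence legally walks from origin to target."""
    moves = re.findall(r"<\|(up|down|left|right)\|>", output)
    r, c = origin
    for m in moves:
        if m not in open_dirs[(r, c)]:             # blocked by a wall -> invalid state
            return False
        dr, dc = DELTAS[m]
        r, c = r + dr, c + dc
        if not (0 <= r < size and 0 <= c < size):  # defensive bounds check
            return False
    return (r, c) == target

def success_rate(cases) -> float:
    """cases: iterable of (model_output, open_dirs, origin, target) tuples."""
    cases = list(cases)
    return 100.0 * sum(solved(*case) for case in cases) / len(cases)
```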
The baseline model achieved 0% accuracy on MazeBench, while the SFT-only model reached 86.0%. Further enhancement with GRPO led to significant improvement, reaching 93.0% after 1600 steps of GRPO training. MazeBench scores over GRPO steps displayed a steady increase, indicating that GRPO effectively guides the model towards improved maze-solving policies. Qualitative analysis of model outputs revealed that the SFT+GRPO model exhibited more sophisticated reasoning behavior, with emergent chain-of-thought patterns and instances reminiscent of "aha moments" reported in DeepSeek-R1.
Discussion
The results demonstrate the incremental benefit of GRPO in enhancing visual maze reasoning within LLMs. The SFT+GRPO model exhibited more pronounced chain-of-thought reasoning patterns and instances of self-correction, indicating that GRPO encourages more sophisticated reasoning processes. The authors note that while the base DeepSeek-R1 model can perform visual reasoning with an extremely long context window, the distilled variants do not carry over these spatial reasoning abilities; the two-stage training approach effectively equips the distilled model with robust visual maze-solving skills. The authors state that further investigation is needed into whether more extensive GRPO training or modifications to the reward function could yield more substantial performance gains. They also suggest that future work would benefit from more nuanced evaluation metrics that assess the efficiency of the generated paths, the model's robustness to variations in maze complexity, and the interpretability of its internal reasoning process.
Conclusion
The paper explored a novel approach to teaching visual reasoning for maze navigation to standard LLMs, showing that SFT on tokenized visual inputs enables an LLM to solve mazes and that a subsequent GRPO stage, inspired by DeepSeek-R1, further refines those maze-solving capabilities while encouraging more sophisticated chain-of-thought reasoning, lifting MazeBench accuracy from 86% after SFT alone to 93% after SFT+GRPO. The authors suggest that tokenized visual representations, combined with training methodologies such as SFT and GRPO, offer a promising pathway for bridging the gap between LLMs and visual AI.