GRASSLAND: Dynamic Maze Benchmark

Updated 15 January 2026
  • GRASSLAND is a multimodal benchmark that evaluates dynamic spatial reasoning in a pixel-grid maze framework through two tasks, Maze Judgment and Maze Navigation.
  • It integrates visual and textual cues in evolving environments, requiring models to track agents and plan safe routes amid shifting hazards.
  • The benchmark employs rigorous evaluation protocols and highlights the benefits of Dynamic Draft-Augmented Reasoning (D2R) for improving navigation accuracy and decision-making.

GRASSLAND is a multimodal benchmark developed to evaluate dynamic spatial reasoning in multimodal large language models (MLLMs), with particular emphasis on evolving environments where obstacles change over time. It operates on a pixel-grid maze framework, requires the integration of spatial and temporal cues, and explicitly targets scenarios, such as robotic navigation and interactive gaming, where conventional static-image and text-only reasoning benchmarks are insufficient. GRASSLAND comprises two principal tasks: Maze Judgment (evaluating the ability to track an agent through a changing world) and Maze Navigation (testing safe path planning under dynamic hazards and constraints) (Ou et al., 22 May 2025).

1. Maze Architecture and Task Definition

GRASSLAND mazes are instantiated on discrete grids of fixed dimensions: 7×7 for Maze Judgment and 5×5 for Maze Navigation. Each maze contains a designated start position $p_s$ (white flag) and a goal $p_e$ (red flag). Grid cells are classified as Empty/Grass (traversable), Wall ($P_o$, impassable), Static Trap (water, $P_w$, fatal upon contact), or Dynamic Trap (lava, $P_l$, which shifts position each time step).

Time is quantized into $T$ frames per episode. At every step, the agent executes a single move (up/down/left/right), followed by dynamic lava traps shifting according to a predetermined pattern. Trap-agent collisions are terminal events, with the trap taking priority if simultaneous cell occupation occurs.
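These dynamics can be made concrete with a short sketch. The cell types, move set, move-then-shift ordering, and trap-priority rule follow the description above; the wrap-around shift pattern and the convention that blocked moves leave the agent in place are illustrative assumptions, since the paper only fixes lava motion to at most one cell per frame.

```python
from dataclasses import dataclass

WALL, WATER = "#", "~"                       # static cell markers; "." = grass
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

@dataclass
class Maze:
    grid: list[list[str]]                    # static layout: grass, walls, water
    lava: set[tuple[int, int]]               # dynamic trap cells at the current frame
    lava_shift: tuple[int, int] = (0, 1)     # assumed per-maze pattern (<= 1 cell/frame)

    def step(self, pos: tuple[int, int], action: str) -> tuple[tuple[int, int], str]:
        """Apply one agent move, then shift the lava; return (new_pos, status)."""
        dr, dc = MOVES[action]
        r, c = pos[0] + dr, pos[1] + dc
        n = len(self.grid)
        if not (0 <= r < n and 0 <= c < n) or self.grid[r][c] == WALL:
            r, c = pos                       # assumption: blocked moves stay in place
        if self.grid[r][c] == WATER:
            return (r, c), "fell_in_water"   # static trap: terminal on contact
        # Lava shifts after the agent moves; wrap-around is an assumed convention.
        self.lava = {((lr + self.lava_shift[0]) % n, (lc + self.lava_shift[1]) % n)
                     for lr, lc in self.lava}
        if (r, c) in self.lava:              # trap takes priority on shared cells
            return (r, c), "fell_in_lava"
        return (r, c), "alive"
```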

Maze Judgment task: Given the full $T$-step action sequence $R_\text{action} = \{r_1, \dots, r_T\}$ and the video of the dynamic maze, the model must predict the final outcome: success, fall into water, fall into lava, or safe but failed to reach the destination. Formally,

$$s_t = f(W, R_{\text{action}<t}, s_{<t}), \quad t = 1, \dots, T$$

where $s_t$ encodes the agent's location and state at time $t$. The final state $s_\text{end} = s_T$ is the judgment target.
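Given the step function sketched above, the judgment target reduces to replaying the action sequence and reporting the terminal state. A minimal sketch, reusing the hypothetical Maze class from Section 1:

```python
def judge(maze: "Maze", start: tuple[int, int], goal: tuple[int, int],
          actions: list[str]) -> str:
    """Replay R_action frame by frame and return the terminal outcome."""
    pos = start
    for action in actions:                   # rolls out s_t = f(W, R_action<t, s_<t)
        pos, status = maze.step(pos, action)
        if status != "alive":                # trap contact ends the episode
            return status
    if pos == goal:
        return "success"
    return "safe_but_not_at_goal"            # survived all T steps, missed p_e
```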

Maze Navigation task: The model must plan a safe route $R_\text{action}$ from $p_s$ to $p_e$ within a step limit $L$, strictly avoiding all traps ($P_D = P_l \cup P_w$). The path must satisfy

$$r_t, p_t = f(W, r_{t-1}, p_{t-1}), \quad t = 1, \dots, T$$

$$\forall t < T:\; p_t \notin P_D, \quad T \leq L$$

with success if $p_T = p_e$.
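Because the lava moves deterministically, a reference planner can search the time-expanded grid, where a state is a (cell, frame) pair rather than a cell alone. The sketch below assumes a per-maze oracle lava_at(t) returning the lava cells at frame t; the benchmark does not prescribe this planner, it merely illustrates why a cell that is safe now may be fatal one frame later.

```python
from collections import deque

def plan(n, walls, water, lava_at, start, goal, L):
    """BFS over (cell, frame) states; returns a shortest safe action list or None."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    queue = deque([(start, 0, [])])
    seen = {(start, 0)}
    while queue:
        (r, c), t, path = queue.popleft()
        if (r, c) == goal:
            return path                      # p_T == p_e reached within T <= L steps
        if t == L:
            continue                         # step limit L exhausted on this branch
        for name, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n and 0 <= nc < n) or (nr, nc) in walls:
                continue
            if (nr, nc) in water or (nr, nc) in lava_at(t + 1):
                continue                     # enforce p_t not in P_D = P_w ∪ P_l
            if ((nr, nc), t + 1) not in seen:
                seen.add(((nr, nc), t + 1))
                queue.append(((nr, nc), t + 1, path + [name]))
    return None                              # no trap-free route within L steps
```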

2. Dataset Construction and Difficulty Regimes

GRASSLAND implements three complexity levels (easy, normal, hard) for each task, using hundreds of randomly generated mazes per level to ensure diversity. No train/validation/test splits are prescribed; all methods are compared on identical held-out sets.

Table 1. Summary Statistics for Maze Judgment and Maze Navigation

| Task | Level | Obstacles | Dynamic Traps | Static Traps | Avg. Route Length |
|---|---|---|---|---|---|
| Maze Judgment | easy | 0 | 2 | 0 | 5.32 |
| Maze Judgment | normal | 1 | 3 | 1 | 6.00 |
| Maze Judgment | hard | 2 | 4 | 2 | 5.67 |
| Maze Navigation | easy | 1 | 1 | 0–4 | 3.47 |
| Maze Navigation | normal | 2 | 2 | 0–4 | 3.75 |
| Maze Navigation | hard | 3 | 2 | 0–6 | 4.34 |

Lava traps execute per-maze shift patterns (up to one cell per frame); walls and water are static per episode.
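A plausible per-level sampler, using the Maze Navigation counts from Table 1; the uniform placement scheme and the candidate shift patterns are assumptions, not the paper's generator:

```python
import random

LEVELS = {  # Maze Navigation rows of Table 1: (walls, dynamic traps, max static traps)
    "easy":   (1, 1, 4),
    "normal": (2, 2, 4),
    "hard":   (3, 2, 6),
}

def sample_maze(level, n=5, seed=None):
    """Sample one n-by-n navigation maze at the given difficulty level."""
    rng = random.Random(seed)
    n_walls, n_lava, max_water = LEVELS[level]
    cells = [(r, c) for r in range(n) for c in range(n)]
    rng.shuffle(cells)
    start, goal = cells.pop(), cells.pop()
    walls = {cells.pop() for _ in range(n_walls)}
    lava = {cells.pop() for _ in range(n_lava)}
    water = {cells.pop() for _ in range(rng.randint(0, max_water))}
    # One predetermined shift per maze; (0, 0) models lava that pauses a frame.
    shift = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)])
    return dict(start=start, goal=goal, walls=walls,
                lava=lava, water=water, lava_shift=shift)
```

A production generator would additionally reject mazes with no safe route (e.g., by running a planner such as the one sketched in Section 1) so that every navigation instance is solvable.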

3. Evaluation Protocol and Metrics

Performance is assessed via:

  • Accuracy ($S$), applied to both tasks:

$$S = \frac{\#\text{correct episodes}}{\#\text{total episodes}}$$

  • Path Efficiency ($E$; Maze Navigation only):

$$E = \frac{L^*}{L_{\text{agent}}}$$

where $L^*$ is the length of the shortest safe route and $L_{\text{agent}}$ the length of the model's reported solution.

  • Average steps per navigation episode: the mean length of the model's proposed routes, compared against the reference answers.
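Both metrics are straightforward to compute per episode; a minimal sketch:

```python
def accuracy(outcomes):
    """S = #correct episodes / #total episodes (both tasks)."""
    return sum(outcomes) / len(outcomes)

def path_efficiency(shortest_len, agent_len):
    """E = L* / L_agent; equals 1.0 when the model's route is optimal."""
    return shortest_len / agent_len

print(accuracy([True, True, False]))   # 0.666...
print(path_efficiency(4, 6))           # 0.666...
```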

Judgment outputs are exact labels, while navigation outputs are action sequences; in both tasks, the dynamic hazards require prospective world-modeling at every temporal frame.

4. Baseline Performance and Reasoning Patterns

Direct prompting of seven contemporary MLLMs on GRASSLAND (hard level) yields low accuracy, especially for Maze Navigation.

Table 2. Direct-Prompt Accuracy (%) for Maze Judgment and Navigation (Hard)

| Model | Judgment | Navigation |
|---|---|---|
| VideoLLaMA3-7B | 11.0 | 0.0 |
| Qwen2.5VL-7B | 28.5 | 1.0 |
| InternVL2.5-8B | 19.5 | 0.5 |
| Qwen2.5VL-32B | 9.0 | 0.0 |
| InternVL2.5-38B | 25.0 | 3.5 |
| Qwen2.5VL-72B | 19.0 | 6.5 |
| QwenVL-Max | 14.0 | 1.5 |

Reasoning pattern ablations indicate that:

  • Chain-of-Thought (CoT) and 1-shot CoT provide only marginal improvements, and sometimes hurt performance, on judgment tasks.
  • Vision-Augmented Prompting (VAP) may degrade accuracy if the visual hints are noisy.
  • Draft CoT (GT), which manually overlays the ground-truth path, produces substantial performance gains.

5. Dynamic Draft-Augmented Reasoning (D2R)

D2R is a training-free framework that fuses textual chains-of-thought with dynamic visual drafts overlaid stepwise onto the input frames. This bimodal draft keeps the model's symbolic and perceptual world models updated in tandem throughout the reasoning trajectory.
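Schematically, one D2R iteration overlays the current draft onto the frame and prompts the model with the annotated image plus the accumulated textual chain-of-thought. In the sketch below, query_mllm is a hypothetical stand-in for the model API, and the 32-pixel cell size is an assumed rendering detail:

```python
from PIL import Image, ImageDraw

CELL = 32  # assumed pixel size of one grid cell in the rendered frame

def overlay_draft(frame, pos):
    """Visual draft: mark the agent's current cell directly on the frame."""
    annotated = frame.copy()
    draw = ImageDraw.Draw(annotated)
    r, c = pos
    draw.rectangle([c * CELL, r * CELL, (c + 1) * CELL, (r + 1) * CELL],
                   outline="black", width=3)
    return annotated

def d2r_step(frame, pos, textual_draft, query_mllm):
    """One D2R iteration: bimodal draft in, next action plus rationale out."""
    annotated = overlay_draft(frame, pos)
    prompt = ("Reasoning so far:\n" + "\n".join(textual_draft) +
              "\nGiven the marked agent position, choose the next safe move.")
    reply = query_mllm(image=annotated, prompt=prompt)  # hypothetical model API
    textual_draft.append(reply)                         # extend the textual draft
    return reply
```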

Table 3. Maze Judgment Accuracy Across Reasoning Methods (Qwen2.5VL-7B/-72B, QwenVL-Max)

| Model | Method | Easy | Normal | Hard | Avg |
|---|---|---|---|---|---|
| Qwen2.5VL-7B | Direct | 22.5 | 34.0 | 28.5 | 28.3 |
| Qwen2.5VL-7B | CoT | 18.0 | 29.0 | 26.5 | 24.5 |
| Qwen2.5VL-7B | 1-shot | 18.0 | 20.5 | 17.0 | 18.5 |
| Qwen2.5VL-7B | VAP | 13.5 | 15.0 | 20.0 | 16.2 |
| Qwen2.5VL-7B | D2R | 34.0 | 46.0 | 28.0 | 36.0 |
| Qwen2.5VL-72B | Direct | 61.0 | 38.5 | 19.0 | 39.5 |
| Qwen2.5VL-72B | CoT | 67.0 | 40.0 | 23.0 | 43.3 |
| Qwen2.5VL-72B | 1-shot | 71.0 | 46.5 | 25.5 | 47.7 |
| Qwen2.5VL-72B | VAP | 15.5 | 20.0 | 15.0 | 16.8 |
| Qwen2.5VL-72B | D2R | 67.0 | 49.0 | 41.0 | 52.3 |
| QwenVL-Max | Direct | 40.0 | 21.5 | 14.0 | 25.2 |
| QwenVL-Max | CoT | 36.0 | 24.0 | 11.5 | 23.8 |
| QwenVL-Max | 1-shot | 18.0 | 17.0 | 9.5 | 14.8 |
| QwenVL-Max | VAP | 15.0 | 9.0 | 13.0 | 12.3 |
| QwenVL-Max | D2R | 46.5 | 35.5 | 28.0 | 36.7 |

Ablation studies confirm that omitting either the textual reasoning or the visual drafts from D2R sharply reduces accuracy, with the removal of the visual drafts having the most pronounced effect.

Furthermore, D2R approaches the performance of an oracle system using ground-truth overlays (Draft CoT (GT)), as shown by close accuracy margins for hard Maze Judgment.

6. Illustrative Reasoning Episodes in Dynamic Mazes

Dynamic sequence illustrations exemplify the bimodal reasoning strength of D2R:

  • Maze Judgment: For each frame, a black square overlay marks the agent’s evolving position, with a textual rationale ("Next action Go right → target cell is grass → safe") updating at every step. Upon detection of a fatal move ("Go up → target is water → fail"), the MLLM halts and outputs the appropriate terminal judgment.
  • Maze Navigation: For every iteration, trap positions are overlaid (e.g., green arrows for intended agent movement) and textually described ("Safe move: Go right"). This systematic draft chain-of-thought allows cumulative planning that adapts to the time-varying hazard topology; a minimal sketch of the per-step rationale pattern follows the list.
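A compact sketch of that per-step rationale pattern, where classifying the target cell is the only logic; the wording mirrors the quoted draft sentences, and the function itself is illustrative:

```python
def rationale(move, target_cell):
    """Return (draft sentence, is_safe) for one proposed move."""
    verdict = {"grass": ("safe", True),
               "water": ("fail", False),
               "lava":  ("fail", False),
               "wall":  ("blocked", False)}[target_cell]
    sentence = f"Next action {move} → target cell is {target_cell} → {verdict[0]}"
    return sentence, verdict[1]

print(rationale("Go right", "grass")[0])  # Next action Go right → target cell is grass → safe
```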

The GRASSLAND benchmark thus demonstrates that combining visual and textual chains-of-thought is essential for robust dynamic spatial reasoning in evolving environments, significantly elevating MLLM performance over direct prompting and unimodal reasoning approaches (Ou et al., 22 May 2025).

7. Significance, Implications, and Benchmark Utility

GRASSLAND fills a distinct gap in multimodal benchmark design by introducing dynamic environmental changes as a first-class challenge for spatial reasoning. A plausible implication is that static-image and text-only benchmarks are inadequate for evaluating real-world agents in environments where topology changes as computation proceeds. By establishing robust baselines and demonstrating substantive accuracy improvements via D2R, GRASSLAND enables systematic evaluation of multi-turn, temporally-aware reasoning strategies for navigation and judgment. This suggests that future research into multimodal, temporally-coherent reasoning architectures may draw substantially from D2R and GRASSLAND methodologies. The project’s open-source implementation facilitates reproducible research and comparative evaluation across models, architectures, and prompting paradigms.

References

  • Ou et al., 22 May 2025.
