GRASSLAND: Dynamic Maze Benchmark

Updated 15 January 2026
  • GRASSLAND is a multimodal benchmark that evaluates dynamic spatial reasoning in a pixel-grid maze framework through two tasks, Maze Judgment and Maze Navigation.
  • It integrates visual and textual cues in evolving environments, requiring models to track agents and plan safe routes amid shifting hazards.
  • The benchmark employs rigorous evaluation protocols and highlights the benefits of Dynamic Draft-Augmented Reasoning (D2R) for improving navigation accuracy and decision-making.

GRASSLAND is a multimodal benchmark developed to evaluate dynamic spatial reasoning in multimodal large language models (MLLMs), with particular emphasis on evolving environments where obstacles change over time. It operates on a pixel-grid maze framework, requires the integration of spatial and temporal cues, and explicitly targets scenarios, such as robotic navigation and interactive gaming, where conventional static-image and text-only reasoning benchmarks are insufficient. GRASSLAND comprises two principal tasks: Maze Judgment (evaluating the ability to track an agent through a changing world) and Maze Navigation (testing safe path planning under dynamic hazards and constraints) (Ou et al., 22 May 2025).

1. Maze Architecture and Task Definition

GRASSLAND mazes are instantiated on discrete grids of fixed dimensions: 7×7 for Maze Judgment and 5×5 for Maze Navigation. Each maze contains a designated start position $p_s$ (white flag) and a goal $p_e$ (red flag). Grid cells are classified as Empty/Grass (traversable), Wall ($P_o$, impassable), Static Trap (water, $P_w$, fatal upon contact), or Dynamic Trap (lava, $P_l$, which shifts position each time step).

Time is quantized into $T$ frames per episode. At every step, the agent executes a single move (up/down/left/right), followed by dynamic lava traps shifting according to a predetermined pattern. Trap-agent collisions are terminal events, with the trap taking priority if simultaneous cell occupation occurs.
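These dynamics can be made concrete with a short sketch. The cell types, move set, move-then-shift ordering, and trap-priority rule follow the description above; the wrap-around shift pattern and the convention that blocked moves leave the agent in place are illustrative assumptions, since the paper only fixes lava motion to at most one cell per frame.

```python
from dataclasses import dataclass

WALL, WATER = "#", "~"                       # static cell markers; "." = grass
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

@dataclass
class Maze:
    grid: list[list[str]]                    # static layout: grass, walls, water
    lava: set[tuple[int, int]]               # dynamic trap cells at the current frame
    lava_shift: tuple[int, int] = (0, 1)     # assumed per-maze pattern (<= 1 cell/frame)

    def step(self, pos: tuple[int, int], action: str) -> tuple[tuple[int, int], str]:
        """Apply one agent move, then shift the lava; return (new_pos, status)."""
        dr, dc = MOVES[action]
        r, c = pos[0] + dr, pos[1] + dc
        n = len(self.grid)
        if not (0 <= r < n and 0 <= c < n) or self.grid[r][c] == WALL:
            r, c = pos                       # assumption: blocked moves stay in place
        if self.grid[r][c] == WATER:
            return (r, c), "fell_in_water"   # static trap: terminal on contact
        # Lava shifts after the agent moves; wrap-around is an assumed convention.
        self.lava = {((lr + self.lava_shift[0]) % n, (lc + self.lava_shift[1]) % n)
                     for lr, lc in self.lava}
        if (r, c) in self.lava:              # trap takes priority on shared cells
            return (r, c), "fell_in_lava"
        return (r, c), "alive"
```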

Maze Judgment task: Given the full $T$-step action sequence $R_\text{action} = \{r_1, \dots, r_T\}$ and the video of the dynamic maze, the model must predict the final outcome: success, fall into water, fall into lava, or safe but failed to reach the destination. Formally,

$$s_t = f(W, R_{\text{action}<t}, s_{<t}), \quad t = 1, \dots, T$$

where $s_t$ encodes the agent's location and state at time $t$. The final state $s_\text{end} = s_T$ is the judgment target.
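Given the step function sketched above, the judgment target reduces to replaying the action sequence and reporting the terminal state. A minimal sketch, reusing the hypothetical Maze class from Section 1:

```python
def judge(maze: "Maze", start: tuple[int, int], goal: tuple[int, int],
          actions: list[str]) -> str:
    """Replay R_action frame by frame and return the terminal outcome."""
    pos = start
    for action in actions:                   # rolls out s_t = f(W, R_action<t, s_<t)
        pos, status = maze.step(pos, action)
        if status != "alive":                # trap contact ends the episode
            return status
    if pos == goal:
        return "success"
    return "safe_but_not_at_goal"            # survived all T steps, missed p_e
```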

Maze Navigation task: The model must plan a safe route $R_\text{action}$ from $p_s$ to $p_e$ within a step limit $L$, strictly avoiding all traps ($P_D = P_l \cup P_w$). The path must satisfy

$$r_t, p_t = f(W, r_{t-1}, p_{t-1}), \quad t = 1, \dots, T$$

$$\forall t < T:\; p_t \notin P_D, \quad T \leq L$$

with success if $p_T = p_e$.
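Because the lava moves deterministically, a reference planner can search the time-expanded grid, where a state is a (cell, frame) pair rather than a cell alone. The sketch below assumes a per-maze oracle lava_at(t) returning the lava cells at frame t; the benchmark does not prescribe this planner, it merely illustrates why a cell that is safe now may be fatal one frame later.

```python
from collections import deque

def plan(n, walls, water, lava_at, start, goal, L):
    """BFS over (cell, frame) states; returns a shortest safe action list or None."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    queue = deque([(start, 0, [])])
    seen = {(start, 0)}
    while queue:
        (r, c), t, path = queue.popleft()
        if (r, c) == goal:
            return path                      # p_T == p_e reached within T <= L steps
        if t == L:
            continue                         # step limit L exhausted on this branch
        for name, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            if not (0 <= nr < n and 0 <= nc < n) or (nr, nc) in walls:
                continue
            if (nr, nc) in water or (nr, nc) in lava_at(t + 1):
                continue                     # enforce p_t not in P_D = P_w ∪ P_l
            if ((nr, nc), t + 1) not in seen:
                seen.add(((nr, nc), t + 1))
                queue.append(((nr, nc), t + 1, path + [name]))
    return None                              # no trap-free route within L steps
```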

2. Dataset Construction and Difficulty Regimes

GRASSLAND implements three complexity levels (easy, normal, hard) for each task, using hundreds of randomly generated mazes per level to ensure diversity. No train/validation/test splits are prescribed; all methods are compared on identical held-out sets.

Table 1. Summary Statistics for Maze Judgment and Maze Navigation

| Task | Level | Obstacles | Dynamic Traps | Static Traps | Avg. Route Length |
|---|---|---|---|---|---|
| Maze Judgment | easy | 0 | 2 | 0 | 5.32 |
| Maze Judgment | normal | 1 | 3 | 1 | 6.00 |
| Maze Judgment | hard | 2 | 4 | 2 | 5.67 |
| Maze Navigation | easy | 1 | 1 | 0–4 | 3.47 |
| Maze Navigation | normal | 2 | 2 | 0–4 | 3.75 |
| Maze Navigation | hard | 3 | 2 | 0–6 | 4.34 |

Lava traps execute per-maze shift patterns (up to one cell per frame); walls and water are static per episode.
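A plausible per-level sampler, using the Maze Navigation counts from Table 1; the uniform placement scheme and the candidate shift patterns are assumptions, not the paper's generator:

```python
import random

LEVELS = {  # Maze Navigation rows of Table 1: (walls, dynamic traps, max static traps)
    "easy":   (1, 1, 4),
    "normal": (2, 2, 4),
    "hard":   (3, 2, 6),
}

def sample_maze(level, n=5, seed=None):
    """Sample one n-by-n navigation maze at the given difficulty level."""
    rng = random.Random(seed)
    n_walls, n_lava, max_water = LEVELS[level]
    cells = [(r, c) for r in range(n) for c in range(n)]
    rng.shuffle(cells)
    start, goal = cells.pop(), cells.pop()
    walls = {cells.pop() for _ in range(n_walls)}
    lava = {cells.pop() for _ in range(n_lava)}
    water = {cells.pop() for _ in range(rng.randint(0, max_water))}
    # One predetermined shift per maze; (0, 0) models lava that pauses a frame.
    shift = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1), (0, 0)])
    return dict(start=start, goal=goal, walls=walls,
                lava=lava, water=water, lava_shift=shift)
```

A production generator would additionally reject mazes with no safe route (e.g., by running a planner such as the one sketched in Section 1) so that every navigation instance is solvable.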

3. Evaluation Protocol and Metrics

Performance is assessed via:

  • Accuracy ($S$), applied to both tasks:

$$S = \frac{\#\text{correct episodes}}{\#\text{total episodes}}$$

  • Path Efficiency ($E$; Maze Navigation only):

$$E = \frac{L^*}{L_{\text{agent}}}$$

where $L^*$ is the length of the shortest safe route and $L_{\text{agent}}$ the length of the model's reported solution.

  • Average steps per navigation episode: the mean length of the model's proposed routes, compared against the reference answers.
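Both metrics are straightforward to compute per episode; a minimal sketch:

```python
def accuracy(outcomes):
    """S = #correct episodes / #total episodes (both tasks)."""
    return sum(outcomes) / len(outcomes)

def path_efficiency(shortest_len, agent_len):
    """E = L* / L_agent; equals 1.0 when the model's route is optimal."""
    return shortest_len / agent_len

print(accuracy([True, True, False]))   # 0.666...
print(path_efficiency(4, 6))           # 0.666...
```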

Judgment outputs are exact labels, while navigation outputs are action sequences; in both tasks, the dynamic hazards require prospective world-modeling at every temporal frame.

4. Baseline Performance and Reasoning Patterns

Direct prompting of seven contemporary MLLMs on GRASSLAND (hard level) yields low accuracy, especially for Maze Navigation.

Table 2. Direct-Prompt Accuracy (%) for Maze Judgment and Navigation (Hard)

| Model | Judgment | Navigation |
|---|---|---|
| VideoLLaMA3-7B | 11.0 | 0.0 |
| Qwen2.5VL-7B | 28.5 | 1.0 |
| InternVL2.5-8B | 19.5 | 0.5 |
| Qwen2.5VL-32B | 9.0 | 0.0 |
| InternVL2.5-38B | 25.0 | 3.5 |
| Qwen2.5VL-72B | 19.0 | 6.5 |
| QwenVL-Max | 14.0 | 1.5 |

Reasoning pattern ablations indicate that:

  • Chain-of-Thought (CoT) and 1-shot CoT provide only marginal improvements, and sometimes hurt performance, on judgment tasks.
  • Vision-Augmented Prompting (VAP) may degrade accuracy if the visual hints are noisy.
  • Draft CoT (GT), which manually overlays the ground-truth path, produces substantial performance gains.

5. Dynamic Draft-Augmented Reasoning (D2R)

D2R is a training-free framework that fuses textual chains-of-thought with dynamic visual drafts overlaid stepwise onto the input frames. This bimodal draft keeps the model's symbolic and perceptual world models updated in tandem throughout the reasoning trajectory.
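Schematically, one D2R iteration overlays the current draft onto the frame and prompts the model with the annotated image plus the accumulated textual chain-of-thought. In the sketch below, query_mllm is a hypothetical stand-in for the model API, and the 32-pixel cell size is an assumed rendering detail:

```python
from PIL import Image, ImageDraw

CELL = 32  # assumed pixel size of one grid cell in the rendered frame

def overlay_draft(frame, pos):
    """Visual draft: mark the agent's current cell directly on the frame."""
    annotated = frame.copy()
    draw = ImageDraw.Draw(annotated)
    r, c = pos
    draw.rectangle([c * CELL, r * CELL, (c + 1) * CELL, (r + 1) * CELL],
                   outline="black", width=3)
    return annotated

def d2r_step(frame, pos, textual_draft, query_mllm):
    """One D2R iteration: bimodal draft in, next action plus rationale out."""
    annotated = overlay_draft(frame, pos)
    prompt = ("Reasoning so far:\n" + "\n".join(textual_draft) +
              "\nGiven the marked agent position, choose the next safe move.")
    reply = query_mllm(image=annotated, prompt=prompt)  # hypothetical model API
    textual_draft.append(reply)                         # extend the textual draft
    return reply
```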

Table 3. Maze Judgment Accuracy Across Reasoning Methods (Qwen2.5VL-7B/-72B, QwenVL-Max)

| Model | Method | Easy | Normal | Hard | Avg |
|---|---|---|---|---|---|
| Qwen2.5VL-7B | Direct | 22.5 | 34.0 | 28.5 | 28.3 |
| Qwen2.5VL-7B | CoT | 18.0 | 29.0 | 26.5 | 24.5 |
| Qwen2.5VL-7B | 1-shot | 18.0 | 20.5 | 17.0 | 18.5 |
| Qwen2.5VL-7B | VAP | 13.5 | 15.0 | 20.0 | 16.2 |
| Qwen2.5VL-7B | D2R | 34.0 | 46.0 | 28.0 | 36.0 |
| Qwen2.5VL-72B | Direct | 61.0 | 38.5 | 19.0 | 39.5 |
| Qwen2.5VL-72B | CoT | 67.0 | 40.0 | 23.0 | 43.3 |
| Qwen2.5VL-72B | 1-shot | 71.0 | 46.5 | 25.5 | 47.7 |
| Qwen2.5VL-72B | VAP | 15.5 | 20.0 | 15.0 | 16.8 |
| Qwen2.5VL-72B | D2R | 67.0 | 49.0 | 41.0 | 52.3 |
| QwenVL-Max | Direct | 40.0 | 21.5 | 14.0 | 25.2 |
| QwenVL-Max | CoT | 36.0 | 24.0 | 11.5 | 23.8 |
| QwenVL-Max | 1-shot | 18.0 | 17.0 | 9.5 | 14.8 |
| QwenVL-Max | VAP | 15.0 | 9.0 | 13.0 | 12.3 |
| QwenVL-Max | D2R | 46.5 | 35.5 | 28.0 | 36.7 |

Ablation studies confirm that omitting either the textual reasoning or the visual drafts from D2R sharply reduces accuracy, with the removal of the visual drafts having the most pronounced effect.

Furthermore, D2R approaches the performance of an oracle system using ground-truth overlays (Draft CoT (GT)), as shown by close accuracy margins for hard Maze Judgment.

6. Illustrative Reasoning Episodes in Dynamic Mazes

Dynamic sequence illustrations exemplify the bimodal reasoning strength of D2R:

  • Maze Judgment: For each frame, a black square overlay marks the agent’s evolving position, with a textual rationale ("Next action Go right → target cell is grass → safe") updating at every step. Upon detection of a fatal move ("Go up → target is water → fail"), the MLLM halts and outputs the appropriate terminal judgment.
  • Maze Navigation: For every iteration, trap positions are overlaid (e.g., green arrows for intended agent movement) and textually described ("Safe move: Go right"). This systematic draft chain-of-thought allows cumulative planning that adapts to the time-varying hazard topology; a minimal sketch of the per-step rationale pattern follows the list.
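A compact sketch of that per-step rationale pattern, where classifying the target cell is the only logic; the wording mirrors the quoted draft sentences, and the function itself is illustrative:

```python
def rationale(move, target_cell):
    """Return (draft sentence, is_safe) for one proposed move."""
    verdict = {"grass": ("safe", True),
               "water": ("fail", False),
               "lava":  ("fail", False),
               "wall":  ("blocked", False)}[target_cell]
    sentence = f"Next action {move} → target cell is {target_cell} → {verdict[0]}"
    return sentence, verdict[1]

print(rationale("Go right", "grass")[0])  # Next action Go right → target cell is grass → safe
```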

The GRASSLAND benchmark thus demonstrates that combining visual and textual chains-of-thought is essential for robust dynamic spatial reasoning in evolving environments, significantly elevating MLLM performance over direct prompting and unimodal reasoning approaches (Ou et al., 22 May 2025).

7. Significance, Implications, and Benchmark Utility

GRASSLAND fills a distinct gap in multimodal benchmark design by introducing dynamic environmental changes as a first-class challenge for spatial reasoning. A plausible implication is that static-image and text-only benchmarks are inadequate for evaluating real-world agents in environments where topology changes as computation proceeds. By establishing robust baselines and demonstrating substantive accuracy improvements via D2R, GRASSLAND enables systematic evaluation of multi-turn, temporally-aware reasoning strategies for navigation and judgment. This suggests that future research into multimodal, temporally-coherent reasoning architectures may draw substantially from D2R and GRASSLAND methodologies. The project’s open-source implementation facilitates reproducible research and comparative evaluation across models, architectures, and prompting paradigms.

References

  • Ou et al., 22 May 2025.
