Collaborative Maze-Solving Benchmark
- The paper establishes a controlled grid-based framework that assesses multi-agent coordination via defined action spaces and communication protocols.
- It modulates complexity through parameters like maze size, wall density, and path length while enforcing collision and legality rules.
- Evaluation protocols use metrics such as success rate, exploration cost, and communication efficiency to benchmark diverse algorithmic strategies.
A collaborative maze-solving benchmark is a controlled empirical framework for evaluating the ability of multiple agents—human, AI, or robotic—to coordinate their actions, share information, and achieve goals in structured maze environments under varying constraints. Such benchmarks underpin comparative research in agent collaboration, multi-robot systems, distributed planning, and collective intelligence by exposing the unique challenges of partial observability, heterogeneity, and communication.
1. Formal Definitions and Problem Structures
Collaborative maze-solving benchmarks operationalize agent cooperation through tightly-specified grid-based environments, well-defined agent action spaces, and explicit communication or observation rules.
A prototypical instance draws a maze from a parameterized family $\mathcal{M}(n, \rho, \ell)$, where $n$ governs the grid size ($n \times n$ cells), $\rho$ the wall density, and $\ell$ the average path length between designated start and goal cells. For example, the benchmark in "The Collaboration Gap" (Davidson et al., 4 Nov 2025) fixes the grid size and wall density and restricts optimal path lengths to at most $9$ moves in its main experiments.
Distributed observability is fundamental: for agent-based collaboration, the benchmark typically decomposes the maze into non-overlapping, partially obfuscated local maps $M_1, \dots, M_k$, leaving some cells hidden from every agent. Each agent $i$ observes only its own $M_i$ and perhaps local state (e.g., its current position). In multi-robot benchmarks, maps are unknown to all agents and are incrementally constructed through exploration, observation sharing, or frontiers detected within sensor range (Linardakis et al., 2024).
Agent actions are defined via a discrete set $\mathcal{A}$ (e.g., the four cardinal moves plus a stay/wait action), with transitions executed jointly or synchronously depending on protocol. Benchmarks generally enforce collision and legality constraints.
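The sketch below illustrates these definitional elements (seeded parameterized generation, a discrete action set, and joint legality/collision rules) in Python. Names such as `MazeConfig` and `CollabMazeEnv`, and all default values, are illustrative assumptions rather than the API of any cited benchmark; swap conflicts are deliberately ignored.

```python
# Minimal sketch of a parameterized collaborative-maze environment.
# Illustrative only: not the implementation of any benchmark cited above.
import random
from dataclasses import dataclass

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

@dataclass
class MazeConfig:
    n: int = 5                 # grid is n x n
    wall_density: float = 0.2  # probability that a cell is a wall
    num_agents: int = 2
    seed: int = 0

class CollabMazeEnv:
    """Synchronous joint moves with legality and vertex-collision checks
    (swap conflicts are ignored in this sketch)."""

    def __init__(self, cfg: MazeConfig):
        rng = random.Random(cfg.seed)
        self.n = cfg.n
        self.walls = [[rng.random() < cfg.wall_density for _ in range(cfg.n)]
                      for _ in range(cfg.n)]
        free = [(r, c) for r in range(cfg.n) for c in range(cfg.n) if not self.walls[r][c]]
        self.positions = rng.sample(free, cfg.num_agents)  # distinct free start cells
        self.goal = rng.choice(free)

    def legal(self, pos, action):
        r, c = pos[0] + ACTIONS[action][0], pos[1] + ACTIONS[action][1]
        return 0 <= r < self.n and 0 <= c < self.n and not self.walls[r][c]

    def step(self, joint_action):
        # Illegal moves resolve to "stay".
        proposed = [
            (p[0] + ACTIONS[a][0], p[1] + ACTIONS[a][1]) if self.legal(p, a) else p
            for p, a in zip(self.positions, joint_action)
        ]
        # Vertex-collision rule: all agents targeting the same cell stay put.
        targets = list(proposed)
        for i, t in enumerate(targets):
            if targets.count(t) > 1:
                proposed[i] = self.positions[i]
        self.positions = proposed
        return self.positions
```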
2. Benchmark Design Principles and Complexity Modulation
Collaborative benchmarks must isolate joint capabilities, support fine-grained complexity control, and enable scalability.
Key design levers:
- Maze Size and Structure: The grid dimension $n$ can range from small mazes to very large maps, as in POGEMA (Skrynnik et al., 2024), influencing memory and planning demands.
- Obstacle/Wall Density: Varying the wall density $\rho$ modifies pathfinding difficulty and interaction frequency.
- Path Length: The average/optimal start-to-goal distance $\ell$ controls the sequentiality of the task and the potential for deadlock or communication breakdown.
- Partial Observability: Imposed through map obfuscation or egocentric sensing within a limited sensing radius.
- Agent Heterogeneity: Explicitly tested by benchmarking both homogeneous (identical models) and heterogeneous agent teams with varying privileges or types (Davidson et al., 4 Nov 2025).
Benchmarks like POGEMA (Skrynnik et al., 2024) and Linardakis et al. (Linardakis et al., 2024) supplement grid randomization with reproducible map seeds, different maze-generation algorithms (random, wall-based, warehouse, or public benchmarks), and variations in initial and goal placement. Agent counts are varied to probe scalability and congestion.
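To make the complexity levers concrete, the sketch below generates seeded maze instances across grid sizes and wall densities and keeps only those whose optimal start-to-goal path length (computed by breadth-first search) falls in a target range. Helper names, corner start/goal placement, and parameter values are illustrative assumptions, not those of the cited benchmarks.

```python
# Illustrative sweep over the complexity levers: grid size n, wall density rho,
# and optimal path length, with seeded generation for reproducibility.
from collections import deque
import random

def generate_maze(n, density, seed):
    rng = random.Random(seed)
    return [[rng.random() < density for _ in range(n)] for _ in range(n)]

def shortest_path_len(walls, start, goal):
    """Breadth-first search; returns None if the goal is unreachable."""
    n = len(walls)
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        (r, c), d = queue.popleft()
        if (r, c) == goal:
            return d
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and not walls[nr][nc] and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), d + 1))
    return None

def sample_instances(sizes, densities, path_range, seeds_per_cell=10):
    """Keep only seeded instances whose optimal path length falls in path_range."""
    kept = []
    for n in sizes:
        for rho in densities:
            for seed in range(seeds_per_cell):
                walls = generate_maze(n, rho, seed)
                start, goal = (0, 0), (n - 1, n - 1)   # corner placement, for illustration
                if walls[0][0] or walls[n - 1][n - 1]:
                    continue
                d = shortest_path_len(walls, start, goal)
                if d is not None and path_range[0] <= d <= path_range[1]:
                    kept.append({"n": n, "rho": rho, "seed": seed, "opt_len": d})
    return kept

instances = sample_instances(sizes=[5, 7], densities=[0.1, 0.2, 0.3], path_range=(4, 9))
```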
3. Evaluation Protocols and Automated Grading
Collaborative maze-solving evaluation protocols establish strict, automated statistical procedures for reproducibility and comparability.
Primary Protocol Features:
- Solo vs. Collaborative Baselines: Models are run in solo (full or distributed view), homogeneous, and heterogeneous (cross-model) pairings. For example, "The Collaboration Gap" (Davidson et al., 4 Nov 2025) evaluates 32 models in all configurations.
- Rollout Limits and Dialogue Rules: Turn/step counts are capped (e.g., 50-move limit). Communication is typically unconstrained in format but must satisfy minimal protocol rules (agreement required for move execution, single move at a time).
- Transcript Parsing: Resulting agent dialogues/transcripts are parsed by a grader agent, which extracts move sequences for outcome evaluation (Davidson et al., 4 Nov 2025).
- Performance Metrics:
- Binary route validity: whether the parsed move sequence is legal and reaches the goal.
- Weighted outcome: a graded score that discounts the remaining Manhattan distance to the goal after rollout relative to the shortest possible path length.
- Collaboration efficiency, measured as the product of median message length and number of communication rounds (sketched below).
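A minimal sketch of these outcome metrics follows, under the assumptions that parsed moves are row/column offsets and that each message corresponds to one communication round; the exact weighting used in the cited benchmark may differ from this plausible formulation.

```python
# Hedged sketch of the grading metrics above; the precise scoring of the
# cited benchmark is not reproduced here.
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def route_valid(walls, start, goal, moves):
    """Binary validity: replay the parsed move sequence and check that every
    step is legal and the final cell is the goal."""
    n = len(walls)
    pos = start
    for dr, dc in moves:                       # moves given as (drow, dcol) offsets
        r, c = pos[0] + dr, pos[1] + dc
        if not (0 <= r < n and 0 <= c < n) or walls[r][c]:
            return 0
        pos = (r, c)
    return int(pos == goal)

def weighted_outcome(final_pos, goal, opt_len):
    """1.0 at the goal, decaying with the remaining Manhattan distance,
    normalized by the optimal path length (one plausible weighting)."""
    return max(0.0, 1.0 - manhattan(final_pos, goal) / opt_len)

def collab_efficiency(messages):
    """Product of the median message length and the number of rounds."""
    lengths = sorted(len(m) for m in messages)
    median = lengths[len(lengths) // 2]
    return median * len(messages)
```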
POGEMA (Skrynnik et al., 2024) and Linardakis et al. (Linardakis et al., 2024) define additional metrics relevant for cooperative pathfinding and exploration:
| Metric | Definition (abridged) | Context |
|---|---|---|
| Success Rate (SR) | Fraction of runs where all agents reach goals | (Skrynnik et al., 2024) |
| Sum-of-Costs (SoC) | $\sum_i c_i$, where $c_i$ = steps for agent $i$ to reach its goal | (Skrynnik et al., 2024) |
| Makespan (MS) | $\max_i c_i$ | (Skrynnik et al., 2024) |
| Exploration Rounds (R) | Rounds until full map coverage | (Linardakis et al., 2024) |
| Exploration Cost | Total steps summed over all agents during exploration | (Linardakis et al., 2024) |
| Efficiency (Eff) | Free cells explored per unit of exploration cost | (Linardakis et al., 2024) |
| Congestion | Agent-to-map density ratio | (Skrynnik et al., 2024) |
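Assuming per-agent step counts and exploration logs are available, the tabulated metrics reduce to a few lines of code; the field names below are illustrative, not the loggers of the cited frameworks.

```python
# Sketch of the tabulated MAPF/exploration metrics under assumed log formats.
def success_rate(runs):
    """Fraction of runs in which every agent reached its goal.
    Each run is assumed to carry a list of per-agent 'reached' flags."""
    return sum(all(r["reached"]) for r in runs) / len(runs)

def sum_of_costs(steps_per_agent):
    """SoC: total steps summed over agents."""
    return sum(steps_per_agent)

def makespan(steps_per_agent):
    """MS: steps taken by the slowest agent."""
    return max(steps_per_agent)

def exploration_efficiency(free_cells_explored, total_cost):
    """Free cells discovered per unit of exploration cost."""
    return free_cells_explored / total_cost

def congestion(num_agents, num_free_cells):
    """Agent-to-map density ratio."""
    return num_agents / num_free_cells
```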
Statistical analysis involves mean, standard deviation, bootstrapped CIs, and paired or nonparametric significance tests.
4. Core Empirical Insights and Identified Gaps
Systematic benchmarking exposes distinct collaboration failure modes and sharp gaps between solo and joint performance.
Collaboration Gap: In "The Collaboration Gap" (Davidson et al., 4 Nov 2025), almost all LLMs display a "collaboration gap"—performance drops sharply when moving from solo to joint (homogeneous or heterogeneous) settings. GPT-4.1's weighted outcome, for example, falls substantially in this transition, and distilled/mini models often degrade to near-zero, highlighting a lack of robustness in distributed protocol induction.
Ordering and Relay Effects: The order of agent turns is highly consequential. Stronger-first ordering (the strong agent primes the dialogue before handing over) improves outcomes by up to 20 points in weighted outcome. The "relay inference" protocol—freezing turns from a strong agent before switching to a weaker agent—substantially closes the gap, with as few as two priming turns yielding major gains.
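A hedged sketch of a relay-style dialogue loop follows: one role is played by a strong model for the first few turns before being handed, with the accumulated transcript, to a weaker model. The agent callables, turn budget, and termination marker are assumptions for illustration, not the cited paper's implementation.

```python
# Illustrative relay-inference loop; models are callables mapping a transcript
# (list of prior messages) to the next message string.
def relay_dialogue(strong_model, weak_model, partner_model, priming_turns=2, max_turns=50):
    """One maze role is 'relayed': the strong model plays its first few turns,
    then the weak model takes over the same role and transcript."""
    transcript = []
    for turn in range(max_turns):
        if turn % 2 == 0:                      # the relayed role's turn
            model = strong_model if turn // 2 < priming_turns else weak_model
        else:                                  # the partner's turn
            model = partner_model
        message = model(transcript)
        transcript.append(message)
        if "FINAL PLAN" in message.upper():    # assumed termination marker
            break
    return transcript
```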
Grounding and Theory of Mind Failures: Benchmarks reveal persistent errors in coordinate axis alignment, agreement on map schemas, and misinterpretation of partner intentions. Stronger models tend to propose explicit negotiation protocols (e.g., agreeing on coordinate origins) but weaker models omit such steps, leading to deadlocks or misaligned execution (Davidson et al., 4 Nov 2025).
Multi-Agent Pathfinding (MAPF) and Exploration (MAE) Benchmarks: In POGEMA (Skrynnik et al., 2024) and Linardakis et al. (Linardakis et al., 2024), classical search/planning (e.g., LaCAM) and hybrid methods outperform pure MARL by a wide margin on success rate and SoC. MARL methods, while computationally efficient per step, underperform in complex, congested, or bottlenecked scenarios, whereas hybrid planners optimally balance runtime scalability and solution optimality. In exploration settings, cost–utility-based algorithms (CU) achieve lower cost and rounds than potential-field or nearest-frontier baselines.
5. Algorithmic Approaches: MAE, MAPF, and Swarm Paradigms
Benchmarks support comparison of diverse algorithmic paradigms:
- Distributed Cost–Utility (CU): As in Linardakis et al. (Linardakis et al., 2024), agents use shared maps and a utility function combining wavefront cost and expected information gain to assign frontiers (see the sketch at the end of this section).
- Centralized and Hybrid Planners: POGEMA (Skrynnik et al., 2024) evaluates LaCAM, RHCR, and hybrid algorithms (e.g., SCRIMP, Follower), showing trade-offs between solution quality and scalability.
- Swarm-Based Approaches: Density-driven swarms of memoryless, markerless particles (Sánchez et al., 22 Sep 2025) solve mazes via local density and flow rules, with a single kinetic parameter controlling the transition between diffusive and fast-traveling regimes. These systems achieve linear-in-size solving times in "fast" parameter regions, providing competitive scaling relative to classical graph search but using minimal agent state.
- Model-Free MARL: Parameter-shared agents learn decentralized coordination using QMIX, QPLEX, MAMBA, and similar architectures, evaluated via Gym/PettingZoo APIs (Skrynnik et al., 2024).
Each paradigm is benchmarked on communication, exploration cost, solution optimality, and robustness to scaling.
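As a concrete illustration of the cost–utility idea above, the sketch below greedily assigns each agent the unclaimed frontier that maximizes expected information gain minus a weighted wavefront (BFS) travel cost. The helper callables (`bfs_cost`, `info_gain`), the weight `beta`, and the greedy assignment order are illustrative simplifications, not the exact formulation of Linardakis et al.

```python
# Greedy cost-utility frontier assignment (illustrative simplification).
def assign_frontiers(agents, frontiers, shared_map, bfs_cost, info_gain, beta=1.0):
    """Each agent claims the best remaining frontier by utility =
    expected information gain - beta * wavefront travel cost."""
    assignment, claimed = {}, set()
    for agent in agents:                       # agents assumed to expose .id and .pos
        best, best_utility = None, float("-inf")
        for frontier in frontiers:
            if frontier in claimed:
                continue
            utility = info_gain(shared_map, frontier) - beta * bfs_cost(shared_map, agent.pos, frontier)
            if utility > best_utility:
                best, best_utility = frontier, utility
        if best is not None:
            assignment[agent.id] = best
            claimed.add(best)
    return assignment
```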
6. Research Recommendations and Future Directions
Collaborative maze-solving benchmarks highlight the need for evaluation that targets collaboration as a distinct, first-class capability. Recommendations include:
- Collaboration-Aware Evaluation: Benchmarks should systematically expose partial observability and unpredictable protocol negotiation, treating joint performance as a critical metric independent of solo capability (Davidson et al., 4 Nov 2025).
- Training Innovations: Standard pre-training and distillation pipelines fail to impart grounding, role inference, or robust conflict resolution. New training objectives must explicitly reward alignment with diverse partners and mutual model trust in distributed settings.
- Interaction and UI Design: Human–AI and AI–AI systems should structure interactions to allow stronger agents to prime context or schemas early, leveraging relay or "expert-seeding" protocols to mitigate late-stage misalignment (Davidson et al., 4 Nov 2025).
- Benchmark Expansion: Extensions are encouraged towards dynamic scenarios (changing obstacles, dynamic goals), communication-constrained settings, and richer environments involving non-zero-sum objectives or tool use.
- Cross-Benchmark Generalization: A unified platform (e.g., POGEMA) allows fair comparison of learning-based, search-based, and swarm-based strategies, incentivizing solutions that generalize across topologies and scales (Skrynnik et al., 2024).
A plausible implication is that as agents become increasingly heterogeneous and operate under real-world constraints, simple stylized benchmarks may underestimate true collaboration breakdown rates, underscoring the value of isolating and quantifying these effects early in development pipelines.
7. Comparative Tables and Key Results
Empirically grounded comparison enables algorithm selection and research targeting. Example results (abbreviated from (Linardakis et al., 2024)):
| Method | Rounds | Cost | Efficiency | Time (s) |
|---|---|---|---|---|
| New–CU | 24.3 | 198.7 | 0.0757 | 0.052 |
| CU–MNM | 26.1 | 213.4 | 0.0702 | 0.049 |
| HEDAC | 29.9 | 240.2 | 0.0625 | 0.056 |
On MAPF tasks (Skrynnik et al., 2024), LaCAM achieves top success rates (0.95), with MARL methods trailing (0.4 best-case), especially under high agent-density congestion.
The reproducible, data-rich nature of these collaborative maze-solving benchmarks provides a robust foundation for advancing the reliable deployment and joint alignment of heterogeneous agent-based systems in both simulation and real-world environments.