CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

Published 7 Jul 2025 in cs.MA and cs.AI | (2507.05178v1)

Abstract: Despite rapid progress in LLM-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces CREW-WILDFIRE, an open-source benchmark evaluating LLM-based multi-agent collaborations using procedurally generated wildfire simulations.
It employs heterogeneous agents such as firefighters, drones, bulldozers, and helicopters across 12 test levels to assess coordination, task delegation, and adaptive planning.
Results show that while existing frameworks perform well on simpler tasks, they struggle with complex, dynamic scenarios that demand improved communication and decentralized planning.

CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale

Introduction

The paper introduces CREW-WILDFIRE, an open-source benchmark specifically crafted to evaluate large-scale, LLM-based multi-agent systems. This benchmark addresses critical gaps in existing evaluations, which often emphasize smaller and less complex domains. The CREW-WILDFIRE environment features procedurally generated wildfire response scenarios, incorporating heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The study highlights significant performance gaps in current state-of-the-art frameworks, underscoring challenges in coordination, communication, and planning in dynamically complex environments.

Figure 1: Features procedurally generated environments, an LLM-compatible multi-agent framework, and heterogeneous agents designed to evaluate Agentic collaborations at scale.

Environment Design

The CREW-WILDFIRE environment leverages the CREW platform, providing a procedural generation of realistic wildfire scenarios. This includes varying terrain features generated via Perlin noise and detailed wildfire simulations using cellular automata models. The environment supports flexible interfacing with LLMs for both low-level control and high-level reasoning.

Figure 2: An overview of the CREW Wildfire Framework.

Heterogeneous Agents

The benchmark includes four types of agents: Firefighters, Bulldozers, Drones, and Helicopters, each with distinct capabilities requiring cooperative strategies. These roles mimic real-world fire crew specializations, demanding inter-agent dependencies that make coordination challenging.

Figure 3: The Four Heterogenous Agent Types: Firefighters, Bulldozers, Drones, and Helicopters.

Benchmarking Suite

The evaluation suite introduces 12 levels, each designed to test different competencies through diverse objectives. These levels vary in complexity, team composition, and task difficulty, providing a comprehensive testbed for assessing coordination, spatial reasoning, task delegation, and adaptability.

The suite also establishes a set of behavioral goals, such as Task Designation and Plan Adaptation, to diagnose conceptual challenges rather than relying solely on scoring.

Figure 4: Sample Multi-Agent Trajectories for the Cut Trees: Sparse(small) level. Each highlighted section represents the correct tree being cut by that agent.

Results and Analysis

The experimental results demonstrate that current frameworks excel in simplified tasks but struggle with larger, dynamic, and complex scenarios. Frameworks such as CAMON showed effective task designation but failed in decomposing and adapting plans in more intricate scenarios.

The analysis of communication efficiencies revealed significant scaling challenges, where token usage increases linearly with agent count, indicating the need for decentralized or hierarchical communication architectures.

Figure 5: API calls, input tokens, and output tokens per timestep across agent counts.

Behavioral Competency

By introducing the Behavior-Competency Score (BCS), the paper evaluates the frameworks on task-related behaviors. While frameworks demonstrate strengths in Realtime Coordination and Task Designation, gaps remain in Observation Sharing and Plan Adaptation. This metric provides more granular insights into behavioral competencies by associating specific actions with high-level goals.

Conclusions

CREW-WILDFIRE provides a crucial platform for challenging current LLM-based multi-agent approaches, emphasizing real-world complexity and scalability. By identifying clear performance and competency gaps, the benchmark sets the stage for developing more robust, scalable, and adaptable multi-agent systems.

The release of the benchmark and associated tools is intended to foster advancements in LLM-based multi-agent systems, pushing the boundaries of current capabilities and setting a new standard for future research. This benchmark not only aids in performance evaluation but also serves as a foundation for further exploration of agent coordination, strategic planning, and adaptive behaviors in large-scale scenarios.

Markdown Report Issue