PillagerBench: Benchmarking LLM Agents in Minecraft
- PillagerBench is a benchmarking framework for evaluating LLM-based agents in competitive, multi-agent Minecraft scenarios, with modular APIs and standardized testing protocols.
- It supports multi-round experiments where agents adapt strategies over time using built-in rule-based opponents and detailed environmental data.
- The platform fosters open-source collaboration by offering reproducible experiments and transparent metrics to advance multi-agent AI research.
PillagerBench is a benchmarking framework for evaluating LLM-based agents in competitive, team-vs-team scenarios within Minecraft. It defines a standardized, extensible infrastructure for reproducible experimentation with multi-agent systems, providing built-in rule-based opponents, multi-round testing, and modular APIs that facilitate fair comparisons in adversarial and dynamic game environments. PillagerBench is open-sourced to foster broader community engagement and standardized evaluation of collaborative and competitive agent capabilities (Schipper et al., 7 Sep 2025).
1. Framework Architecture and API
PillagerBench establishes a modular architecture that centers around a multi-agent API split into three phases:
- Pre-game: Access and provision of scenario metadata, team assignment, and initial world state.
- Game: Agents execute high-level JavaScript actions in Minecraft via the Mineflayer interface. Each agent interacts through the API, which enables both perception and actuation primitives.
- Post-game: Collection of logs, statistics, and outcome metrics for debriefing and analysis.
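As a concrete illustration, the sketch below shows how an agent might hook into these three phases. The `ExampleAgent` class, its method names, and the `ScenarioInfo` fields are illustrative assumptions, not the actual PillagerBench API.

```python
# Hypothetical sketch of the three-phase agent lifecycle; names are illustrative,
# not the actual PillagerBench interface.
from dataclasses import dataclass, field


@dataclass
class ScenarioInfo:
    name: str                            # e.g. "mushroom_war"
    team: str                            # team assignment from the pre-game phase
    objectives: list[str] = field(default_factory=list)


class ExampleAgent:
    def pre_game(self, info: ScenarioInfo) -> None:
        # Pre-game: receive scenario metadata, team assignment, and initial world state.
        self.info = info

    def act(self, observation: dict) -> str:
        # Game: return a high-level action as JavaScript to be run via Mineflayer.
        target = self.info.objectives[0] if self.info.objectives else "spawn"
        return f"bot.chat('heading to {target}')"

    def post_game(self, logs: list[dict], outcome: dict) -> None:
        # Post-game: consume logs, statistics, and outcome metrics for analysis.
        print(f"Team {self.info.team} finished with outcome: {outcome}")
```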
The architecture comprises three main modules:
- Benchmark Module: Spawns Minecraft servers, manages Mineflayer agent instances, and leverages Docker for scalable, isolated deployment. This ensures consistent test conditions across experiments.
- Bridge Module: Orchestrates the communication between agents and the Minecraft environment, mediating message exchange and control instructions.
- Environment Module: Supplies detailed state representations (such as block states, chat logs, inventories) necessary for rich agent perception and decision-making.
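For intuition, a rough sketch of the kind of state the Environment Module exposes is given below; the dataclass fields are assumptions about the schema, not its exact definition.

```python
# Assumed shape of an environment observation (block states, chat logs,
# inventories); field names are illustrative, not PillagerBench's schema.
from dataclasses import dataclass, field


@dataclass
class BlockState:
    position: tuple[int, int, int]
    block_type: str                      # e.g. "red_mushroom_block"


@dataclass
class EnvObservation:
    blocks: list[BlockState] = field(default_factory=list)
    chat_log: list[str] = field(default_factory=list)   # recent chat messages
    # Mapping: agent name -> item name -> count
    inventories: dict[str, dict[str, int]] = field(default_factory=dict)
```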
2. Multi-Round and Multi-Episode Testing
PillagerBench supports multi-round (multi-episode) experiments, where state can persist and propagate between episodes. This design enables agents to adapt and refine strategies over time, a critical feature for evaluating learning and generalization in competitive settings. Multi-round testing reveals whether systems demonstrate cumulative learning, strategic evolution, and robustness under sustained adversarial conditions.
Agents may update internal models based on historical episode data (denoted $H_e$ for episode $e$), facilitating progressive tactics and counter-strategy adaptation, a process explicitly supported within TactiCrafter's architecture and formalized by repeated mappings:

$$T_{e+1} = \mathrm{LLM}\left(p_{\text{update}},\, T_e,\, D,\, G_e,\, O_e\right),$$

where $T_e$ is the tactics, $D$ is the scenario data, $G_e$ is the causal graph, $O_e$ is the inferred opponent data, and $p_{\text{update}}$ is the update prompt.
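A minimal harness for such multi-round evaluation might look as follows; `run_episode` and the `update_from_history` hook are assumed names, used only to make the episodic state propagation concrete.

```python
# Minimal sketch of multi-round evaluation with state persisting across episodes.
# `run_episode` and `update_from_history` are assumed, illustrative interfaces.
def run_multi_round(agents, opponents, scenario, num_rounds, run_episode) -> list[dict]:
    history = []
    for _ in range(num_rounds):
        outcome = run_episode(scenario, agents, opponents)   # hypothetical rollout
        history.append(outcome)
        for agent in agents:
            # Each agent may fold the accumulated episode data into its internal
            # models (tactics, causal graph, opponent model) before the next round.
            agent.update_from_history(history)
    return history
```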
3. Built-In Opponent Modeling and Scenario Definition
The framework provides rule-based opponents exhibiting fixed strategic behaviors, e.g., sabotage, defense, or cooperation within defined scenarios such as Mushroom War. Each opponent’s policy is codified to support reproducibility—a user’s agent is benchmarked against identical adversary logic across runs.
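To make the fixed-policy idea concrete, a toy rule-based saboteur might look like the following; the observation keys and emitted Mineflayer snippets are placeholders, not the actual Mushroom War opponent logic.

```python
# Toy rule-based opponent with a deterministic, reproducible policy.
# Observation keys and the emitted Mineflayer JavaScript are placeholders.
class SaboteurOpponent:
    def act(self, observation: dict) -> str:
        targets = observation.get("enemy_blocks", [])
        if targets:
            # Deterministically attack the first listed enemy block, so every
            # run faces identical adversary behavior.
            x, y, z = targets[0]
            return f"await bot.dig(bot.blockAt(new Vec3({x}, {y}, {z})))"
        return "await bot.waitForTicks(20)"
```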
Scenarios are described declaratively in YAML and managed by Hydra, allowing precise configuration of game mode, objectives, agent roles, environmental parameters, and evaluation criteria. This ensures experimental consistency and comparability of metric outcomes.
| Feature | Mechanism | Role in Benchmarking |
|---|---|---|
| Rule-based Opponents | Pre-coded sabotage/co-op policies | Establish a fixed reference |
| YAML Scenario Config | Hydra-driven configuration | Reproducible environments |
| Dockerized Deployment | Isolated server instances | Fairness across runs |
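As an illustration of this declarative setup, the snippet below composes a hypothetical scenario file with OmegaConf (the configuration library underlying Hydra); every field name and value is an assumption, not PillagerBench's actual schema.

```python
# Hypothetical scenario definition in the declarative YAML style managed by
# Hydra/OmegaConf; field names and values are illustrative only.
from omegaconf import OmegaConf

scenario_yaml = """
scenario:
  name: mushroom_war
  game_mode: team_vs_team
  rounds: 5
  teams:
    red:  {agents: 2, controller: llm}
    blue: {agents: 2, controller: rule_based}
  objectives:
    - harvest_mushrooms
  evaluation:
    metric: win_rate
"""

cfg = OmegaConf.create(scenario_yaml)
print(cfg.scenario.name, cfg.scenario.rounds)   # -> mushroom_war 5
```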
4. TactiCrafter: LLM-Based Multi-Agent System
TactiCrafter is an LLM-powered multi-agent system implemented atop PillagerBench. Its team-oriented architecture synthesizes strategy through four interdependent modules:
- Tactics Module: Generates high-level, human-readable tactics via LLM prompting, ensuring team coordination. Key operation involves initialization and continual episodic update:
  - Initial episode: $T_1 = \mathrm{LLM}(p_{\text{tactics}}, D)$
  - Subsequent episodes: $T_{e+1} = \mathrm{LLM}(p_{\text{update}}, T_e, D, G_e, O_e)$
- Causal Model: Learns and refines a causal graph $G_e$ over agent control primitives and environment interactions, updating observed dependencies each episode:
  - Initial causal graph: $G_1 = \mathrm{LLM}(p_{\text{causal}}, D)$
  - Episodic refinement: $G_{e+1} = \mathrm{LLM}(p_{\text{refine}}, G_e, H_e)$
- Opponent Model: Infers opponent tactics from chat logs and behavioral traces in $H_e$, formulating a tactics hypothesis $O_e = \mathrm{LLM}(p_{\text{opp}}, H_e)$.
- Base Agents: Execute team tactics by iterative code generation and environment interaction; error handling and self-critique are embedded through prompt-based feedback mechanisms:
  - Action code: $a_t = \mathrm{LLM}(p_{\text{act}}, T_e, G_e, s_t)$, where $s_t$ is the current observation
  - Critique: $c_t = \mathrm{LLM}(p_{\text{critique}}, a_t, r_t)$, where $r_t$ is the execution feedback or error trace
This structure facilitates adaptive learning through self-play, as TactiCrafter integrates episodic feedback and improves both causal understanding and tactical prediction over multiple rounds.
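The pseudo-loop below sketches one step of this interplay under stated assumptions: `llm` is any text-completion callable, `execute_js` is an assumed bridge that runs Mineflayer JavaScript and returns an error string (or None), and the prompts are illustrative rather than TactiCrafter's actual templates.

```python
# One illustrative base-agent step combining tactics, causal graph, opponent
# model, and prompt-based self-critique; all names and prompts are assumptions.
def base_agent_step(llm, execute_js, tactics: str, causal_graph: str,
                    opponent_tactics: str, observation: str) -> str:
    code = llm(
        f"Team tactics: {tactics}\n"
        f"Causal graph: {causal_graph}\n"
        f"Opponent tactics: {opponent_tactics}\n"
        f"Current observation: {observation}\n"
        "Write Mineflayer JavaScript for the next high-level action."
    )
    error = execute_js(code)              # returns an error string or None
    if error:
        # Self-critique on failure, then a single regeneration attempt.
        critique = llm(f"The action code failed:\n{error}\nCode:\n{code}\nCritique it.")
        code = llm(f"Rewrite the action code using this critique:\n{critique}\nOriginal:\n{code}")
        execute_js(code)
    return code
```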
5. Evaluation Metrics and Experimental Protocol
Benchmarked performance is reported across several axes:
- Win Rate: Fraction of episodes won against rule-based opponents for each team configuration.
- Adaptive Learning Progress: Analysis of improvements in tactics and causal modeling over repeated episodes.
- Strategic Synergy: Quantified via Coordination Metrics (e.g., task completion, resource allocation efficiency).
- Opponent Adaptation: Success in countering shifting opponent strategies, measured through post-game logs and behavioral divergence statistics.
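For concreteness, the helpers below show how two of these metrics could be computed from per-episode outcome records; the record fields (`winner`, `objectives_completed`, `resources_gathered`) are assumed names, not PillagerBench's logging schema.

```python
# Metric helpers over assumed per-episode outcome records; field names are illustrative.
def win_rate(outcomes: list[dict], team: str = "llm_team") -> float:
    # Fraction of episodes in which the given team is recorded as the winner.
    if not outcomes:
        return 0.0
    return sum(1 for o in outcomes if o.get("winner") == team) / len(outcomes)


def resource_efficiency(outcomes: list[dict]) -> float:
    # Simple coordination proxy: objectives completed per resource gathered.
    if not outcomes:
        return 0.0
    scores = [o["objectives_completed"] / max(o["resources_gathered"], 1) for o in outcomes]
    return sum(scores) / len(scores)
```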
Experiments are designed to maintain identical initial conditions for all agents. The fairness and reproducibility of testing are reinforced by Docker isolation, YAML scenario standardization, and fixed opponent logic.
6. Open Source Release and Community Impact
PillagerBench is open-sourced to support transparency, method sharing, and broad adoption for competitive multi-agent research. Researchers can reproduce results via supplied configurations and contribute novel extensions (e.g., new scenario modules, causal learning plug-ins, opponent modeling improvements). PillagerBench is positioned to serve as a common testbed for strategic reasoning in adversarial multi-agent environments, analogous to canonical RL benchmarks.
Open sourcing also enables collaborative bug fixing, rapid feature enhancement, and standardized evaluation protocols, addressing reproducibility challenges prevalent in multi-agent AI research.
7. Relation to Prior Minecraft Benchmarks and Potential Extensions
PillagerBench departs from builder-oriented synthetic benchmarks, such as those in (Madge et al., 17 Jul 2024), by emphasizing competitive multi-agent settings over spatial reasoning or construction. Whereas the builder-agent benchmarks focus on block placement tasks, spatial math, and dialog-based instructions, PillagerBench extends the scope to dynamic, adversarial challenge domains (including sabotage, resource contention, and real-time tactics).
A plausible implication is that future research could merge spatial reasoning and competitive strategy benchmarks, incorporating agent interactions with hostile entities (e.g., pillagers), environmental threats, and composite objectives. This integration would further challenge LLM reasoning across both geometric and competitive dimensions.
PillagerBench defines a rigorous, extensible methodology for evaluating collaborative and adversarial multi-agent systems under competitive constraints, establishing new standards for fairness, reproducibility, and strategic adaptation in Minecraft-based AI research (Schipper et al., 7 Sep 2025).