BuildArena: LLM Construction Benchmark

Updated 21 October 2025
  • BuildArena is a modular interactive benchmark that evaluates LLM-driven construction automation through physics simulations and structured tasks.
  • It integrates customizable natural language task definitions, LLM-based multi-step planning, and a simulation engine to assess precision and robustness.
  • The platform supports multi-agent iterative workflows and quantitative metrics for comparing performance in static and dynamic engineering scenarios.

BuildArena is a physics-aligned interactive benchmark designed specifically to evaluate the capabilities of LLMs in automating engineering construction through natural language specifications. The benchmark integrates customizable tasks, a structured agentic workflow, a simulation-based evaluation environment, and a dedicated spatial geometric computation library to rigorously assess LLM proficiency in physical, mechanical, and planning-centric reasoning. BuildArena's modular architecture allows for systematic exploration of both creative and technically precise solutions in construction automation, providing extensive support for analysis and model comparison across multiple operational dimensions (Xia et al., 18 Oct 2025).

1. Modular Benchmarking Framework

BuildArena employs a modular benchmarking framework comprising three primary components:

  • Task Definition Layer: Users define tasks using natural language, specifying constraints and objectives. This layer supports full customization, permitting the creation of physically meaningful scenarios that capture a range of engineering requirements.
  • LLM-Based Construction Workflow: LLM agents interpret instructions and synthesize sequential building plans, producing structured commands that reflect multi-step reasoning under physical constraints.
  • Simulation-Based Evaluation: The system evaluates constructed objects in a physics simulation (utilizing Besiege’s engine), recording metrics such as displacement, load tolerance, thrust-to-weight ratio (TWR), and maximum structure height.

The framework's modular nature facilitates head-to-head model comparison, fine-grained error analysis, and extensibility for new task types or domains.
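The flow through these three layers can be sketched in a few lines of Python. The `TaskDefinition`, `ConstructionAgent`, and `PhysicsEvaluator` names below are illustrative assumptions, not identifiers from the BuildArena codebase:

```python
from dataclasses import dataclass, field

@dataclass
class TaskDefinition:
    """Natural-language task with constraints and target metrics (hypothetical schema)."""
    prompt: str
    constraints: list[str] = field(default_factory=list)
    metrics: list[str] = field(default_factory=list)

class ConstructionAgent:
    """Stands in for the LLM workflow that turns a prompt into build commands."""
    def plan(self, task: TaskDefinition) -> list[str]:
        # The real benchmark runs a multi-step agentic workflow here;
        # a canned command list suffices for illustration.
        return ["add_block core 0 0 0", "attach wheel core left"]

class PhysicsEvaluator:
    """Stands in for the simulation-based scoring of a finished build."""
    def evaluate(self, commands: list[str], task: TaskDefinition) -> dict[str, float]:
        # A real evaluator would run the physics simulation and read back
        # displacement, load tolerance, TWR, and similar quantities.
        return {metric: 0.0 for metric in task.metrics}

task = TaskDefinition(
    prompt="Build a vehicle that carries a 2 kg payload for 50 m.",
    constraints=["use at most 30 modules"],
    metrics=["max_displacement"],
)
agent = ConstructionAgent()
print(PhysicsEvaluator().evaluate(agent.plan(task), task))
```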

2. Task Design: Static and Dynamic Engineering Scenarios

BuildArena's task suite follows an extensible design that covers both static and dynamic mechanical principles. Three major categories are implemented:

| Category | Physical Principle | Sample Evaluation Metric |
| --- | --- | --- |
| Vehicle Construction | Dynamic horizontal motion | Maximum displacement under payload |
| Bridge Construction | Static stability | Load-bearing capacity, span width |
| Rocket Construction | Dynamic vertical motion | Thrust-to-weight ratio (TWR), maximum altitude |

Each category is divided into three difficulty tiers (Easy, Medium, Hard), quantitatively evaluated along six engineering dimensions: Quantification, Robustness, Magnitude, Compositionality, Precision, and Ambiguity. Increasing tier complexity demands higher accuracy, more elaborate modular assembly, and increased tolerance for ambiguous and large-scale specifications.
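As an illustration, a single task instance spanning the category/tier/dimension structure above could be written as a small config; the field names here are assumptions rather than BuildArena's actual schema:

```python
# Hypothetical task instance; field names are illustrative, not the
# benchmark's actual schema.
bridge_task = {
    "category": "bridge_construction",   # vehicle | bridge | rocket
    "principle": "static_stability",
    "tier": "medium",                    # easy | medium | hard
    "prompt": "Span a 20 m gap and support a 500 kg load at midspan.",
    "evaluated_dimensions": [
        "Quantification", "Robustness", "Magnitude",
        "Compositionality", "Precision", "Ambiguity",
    ],
    "success_metrics": {"load_bearing_kg": 500, "span_m": 20},
}
```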

3. 3D Spatial Geometric Computation Library

A core architectural element, BuildArena’s open-source spatial computation library, translates LLM-generated plans into executable physical actions:

  • State Representation: The build state is formalized as $S = \langle V, \mathcal{P}, c \rangle$, where $V$ is the set of modules, $\mathcal{P} : V \rightarrow SE(3)$ maps each module to its pose (position and orientation in 3D space), and $c$ is the control/action sequence (see the sketch after this list).
  • Constraint Handling: The library enforces geometric, spatial, and physical constraints (e.g., preventing collisions, maintaining attachment point occupancy). Invalid actions yield precise error codes and feedback, supporting iterative model learning and plan refinement.
  • Closed-Loop Integration: This library operates in tandem with the simulation engine, allowing real-time state update and immediate error propagation for agent guidance.
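A condensed sketch of this state representation and a crude separation-based constraint check follows, assuming a simplified pose type; the actual library's data structures and collision logic are more elaborate:

```python
import numpy as np
from dataclasses import dataclass, field

@dataclass
class Pose:
    """Simplified SE(3) element: a 3x3 rotation matrix plus a translation vector."""
    R: np.ndarray
    t: np.ndarray

@dataclass
class BuildState:
    """S = <V, P, c>: modules, their poses, and the action history."""
    modules: dict[str, str] = field(default_factory=dict)  # V: module id -> type
    poses: dict[str, Pose] = field(default_factory=dict)   # P: module id -> SE(3) pose
    actions: list[str] = field(default_factory=list)       # c: control sequence

def violates_min_separation(state: BuildState, new_pose: Pose,
                            min_dist: float = 0.5) -> bool:
    """Crude collision proxy: reject placements too close to an existing module."""
    return any(np.linalg.norm(new_pose.t - p.t) < min_dist
               for p in state.poses.values())

state = BuildState()
state.modules["core"] = "block"
state.poses["core"] = Pose(np.eye(3), np.zeros(3))
candidate = Pose(np.eye(3), np.array([0.1, 0.0, 0.0]))
print(violates_min_separation(state, candidate))  # True: too close to "core"
```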

4. Multi-Agent, Multi-Turn Agentic Workflow

BuildArena’s evaluation protocol is structured as a multi-agent conversation, enabling detailed reasoning and error correction. The workflow components include:

  • Planner Agent ("P"): Generates the overall construction blueprint from the prompt.
  • Draft–Review Loop ("D" and "R"): Iteratively refines and verifies detailed schematics through back-and-forth agent dialog.
  • Build–Guidance Loop ("B" and "G"): Translates the approved schematic into a sequenced control trajectory $\bar{a} = \langle (a_1, r_1), (a_2, r_2), \ldots, (a_T, r_T) \rangle$; each action $a_i \in \mathcal{A}$ is paired with return data $r_i$ (success or error feedback).

This coarse-to-fine approach mimics actual engineering processes—moving from high-level designs to granular assembly—with continual feedback and revision cycles.
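The loop structure might look like the following sketch, where the stub functions stand in for LLM calls and simulator execution (none of these names are BuildArena's API):

```python
# Stub agent calls standing in for LLM invocations and simulator execution.
def planner(prompt): return "blueprint: 3-module vehicle"
def draft(text): return "schematic: core block + 2 wheels"
def review(schematic): return True, "approved"
def build_step(schematic, history):
    return "add_block core 0 0 0" if not history else "DONE"
def execute(action): return True, "ok"

def run_workflow(prompt, max_review_turns=3, max_actions=50):
    blueprint = planner(prompt)              # P: high-level blueprint
    schematic = draft(blueprint)             # D: detailed schematic
    for _ in range(max_review_turns):        # D-R loop: refine until approved
        approved, feedback = review(schematic)
        if approved:
            break
        schematic = draft(feedback)
    trajectory = []                          # a-bar = <(a_1, r_1), ..., (a_T, r_T)>
    for _ in range(max_actions):             # B-G loop: act, observe, revise
        action = build_step(schematic, trajectory)
        ok, ret = execute(action)            # r_i feeds back into the next step
        trajectory.append((action, ret))
        if action == "DONE":
            break
    return trajectory

print(run_workflow("Build a small vehicle."))
```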

5. Quantitative Evaluation and Frontier Model Performance

BuildArena was deployed to test eight state-of-the-art LLMs (GPT-4o, Claude-4, Grok-4, Gemini-2.0, DeepSeek-3.1, Qwen-3, Kimi-K2, Seed-1.6) across all categories and difficulty levels. Evaluation metrics include:

  • Precision: Accuracy of geometric placement and module orientation.
  • Robustness: Single-point failure tolerance and stability under perturbations.
  • Magnitude and Ambiguity: Capacity to handle larger scale and imprecise instructions.
  • Compositionality: Effectiveness at modular, hierarchical assembly.

Notable findings include Grok-4 outperforming other models on tasks requiring high Precision and Robustness. However, all models exhibited a general performance decline as difficulty increased, especially in compositional and precision-demanding scenarios. The evaluation also showed that higher token usage does not correlate linearly with better outcomes, indicating efficiency trade-offs in plan generation.

6. Mathematical Formalization and Performance Metrics

Several mathematical models underpin the evaluation process:

  • State and Control Sequences: $S = \langle V, \mathcal{P}, c \rangle$ and $\bar{a} = \langle (a_1, r_1), \ldots, (a_T, r_T) \rangle$, as defined above.
  • Physical Success Criteria: In rocket tasks, the system uses the thrust-to-weight ratio (TWR), where $\mathrm{TWR} > 1$ denotes a physically viable takeoff (a worked example follows below).
  • Error Logging and Token Analysis: The framework records detailed error reasons, token counts, and performance distributions across conversation turns and builds.

These formal representations are essential for reproducible benchmarking, quantitative agent comparison, and systematic identification of failure modes.
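As a concrete instance of the TWR criterion, the check reduces to a one-liner; the thrust and mass figures below are illustrative, not taken from the benchmark:

```python
G = 9.81  # m/s^2, standard gravity

def thrust_to_weight_ratio(thrust_n: float, mass_kg: float) -> float:
    """TWR = thrust / (mass * g); TWR > 1 means the rocket can lift off."""
    return thrust_n / (mass_kg * G)

# Illustrative rocket: 1200 N of thrust lifting a 100 kg craft.
twr = thrust_to_weight_ratio(1200.0, 100.0)
print(f"TWR = {twr:.2f}, viable takeoff: {twr > 1}")  # TWR = 1.22, viable takeoff: True
```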

7. Project Resources and Community Extension

The BuildArena project page (https://build-arena.github.io/) provides:

  • Full source code for all benchmarking components and geometric computation libraries.
  • Documentation on module specifications and action schemas.
  • Experiment archives for all tested models, including cost analyses, failure logs, and conversation traces.
  • Open invitation for community submission of new modules, task types, and evaluation protocols.

BuildArena's extensibility and documentation enable ongoing community-driven growth and facilitate the rigorous evaluation of future-generation LLMs in language-driven engineering construction.


BuildArena represents a comprehensive platform for probing the intersection between natural language processing capabilities and physically grounded construction automation. Its integration of interactive agentic workflows, physical simulation, and precise geometric modeling establishes a foundation for future research bridging LLMs with practical engineering domains (Xia et al., 18 Oct 2025).
