AEC-Bench: A Multimodal Benchmark for Agentic Systems in Architecture, Engineering, and Construction
Abstract: AEC-Bench is a multimodal benchmark for evaluating agentic systems on real-world tasks in the Architecture, Engineering, and Construction (AEC) domain. The benchmark covers tasks requiring drawing understanding, cross-sheet reasoning, and construction project-level coordination. This report describes the benchmark's motivation, dataset taxonomy, evaluation protocol, and baseline results across several domain-specific foundation-model harnesses. We use AEC-Bench to identify tools and harness-design techniques that consistently improve performance across foundation models running in their own base harnesses, such as Claude Code and Codex. We openly release our benchmark dataset, agent harness, and evaluation code for full replicability at https://github.com/nomic-ai/aec-bench under an Apache 2.0 license.
Explain it Like I'm 14
AEC‑Bench: A simple explanation for teens
What is this paper about?
This paper introduces AEC‑Bench, a big “test” for AI assistants that try to help with real construction work. AEC stands for Architecture, Engineering, and Construction. In this field, people use large sets of 2D drawings and documents (think: very detailed, very crowded maps of buildings) to plan and build things like bridges and offices. The authors built AEC‑Bench to see how well different AI “agents” can read these drawings, find the right info across many pages and files, and report what they discover.
What questions are the researchers asking?
Here are the main questions in simple terms:
- Can today’s coding-style AI agents (robot interns that can use tools and write small programs) handle real construction documents?
- Do special tools that break drawings into more structured, searchable pieces help these agents do better?
- Which kinds of tasks are easy or hard for AI agents: single‑page checks, multi‑page checks, or cross‑document checks?
- Where do these agents fail, and what tools or designs help fix those failures?
How did they do it?
Think of a construction project like a giant treasure hunt:
- Each drawing page is like a dense map with symbols, labels, and arrows pointing to other pages.
- The “agent” is a robot intern that can open files, search text, render images, run simple code, and then write a neat report of what it found.
What the team built:
- A benchmark (AEC‑Bench) with 196 small, realistic tasks grouped by how much context you need:
- Intra‑Sheet: tasks solvable from one page (e.g., “does this label match the thing it points to?”).
- Intra‑Drawing: tasks that require hopping across several pages in one drawing set (e.g., “does this reference number point to a real detail on another page?”).
- Intra‑Project: tasks across different document types (e.g., do the drawings and the written specs disagree?).
How tasks work:
- Each task gives the agent instructions and access to real project PDFs in a safe “sandbox.”
- The agent explores, uses tools (like turning PDFs into text or images), and outputs answers in a structured file (JSONL).
- An automatic checker (and domain experts) grade the final answers for correctness and completeness, giving a score from 0 to 100.
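The answer-file-plus-checker loop described above can be sketched in a few lines. The JSONL field names and the scoring rule below are illustrative assumptions, not the benchmark's actual schema:

```python
import json

# Hypothetical agent output: one JSON object per finding (JSONL).
agent_output = "\n".join([
    json.dumps({"task_id": "xref-01", "sheet": "A-101",
                "finding": "Detail 5/A-501 reference on A-101 does not resolve"}),
    json.dumps({"task_id": "xref-01", "sheet": "A-502",
                "finding": "Detail 2/A-502 title mismatch"}),
])

# Hypothetical ground truth for the same task.
ground_truth = [
    {"sheet": "A-101", "keywords": ["5/A-501", "does not resolve"]},
]

def score(jsonl_text, truth):
    """Toy verifier: fraction of ground-truth findings matched, scaled to 0-100.
    A truth item matches if all its keywords appear in one reported finding on
    the expected sheet. Extra findings are ignored here; a real verifier might
    penalize them."""
    findings = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    matched = 0
    for item in truth:
        for f in findings:
            if f.get("sheet") == item["sheet"] and all(k in f["finding"] for k in item["keywords"]):
                matched += 1
                break
    return 100 * matched // len(truth)

print(score(agent_output, ground_truth))  # -> 100
```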
How the dataset was made:
- The team took real public construction PDFs and used a semi‑automated process to:
- Extract structured info (text locations, geometry, cross‑references).
- Carefully inject tiny, realistic “gotchas” (like a wrong reference number) without ruining the look.
- Verify the edits visually and with text checks to keep things realistic and fair.
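The visual-verification step can be illustrated with a toy pixel-level diff. Real checks run on rendered PDF pages; the small integer grids below stand in for grayscale images:

```python
def pixel_diff_bbox(before, after):
    """Return the bounding box (min_row, min_col, max_row, max_col) covering
    all changed pixels, or None if the two images are identical."""
    changed = [(r, c)
               for r, row in enumerate(before)
               for c, (a, b) in enumerate(zip(row, after[r]))
               if a != b]
    if not changed:
        return None
    rows = [r for r, _ in changed]
    cols = [c for _, c in changed]
    return (min(rows), min(cols), max(rows), max(cols))

# A tiny "sheet" where one reference number was edited in place.
before = [[0, 0, 0, 0],
          [0, 5, 5, 0],
          [0, 0, 0, 0]]
after  = [[0, 0, 0, 0],
          [0, 5, 7, 0],
          [0, 0, 0, 0]]

# The edit should be small and localized; a sprawling bounding box would
# suggest the injection disturbed surrounding linework.
print(pixel_diff_bbox(before, after))  # -> (1, 2, 1, 2)
```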
What they compared:
- Two popular coding‑agent “harnesses” (think: the agent’s workbench that provides tools and rules)—similar to Claude Code and Codex.
- Two setups:
- H: just the base tools (command line, PDF utilities, etc.).
- H + Nomic: base tools plus special “parsing” and embeddings from Nomic that give more structured access to the drawings (where text is, how pages link, etc.).
- They measured how much the added structure helps.
What did they find, and why is it important?
Key takeaways:
- General coding agents transfer somewhat well: They can do a fair amount, especially when the job is mostly about searching and organizing information.
- Finding the right context is the biggest bottleneck: Agents often fail not because they can’t reason, but because they can’t find the correct page, detail, or file. When they do find it, they do much better.
- Structured parsing helps on “find‑it” tasks:
- With the extra structured tools, scores improved a lot on tasks like technical detail review, matching specs to drawings, and navigating to the right file/sheet/detail.
- But parsing doesn’t fix everything—visual grounding is still hard:
- Tasks that require precise visual understanding—like following arrows (leader lines) to the exact item on a busy drawing or checking that a note truly points to the right element—did not improve and sometimes got worse. It’s not enough to have text; the agent needs to understand spatial relationships in the drawing.
- Judgment-heavy tasks remain tough:
- Submittal review (checking if a contractor’s submitted product meets drawings and specs) stayed low across all models and setups. It requires professional judgment and careful prioritization, not just retrieval.
- Agents default to “code-first” habits:
- Most agents heavily used tools that turn PDFs into plain text (like pdftotext). This flattens the drawing’s spatial structure, losing important geometric context. Even when they rendered images, they rarely turned that visual info into precise, location-aware answers.
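The "flattening" problem can be made concrete with a toy example. Flat extraction yields a note and two candidate elements with no way to tell which one the note annotates; coordinate-aware extraction can resolve it by distance to the leader-line endpoint. The words and coordinates below are invented for illustration:

```python
import math

# Coordinate-aware extraction: each token keeps its (x, y) position.
words = [
    {"text": "NOTE 3",   "x": 10, "y": 10},
    {"text": "COLUMN A", "x": 12, "y": 40},   # near the leader endpoint
    {"text": "COLUMN B", "x": 90, "y": 45},
]
leader_end = (14, 38)  # where the note's leader line terminates

# Flat text loses the geometry: both columns look equally plausible.
flat = " ".join(w["text"] for w in words)
print(flat)  # NOTE 3 COLUMN A COLUMN B

# With coordinates, pick the candidate element nearest the leader endpoint.
def nearest(candidates, point):
    return min(candidates, key=lambda w: math.hypot(w["x"] - point[0], w["y"] - point[1]))

target = nearest([w for w in words if w["text"].startswith("COLUMN")], leader_end)
print(target["text"])  # -> COLUMN A
```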
Why this matters:
- In real projects, delays and cost overruns often come from mismatched documents and missed details. An AI assistant that’s good at finding and cross-checking the right info could save time and money.
- The results show we need better tools that help agents keep and use the drawing’s spatial structure, not just the text.
What could this change in the future?
- Better AI assistants for AEC: With improved tools that understand both text and drawing layout, agents could help professionals catch mistakes earlier and coordinate teams more smoothly.
- Smarter system design: Success depends on how visual info, text, and tools are combined—not just having a powerful model. Domain‑aware orchestration (the way tools and steps are arranged) matters a lot.
- Stronger evaluations: Testing agents in realistic, workflow‑like settings gives a truer picture of what they can actually do on the job.
The benchmark, code, and tools are open-source (Apache 2.0 license), so others can reproduce results and build on this work.
Knowledge Gaps
Below is a consolidated list of concrete gaps and unresolved questions the paper leaves open, organized to guide future research.
Dataset scope and representativeness
- Limited scale (196 instances) and document diversity; unclear how results generalize across larger projects, more sheets per set, and additional AEC disciplines and asset types.
- Injected/controlled defects may not reflect the full spectrum, frequency, and ambiguity of naturally occurring coordination issues; performance on “in-the-wild” errors remains unknown.
- No stratified analysis by discipline (architectural, structural, MEP, civil) or drawing type (plans, sections, details) to identify domain- or artifact-specific failure modes.
- Scanned vs. vector PDFs are not distinguished; the impact of OCR noise, rasterization quality, and vector availability on task performance is unquantified.
- Absence of a train/dev/test split or “challenge” sets (e.g., noisy scans, low-contrast annotations, dense callouts) to systematically probe robustness and difficulty.
Evaluation protocol and metrics
- Single-trial evaluation per instance; no variance estimates, confidence intervals, or statistical significance testing across runs or seeds.
- Deterministic, keyword/structure-based verifiers do not assess evidence localization, spatial grounding quality, or reasoning faithfulness; lack of metrics that require evidence pointers (e.g., bounding boxes, sheet coordinates) and justification.
- Judgment-heavy tasks (e.g., submittal review) are evaluated with static verifiers despite acknowledged subjectivity; no inter-rater reliability, human-in-the-loop assessment, or rubric-calibrated scoring.
- No precision/recall, calibration, or cost-of-errors analysis on multi-finding tasks where false positives are operationally harmful.
- Lack of human baselines (time, accuracy) to contextualize agent performance and difficulty calibration.
- No reporting on task-time, tokens, or compute/latency budgets; efficiency and cost-performance trade-offs remain unmeasured.
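One way to address the missing evidence-localization metric noted above is a verifier that scores both the answer text and a bounding-box pointer. A minimal sketch, where the schema and the 0.5 IoU threshold are assumptions:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def verify(claim, truth, iou_threshold=0.5):
    """Credit a claim only if the answer matches AND its evidence box
    overlaps the ground-truth region on the same sheet."""
    return (claim["answer"] == truth["answer"]
            and claim["sheet"] == truth["sheet"]
            and iou(claim["bbox"], truth["bbox"]) >= iou_threshold)

truth = {"answer": "5/A-501", "sheet": "A-101", "bbox": (100, 200, 160, 220)}
good  = {"answer": "5/A-501", "sheet": "A-101", "bbox": (105, 202, 158, 219)}
bad   = {"answer": "5/A-501", "sheet": "A-101", "bbox": (400, 400, 460, 420)}

print(verify(good, truth), verify(bad, truth))  # -> True False
```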
Agent/harness design and orchestration
- Only two coding-agent harness families (Codex, Claude Code) are evaluated; effects of alternative agent paradigms (e.g., ReAct, Planner-Executor, verifier-in-the-loop, multi-agent decomposition) are unknown.
- Agents defaulted to CLI-centric, pdftotext-driven strategies; it is unclear whether better tool affordances, prompting, or orchestration can shift behavior toward spatially grounded methods.
- No ablation on tool availability or instruction strategies to test if agents can be nudged to use structured tools (e.g., “read with regions,” geometric parsers) over linear text extraction.
- Outcome-only scoring obscures process-level failure analysis; lack of fine-grained trajectory evaluation (tool selection, exploration coverage, backtracking) limits insight into harness weaknesses.
Representation and perception for drawings
- Parsing benefits retrieval but not spatial grounding; the benchmark does not test or compare richer geometric/semantic representations (e.g., vector primitives, symbol libraries, topology graphs, CAD/BIM exports like DWG/IFC).
- No controlled study on image resolution, DPI, or rendering strategies and their effect on visual-spatial tasks (e.g., detecting leader lines, symbol associations).
- Absence of specialized perception modules (symbol detectors, line/leader/arrow parsers, OCR tuned for drawings); impact of integrating these is untested.
- Cross-sheet/link extraction is provided, but the completeness and error rates of parsed reference graphs (and their effect on tracing tasks) are not quantified.
Task design and coverage
- Narrow set of task families; many high-value AEC checks (e.g., dimension/scale consistency, units conversions, code compliance crosswalks, BOM/quantity takeoff alignment, change/version tracking) are not assessed.
- Intra-project tasks focus on text-centric sync; tasks requiring numerical reasoning with units, tolerances, and scale bars are underrepresented.
- No incremental/hierarchical tasks to test long-horizon exploration (e.g., “find, then verify, then summarize with evidence”) or curriculum effects.
Generalization, robustness, and transfer
- Generalization to proprietary or larger-scale enterprise project repositories (with deeper hierarchies, naming conventions, and nonstandard indexing) is untested.
- Cross-language/locale robustness (e.g., multilingual notes, region-specific symbols and codes) is not evaluated.
- Sensitivity to project metadata conventions (sheet naming, index styles, title block formats) is not analyzed.
Safety, reliability, and usability
- Over-generation and false positives are observed (e.g., submittal review), but there is no exploration of uncertainty estimation, abstention policies, or confidence-calibrated reporting to mitigate operational risk.
- No study of human-in-the-loop workflows (triage, verification, correction) or how agent outputs integrate with industry formats (e.g., BCF/issue trackers) and practitioner tooling.
- Lack of error taxonomies and actionable diagnostics that practitioners could use to trust or correct agent findings.
Reproducibility and fairness
- Results depend on proprietary foundation models; replicability across model updates or with open-source models is not addressed.
- The “H+Nomic” augmentation introduces domain-specific tools; fairness and comparability to alternative structured parsers or third-party toolchains are not investigated.
- Minor inconsistencies between narrative (expert grading) and practice (automated verifiers) create ambiguity about evaluation reproducibility and human oversight.
Open methodological questions
- What representations and toolchains most effectively couple retrieval with spatial grounding (e.g., hybrid vector-graph + image + text pipelines)?
- How should verifiers be designed to jointly assess correctness, evidence localization, and reasoning fidelity at scale?
- Which orchestration strategies best manage long-horizon document navigation (e.g., spatial memory, map-building, frontier exploration) and reduce compounding retrieval errors?
- Can domain-tuned training (e.g., instruction tuning on AEC tasks, RL with verifiers, retrieval-aware multimodal pretraining) materially improve grounding and judgment-heavy performance?
- How can benchmarks better reflect real coordination workflows, including iterative review cycles, uncertainty management, and graded severity/priority of findings?
Practical Applications
Below is an overview of practical applications that follow from AEC-Bench’s findings, methods, and tooling (e.g., Harbor harness, structured parsing via Nomic Parse/Embeddings). The items emphasize where the benchmark shows clear, deployable value (retrieval- and structure-heavy tasks) and where longer-term R&D is needed (spatial grounding and judgment-heavy tasks).
Immediate Applications
- Industry (AEC): Benchmark-driven vendor evaluation and procurement guardrails
- Use AEC-Bench as an acceptance test and SLA backbone in RFPs for AI document assistants, with task-category thresholds (e.g., drawing navigation and spec–drawing sync) derived from reported reward scores.
- Tools/workflows: Harbor harness; AEC-Bench dataset; automated verifiers; reproducible containers.
- Assumptions/dependencies: Access to representative project documents; ability to run containerized evaluations; acceptance by procurement teams.
- Industry (AEC) / Software: Retrieval-first document navigator for drawings
- Deploy an agent that reliably locates the correct file/sheet/detail given a query, where H+structured parsing showed near-perfect gains (e.g., drawing-navigation).
- Tools/workflows: Nomic Parse/Embeddings (or equivalent) to build cross-sheet indices; integration with Procore, Autodesk Construction Cloud, Bluebeam.
- Assumptions/dependencies: Indexing of all PDFs and sheet metadata; robust PDF parsing; permissions and APIs for DMS integration.
- Industry (AEC): Spec–drawing synchronization checker
- Automate detection of conflicts between specifications and drawings; parsing improved performance on this retrieval-sensitive task.
- Tools/workflows: Structured parsing pipelines for both specs and drawings; evidence-grounded JSON reports reviewed by engineers.
- Assumptions/dependencies: Consistent access to current spec books and drawings; policy for human-in-the-loop approval; traceability requirements.
- Industry (AEC): Sheet index and title block auditor
- Validate sheet index entries vs. title blocks for mismatches (high baseline performance suggests easy wins).
- Tools/workflows: Batch PDF text/layout extraction; scheduled audits with discrepancy dashboards.
- Assumptions/dependencies: Reliable OCR/layout extraction for title blocks; consistent title block formats.
- Industry (AEC): Cross-reference resolution scanner
- Identify broken or non-resolving cross-references across a drawing set (moderate performance; use with human review).
- Tools/workflows: Parse relationships between callouts and target details; flag suspect references for manual confirmation.
- Assumptions/dependencies: Correctly detected reference schema; tolerance for false positives and reviewer workflows.
- Academia / Software R&D: Harness- and tool-ablation research baseline
- Use AEC-Bench to study how representation and orchestration (e.g., structured parses vs. CLI-only) affect agent performance.
- Tools/workflows: Harbor harness; controlled tool-access experiments; shared leaderboards and reproducible code.
- Assumptions/dependencies: Compute resources; standard reporting of reward metrics by task.
- Education (AEC programs): Multimodal document literacy labs with auto-grading
- Train students and junior engineers on real review workflows (cross-sheet navigation, detail checks) using AEC-Bench tasks and verifiers.
- Tools/workflows: Classroom CI pipelines; automated feedback on JSON outputs; curated task subsets per course module.
- Assumptions/dependencies: Appropriately licensed documents; instructor capacity to curate tasks.
- Software Engineering (MLOps): CI/CD regression suites for AEC agents
- Run AEC-Bench nightly to detect regressions when updating models/tools; gate deployments on category-specific performance.
- Tools/workflows: Containerized tasks; reward thresholds; model/version tracking.
- Assumptions/dependencies: Stable compute budget; agreement on pass/fail criteria.
- Cross-domain (legal/healthcare/finance/manufacturing): Defect-injection evaluation pipelines
- Repurpose the semi-automated artifact-injection method to build realistic benchmarks for other document-heavy domains (e.g., contracts, IFUs, SOPs, P&IDs).
- Tools/workflows: Alignment-aware editing; before/after/diff validation; domain-expert curation.
- Assumptions/dependencies: Domain expertise; access to representative public documents; ethical/privacy constraints.
- Public Sector / Policy: Evidence-based auditing of vendor claims
- Agencies can independently verify AI assistant performance on public projects using an open, Apache-2.0 benchmark.
- Tools/workflows: Standardized evaluation protocol; reproducible code; transparent scorecards per task family.
- Assumptions/dependencies: Policy willingness to require open evaluations; reproducibility in agency environments.
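Several of the immediate applications above, such as the cross-reference scanner and the sheet-index auditor, reduce to set reconciliation over parsed metadata. A minimal sketch of the cross-reference case, with invented callout and detail structures:

```python
# Callouts parsed from sheets: each says "see detail D on sheet S".
callouts = [
    {"from_sheet": "A-101", "detail": "5", "target_sheet": "A-501"},
    {"from_sheet": "A-102", "detail": "2", "target_sheet": "A-502"},
    {"from_sheet": "A-103", "detail": "9", "target_sheet": "A-501"},  # broken
]

# Details that actually exist in the set, keyed by (sheet, detail number).
details = {("A-501", "5"), ("A-502", "2"), ("A-501", "7")}

def broken_references(callouts, details):
    """Flag callouts whose (target_sheet, detail) pair does not resolve.
    Real scanners should route these to human review, since parsing errors
    can produce false positives."""
    return [c for c in callouts if (c["target_sheet"], c["detail"]) not in details]

for c in broken_references(callouts, details):
    print(f'{c["from_sheet"]}: {c["detail"]}/{c["target_sheet"]} does not resolve')
# -> A-103: 9/A-501 does not resolve
```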
Long-Term Applications
- Industry (AEC) / AI R&D: Spatially grounded drawing comprehension modules
- Develop VLM components that can trace leader lines, resolve symbols, and align text-to-geometry—addressing the benchmark’s core failure mode in note-callout accuracy and visually grounded tasks.
- Tools/workflows: New training datasets with symbol/leader annotations; region-level grounding; pixel-coordinate evidence outputs.
- Assumptions/dependencies: Large-scale curated data; novel perception models; evaluation protocols that check spatial grounding.
- Industry (AEC): End-to-end submittal review copilot with judgment modeling
- Build agents that triage submittals and produce evidence-backed findings consistent with professional standards—addressing low current performance and over-generation risks.
- Tools/workflows: Hierarchical reasoning with verifiers, policy templates, and human-in-the-loop adjudication; uncertainty calibration.
- Assumptions/dependencies: Improved long-horizon planning; domain-knowledge infusion; clear liability and governance frameworks.
- Industry (AEC): Real-time design review assistant during coordination sessions
- Live flag inconsistencies (cross-references, spec conflicts) as teams mark up drawings in Bluebeam/ACC; present linked evidence regions.
- Tools/workflows: Low-latency parsing and region retrieval; spatial memory across sheets; UI plugins for evidence overlays.
- Assumptions/dependencies: Plugin APIs; robust on-the-fly parsing; reliable grounding so as not to distract reviewers.
- Standards / Policy: Machine-readable AEC document requirements
- Encourage or mandate parsable, standardized title blocks, callout schemas, and cross-reference formats (e.g., via ISO 19650-aligned extensions) to enable reliable automation.
- Tools/workflows: Industry working groups; conformance tests; benchmark-aligned certification criteria.
- Assumptions/dependencies: Ecosystem adoption by designers/contractors; updates to authoring templates and QA processes.
- Software / Benchmarking: Evidence-grounded agent harness standards
- Require agents to cite coordinates/page regions for every claim; verifiers check both content and location to reduce hallucinations and ensure traceability.
- Tools/workflows: Harbor extensions to collect/certify evidence; UI for evidence replay; standardized JSON schemas for citations.
- Assumptions/dependencies: Model capabilities for fine-grained localization; acceptance by auditors/regulators.
- Industry (AEC) / CAD-BIM: Authoring-time QA in Revit/AutoCAD
- Plugins that continuously check detail titles, cross-references, and sheet indices inside the authoring environment, reducing downstream RFIs/change orders.
- Tools/workflows: Mappings between native models and publish-to-PDF; preflight checks; actionable fix suggestions.
- Assumptions/dependencies: Vendor SDKs/APIs; reliable linkage between model objects and published sheets.
- Cross-sector (Energy, Aerospace, Manufacturing): Multimodal agent benchmarks for technical drawings
- Extend AEC-Bench methodology to circuit diagrams, piping & instrumentation diagrams (P&IDs), aircraft maintenance manuals, grid schematics—where cross-document/spatial reasoning is critical.
- Tools/workflows: Domain-specific symbol libraries; task taxonomies paralleling intra-sheet/drawing/project scopes.
- Assumptions/dependencies: Access to public corpora; domain-expert curation; safety/regulatory buy-in.
- AI Infrastructure: Memory- and token-efficient multimodal retrieval stacks
- Build context-engineering layers that index documents into linkable regions, compress context, and retrieve evidence just-in-time—reflecting benchmark findings about retrieval as a bottleneck.
- Tools/workflows: Region-level embeddings; structured stores for text/layout/geometry; retrieval-aware planning.
- Assumptions/dependencies: Efficient multimodal vector stores; orchestration libraries that manage region provenance.
- Workforce Development: Credentialing for digital AEC coordination skills
- Create micro-credentials where trainees pass benchmarked tasks (e.g., navigation, reference tracing) with evidence-based reports.
- Tools/workflows: Proctored AEC-Bench subsets; dashboards for skill gaps; integration with professional development programs.
- Assumptions/dependencies: Industry acceptance; alignment with licensure/CPD frameworks.
- Insurance / Risk: Performance-based underwriting for AI-assisted workflows
- Use benchmark scores by task category to model risk impacts (e.g., likelihood of change orders due to coordination errors), influencing premiums and guarantees.
- Tools/workflows: Actuarial studies linking benchmark outcomes to project KPIs; third-party certifications.
- Assumptions/dependencies: Sufficient historical data; regulator approval; demonstrated correlation between scores and outcomes.
Notes on feasibility and dependencies common across applications:
- Retrieval-centric tasks are deployable now when coupled with structured parsing; spatial-grounding and high-judgment tasks need model advances and better evidence-verification loops.
- Success depends on high-quality PDF parsing (text + layout + references), access to complete project document sets, integration with existing DMS/CDEs, and clear human-in-the-loop protocols.
- For regulated uses, evidence traceability (page/region coordinates, before/after diffs) and auditability will be critical for adoption.
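The retrieval stack sketched in the long-term items above, region-level embeddings queried just-in-time with provenance preserved, can be illustrated with cosine similarity over toy vectors. The embeddings here are hand-made stand-ins for the outputs of a multimodal embedding model such as Nomic Embeddings:

```python
import math

# Toy region index: each drawing region keeps its provenance (sheet, bbox)
# alongside an embedding vector.
regions = [
    {"sheet": "A-101", "bbox": (0, 0, 50, 50),    "vec": (0.9, 0.1, 0.0), "label": "stair detail"},
    {"sheet": "A-501", "bbox": (60, 10, 120, 40), "vec": (0.1, 0.9, 0.1), "label": "door schedule"},
    {"sheet": "S-201", "bbox": (5, 5, 80, 60),    "vec": (0.0, 0.2, 0.9), "label": "beam section"},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, regions, k=1):
    """Return the top-k regions by cosine similarity, with provenance intact
    so downstream steps can cite sheet and coordinates as evidence."""
    return sorted(regions, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)[:k]

hit = retrieve((0.85, 0.15, 0.05), regions)[0]
print(hit["sheet"], hit["label"])  # -> A-101 stair detail
```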
Glossary
- agentic systems: AI systems that can plan and act autonomously, often using tools and multi-step reasoning. "a multimodal benchmark for evaluating agentic systems on real-world tasks in the Architecture, Engineering, and Construction (AEC) domain."
- alignment-aware text editing: Editing that preserves positional alignment and layout when modifying document text. "using alignment-aware text editing that preserves visual fidelity."
- automated verifier: A programmatic grader that checks an agent’s final outputs against predefined criteria or answers. "Each instance is scored using a task-specific automated verifier against known ground truth."
- callouts: Annotated labels with leader lines pointing to elements in a drawing to provide details or references. "A visually dense drawing sheet with tightly packed annotations, callouts, and linework typical of real construction coordination artifacts."
- coding-agent harnesses: Execution frameworks that let LLMs act as coding agents with tools (e.g., shell, code execution) to solve tasks. "Foundation models equipped with coding-agent harnesses demonstrate strong capabilities in software engineering workflows, where agents can search repositories, edit code, and verify outputs through tool use."
- cross-document coordination: Ensuring consistency and resolving conflicts across multiple documents within a project. "fail on visual grounding, exhaustive traversal, and cross-document coordination."
- cross-reference resolution: The process of verifying that references within drawings point to valid, correct targets. "For cross-reference resolution, the edits introduce subtle reference-number inconsistencies that require cross-sheet verification;"
- cross-sheet navigation: Moving between different sheets/pages of a drawing set to trace references and gather context. "including intra-sheet review, cross-sheet navigation, and project-level document alignment."
- cross-sheet references: Explicit links or pointers from one drawing sheet to elements on another sheet. "geometric regions, and cross-sheet references."
- detail titles: Titles labeling specific drawing details that should accurately describe the associated content. "verifying detail titles"
- drawing set: A multi-page collection of construction drawings that collectively define design intent and requirements. "In practical terms, a drawing set is a multi-page, multimodal instruction package in which meaning is conveyed through structured visual and textual elements such as plans, details, callouts, notes, and title blocks."
- drawing-navigation: A task type focused on locating the correct file, sheet, or detail in a drawing corpus. "Similar gains are observed for spec-drawing-sync +20.8% and drawing-navigation +18.75%,"
- execution primitives: Low-level, reliable operations provided by a system that agents can compose to perform tasks. "These systems rely on structured, verifiable environments and well-defined execution primitives, which enable reliable multi-step reasoning and self-correction."
- exhaustive traversal: Systematic, comprehensive exploration of documents or structures to ensure no relevant elements are missed. "fail on visual grounding, exhaustive traversal, and cross-document coordination."
- geometric reasoning: Interpreting and reasoning about shapes, spatial relationships, and geometry within visual documents. "vision-based tools lack the quality needed for reliable geometric reasoning."
- ground truth: The authoritative correct answers or annotations used to evaluate model outputs. "Each instance is scored using a task-specific automated verifier against known ground truth."
- Harbor harness: The execution framework used to run tasks and interface with agents via standardized tools and formats. "Tasks are defined using the Harbor task format and executed within the Harbor harness"
- Intra-Drawing: Tasks that require reasoning across multiple sheets within the same drawing set. "Intra-Drawing: Tasks that require reasoning across multiple sheets within the same drawing set."
- Intra-Project: Tasks that involve coordination across multiple documents (drawings, specifications, submittals). "Intra-Project: Tasks that involve multiple documents, such as drawings, specifications, and submittals."
- Intra-Sheet: Tasks solvable using information on a single sheet/page. "Intra-Sheet: Tasks that can be completed using a single sheet (one PDF page)."
- leader lines: Thin lines that connect callout text to the specific element being annotated in a drawing. "the model must trace leader lines and related geometry purely visually to correctly identify and report issues;"
- linework: The lines and graphical elements composing the visual structure of technical drawings. "A visually dense drawing sheet with tightly packed annotations, callouts, and linework typical of real construction coordination artifacts."
- Nomic Embeddings: Embedding models provided by Nomic to create structured representations for retrieval and analysis. "Nomic Parse and Nomic Embeddings"
- Nomic Parse: A parsing tool by Nomic that extracts structured elements (text, layout, references) from PDFs. "Nomic Parse and Nomic Embeddings"
- parse augmentation: Enhancing an agent setup by adding parsed, structured document representations to improve performance. "This side-by-side layout isolates the effect of parse augmentation while keeping the underlying harness fixed."
- pdftoppm: A command-line utility to rasterize PDF pages into images (PPM/PNG formats). "Codex agents also rely heavily on rasterization via pdftoppm (96% of runs), whereas Claude agents use it far less frequently (32% of runs)."
- pdftotext: A command-line utility that extracts text from PDFs, often losing spatial layout information. "Across all models, 77% of trajectories invoke pdftotext, indicating strong convergence toward a pdftotext extraction pipeline;"
- pixel-level differencing: Image comparison method that highlights exact pixel changes between two versions of a document. "pixel-level differencing (before, after, and diff)"
- project-level document alignment: Ensuring consistency and agreement across drawings, specifications, and related project documents. "including intra-sheet review, cross-sheet navigation, and project-level document alignment."
- rasterization: Converting vector-based PDF content into raster images for visual processing. "Codex agents also rely heavily on rasterization via pdftoppm"
- retrieval-sensitive tasks: Tasks where success is dominated by the ability to find the correct evidence before reasoning. "Under the H+Nomic setting, we observe substantial average gains across models, specifically on retrieval-sensitive tasks."
- sheet index: The index listing of sheets in a drawing set, often cross-checked against titles and content. "Compare sheet index entries against title blocks for mismatches."
- specifications: Formal written documents detailing materials, standards, and requirements that complement drawings. "such as project specifications"
- structured representations: Parsed, schema-like data extracted from documents that preserve layout and relationships. "agents are provided with structured representations of drawings, including extracted text, layout elements, and reference relationships"
- submittals: Contractor-provided documents (e.g., product data, samples) for approval against project requirements. "Evaluate submittals for compliance with specs and drawings."
- symbol-centric understanding: The ability to recognize and interpret standard symbols and notations used in technical drawings. "still lack robust drawing literacy, especially for symbol-centric understanding."
- title blocks: Standardized sections on drawing sheets containing metadata like project name, sheet number, and author. "Compare sheet index entries against title blocks for mismatches."
- tool ablations: Systematic removal or variation of tools in experiments to measure their impact on performance. "We report results across tool ablations of our AEC agent harness."
- toolchain: A set of tools used in sequence to process documents into analyzable formats. "Our PDF toolchain breaks these multi-page documents into structured, machine-readable data"
- traceability: The ability to trace extracted or edited elements back to their original source locations. "We cache these artifacts with source coordinates for full traceability."
- visual grounding: Tying textual findings or references to the correct visual regions or elements in documents. "fail on visual grounding, exhaustive traversal, and cross-document coordination."
- visual-spatial grounding: Accurately mapping and reasoning about spatial relationships in visual documents to support precise localization. "On tasks that rely more heavily on visual-spatial grounding."