AECV-Bench: Evaluating AEC Drawing Literacy
- AECV-Bench is a benchmarking suite designed to evaluate multimodal models’ ability to interpret architectural drawings, focusing on graphical, spatial, and symbolic reasoning.
- It features two primary tasks—object counting and drawing-grounded document QA—leveraging rigorously annotated datasets with clear object definitions and detailed evaluation metrics.
- Experiments show that while models excel at reading room labels, they struggle with symbol-centric tasks such as door and window recognition, highlighting a need for domain-specific improvements.
AECV-Bench is a benchmarking suite designed to systematically evaluate the capabilities of multimodal and vision-LLMs for interpreting architectural and engineering drawings. Unlike benchmarks focused solely on textual or domain knowledge, AECV-Bench targets the graphical, spatial, and symbolic language intrinsic to AEC artefacts, including floor plans and annotated construction documents. It provides rigorous assessment through two core use cases—object counting and drawing-grounded document question answering (QA)—offering granular results that illuminate current system strengths and systematic weaknesses in drawing literacy for automated AEC workflows (Kondratenko et al., 8 Jan 2026).
1. Dataset Construction and Annotation Protocols
AECV-Bench is organized into two principal datasets: the object-counting subset and the drawing-grounded document QA subset.
- Object-counting subset includes 120 rasterized floor plan images sourced from CubiCasa5K, CVC-FP, and other public collections. Four object categories are annotated: doors (openable swing doors; double-leaf counted as two), windows (including grouped windows), bedrooms (explicit tags in multiple languages), and toilets (inclusive of showers and baths but not lavatories). Across the 120 plans, the annotations total approximately 1,200 doors, 720 windows, 360 bedrooms, and 240 toilets.
- Drawing-grounded QA subset comprises 192 annotated question-answer pairs distributed across 21 plans. Questions test OCR (e.g., extracting scale from title blocks), instance counting (e.g., section-view callouts), spatial reasoning (e.g., identifying footing types at grid intersections), and comparative reasoning (e.g., determining the largest room).
Annotation conventions enforce strict object definitions and exclusion rules (e.g., pantry sliders are excluded as doors), with every QA pair recorded in a JSON schema containing image identifiers, question metadata, and reference evidence for human adjudication. Object labels and QA answers are grounded in explicit symbols, tags, and drawing regions to minimize ambiguity and maximize interpretability.
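To make the annotation schema concrete, the following is a minimal sketch of what a single drawing-grounded QA record could look like; the field names (image_id, question_type, reference_evidence, etc.) are illustrative assumptions, not the benchmark's exact keys.

```python
# Hypothetical QA record; only the general structure (image identifier,
# question metadata, reference evidence for adjudication) follows the paper's
# description. All key names and values below are assumed for illustration.
qa_record = {
    "image_id": "plan_017.png",        # rasterized drawing the question refers to
    "question_id": "plan_017_q03",
    "question_type": "spatial",        # e.g., ocr | counting | spatial | comparison
    "question": "Which footing type is specified at grid intersection C-4?",
    "answer": "F2",
    "reference_evidence": "Footing schedule callout adjacent to grid C-4",
}
```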
2. Task Definitions and Evaluation Metrics
Two complementary task families are defined:
- Object Counting: Given a floor-plan image, models are prompted to enumerate objects in all four classes and return the results in a constrained JSON format. Typical queries instruct the model to "fully understand the drawing first, then count doors, windows, bedrooms, toilets, and return ONLY the JSON." (An illustrative output shape is sketched after this list.)
- Drawing-Grounded Document QA: Each QA pair consists of a drawing and a natural-language question. Model responses span OCR (plain text extraction), counting (numeric/JSON), spatial location identification, and comparison (determining maximum/minimum properties).
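The constrained counting output presumably maps each of the four classes to an integer; the exact keys below are an assumption based on the annotated categories rather than the benchmark's verbatim schema.

```python
import json

# Hypothetical constrained output for the counting task, plus a minimal check
# that a model response parses and covers exactly the four expected classes.
raw_response = '{"doors": 14, "windows": 9, "bedrooms": 3, "toilets": 2}'
counts = json.loads(raw_response)
assert set(counts) == {"doors", "windows", "bedrooms", "toilets"}
```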
Metrics are designed for accuracy and robustness:
- Object Counting: Per-field exact-match accuracy (EM) and mean absolute percentage error (MAPE), computed for each of the four object classes (a minimal computation sketch follows this list).
- QA Tasks: Binary accuracy per question, reported overall and broken down by category. Responses are adjudicated automatically by an LLM-as-a-judge pipeline, and flagged edge cases (~15%) are manually reviewed.
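As a point of reference, here is a minimal sketch of how per-field EM and MAPE can be computed from predicted and ground-truth counts; it follows the standard definitions of the two metrics and is not the benchmark's reference implementation (in particular, the handling of zero ground-truth counts is an assumption).

```python
def exact_match(preds, golds):
    """Fraction of plans where the predicted count equals the annotated count exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def mape(preds, golds):
    """Mean absolute percentage error; plans with a zero ground-truth count are
    skipped here, which is an assumption rather than the benchmark's stated rule."""
    pairs = [(p, g) for p, g in zip(preds, golds) if g != 0]
    return 100.0 * sum(abs(p - g) / g for p, g in pairs) / len(pairs)

# Example: door counts predicted by a model vs. ground truth on three plans.
pred_doors = [12, 10, 7]
gold_doors = [14, 10, 8]
print(exact_match(pred_doors, gold_doors))  # 0.333...
print(mape(pred_doors, gold_doors))         # ~8.9
```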
3. Models Benchmarked and Results
AECV-Bench evaluates both proprietary and open-source multimodal and vision-LLMs using a unified protocol. Proprietary models include Google Gemini 3 Pro, OpenAI GPT-5.2, Anthropic Claude Opus 4.5, xAI Grok 4.1 Fast, Amazon Nova 2 Lite, and Cohere Command A Vision. Open-source offerings include Mistral Large 3, Qwen3 VL 8B Instruct, GLM-4.6V, and NVIDIA Nemotron Nano 12B V2 VL.
Object Counting Results
| Model | Mean EM | Door EM | Window EM | Bedroom EM | Toilet EM | Mean MAPE (%) |
|---|---|---|---|---|---|---|
| Gemini 3 Pro | 0.51 | 0.39 | 0.34 | 0.89 | 0.82 | 16.0 |
| GPT-5.2 | 0.49 | 0.28 | 0.27 | 0.91 | 0.76 | 19.4 |
| Claude 4.5 | 0.42 | 0.16 | 0.16 | 0.91 | 0.76 | 24.9 |
| GLM-4.6V | 0.39 | 0.09 | 0.03 | 0.79 | 0.82 | 29.5 |
| Mistral 3 | 0.32 | 0.10 | 0.09 | 0.74 | 0.46 | 39.2 |
Bedroom and toilet counting approaches expert-level exact-match accuracy (∼0.74–0.91); doors and windows lag (∼0.09–0.39), with 20–50% MAPE for these symbol-centric fields. This disparity highlights the difficulty models face in interpreting line-art symbols.
QA Results
| Model | Overall QA Acc. |
|---|---|
| Gemini 3 Pro | 0.854 |
| GPT-5.2 | 0.792 |
| Claude 4.5 | 0.719 |
| GLM-4.6V | 0.604 |
| Grok 4.1 | 0.312 |
Accuracy by QA type:
- OCR: up to 0.95 (Gemini 3 Pro), typically 0.70–0.95
- Spatial reasoning: 0.60–0.75
- Instance counting: 0.40–0.55
- Comparative reasoning: 0.65–0.80
A stable capability gradient emerges: text-centric OCR and document QA are reliably solved, spatial reasoning is moderate, and symbol-centric drawing understanding (especially for doors and windows) remains largely unsolved.
4. Failure Modes, Identified Gaps, and Recommendations
Analysis reveals several persistent deficits:
- Drawing literacy: Models struggle with consistent interpretation of line-art symbols such as door swings and window breaks, with high variance across CAD conventions.
- Symbol understanding: Substantial proportional errors for door and window classes indicate a lack of generalizable symbolic reasoning.
- Over-reliance on OCR: Bedroom and toilet counts are frequently deduced from room labels rather than symbol parsing.
Authors recommend progress via:
- Domain-specific representations: Hybrid raster–vector encoders, graph-based scene parsers, and neuro-symbolic modules capable of explicit topological reasoning;
- Tool augmentation: Integration of specialized symbol detectors (e.g., YOLO variants), geometric solvers, and connected-component analysis with LLMs to enhance object recognition (a hedged sketch follows this list);
- Human-in-the-loop systems: Exposure of model uncertainty (low-confidence counts), pipelines for active learning via human correction, and the use of structured outputs (room lists, adjacency graphs) instead of unconstrained generation.
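To illustrate how tool augmentation and human-in-the-loop review could compose, the sketch below reconciles counts from a hypothetical specialized symbol detector with an LLM's counts and routes disagreements to a human annotator; the function and both input dictionaries are assumptions for illustration, not an existing API or the authors' pipeline.

```python
def reconcile_counts(llm_counts: dict, detector_counts: dict, tolerance: int = 1):
    """Cross-check LLM-reported counts against a specialized symbol detector and
    flag disagreements for human review instead of trusting either source blindly."""
    reconciled, review_queue = {}, []
    for cls in ("doors", "windows", "bedrooms", "toilets"):
        llm_c, det_c = llm_counts.get(cls), detector_counts.get(cls)
        if det_c is None:                  # no detector available for this class
            reconciled[cls] = llm_c
        elif llm_c is not None and abs(llm_c - det_c) <= tolerance:
            reconciled[cls] = det_c        # sources agree within tolerance
        else:
            reconciled[cls] = det_c        # prefer the detector, but escalate
            review_queue.append((cls, llm_c, det_c))
    return reconciled, review_queue

# Hypothetical usage: detector counts would come from, e.g., a YOLO-style symbol model.
counts, to_review = reconcile_counts(
    {"doors": 9, "windows": 6, "bedrooms": 3, "toilets": 2},
    {"doors": 12, "windows": 6, "bedrooms": 3},
)
print(to_review)  # [('doors', 9, 12)] -> route this plan's door count to a human
```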
5. Limitations and Scope of AECV-Bench
AECV-Bench has several acknowledged limitations:
- Dataset size: 120 plans and 192 QA pairs provide limited coverage of global AEC conventions and drawing styles.
- Raster-only input: CAD/BIM vector semantics (DWG, IFC) are excluded, precluding evaluation of rich geometric and semantic content.
- Single-image evaluation: No cross-sheet reasoning (e.g., callout tracing across multiple drawings).
- Object classes restricted: Only four types (doors, windows, bedrooms, toilets); others (stairs, columns, HVAC) are not considered.
These limitations circumscribe the generality of model analyses and prevent extrapolation to comprehensive industry workflows. This suggests that expanded datasets and richer input modalities are necessary for robust benchmarking.
6. Future Directions and Extensions
Authors articulate several actionable paths forward:
- Expansion with diverse, industry-sourced drawings and broader query types;
- Multi-page, cross-referenced evaluation protocols to test longitudinal reasoning and callout tracing;
- A continuous public leaderboard with periodic refresh as new models become available;
- Development of specialized "AEC-native" models pre-trained on CAD/BIM exports (DWG, IFC) and rich annotated symbology.
A plausible implication is that future benchmarks will emphasize multi-modal fusion (raster, vector, semantic) and tool-augmented architectures to tackle unresolved challenges in drawing literacy. All code, data, and evaluation scripts are released at https://github.com/AECFoundry/AECV-Bench to encourage community development and iterative improvement (Kondratenko et al., 8 Jan 2026).
7. Impact and Position within the Literature
AECV-Bench establishes a rigorous blueprint for benchmarking multimodal AI capabilities in practical AEC contexts. It complements cognitive benchmarks such as AECBench (Liang et al., 23 Sep 2025), which focus on textual and domain knowledge evaluation for LLMs, by highlighting the unresolved gap in symbol and spatial reasoning over graphical artefacts fundamental to architectural and engineering workflows. In contrast to benchmarks like AetherVision-Bench (Sikdar et al., 4 Jun 2025), which target open-vocabulary segmentation and pixel-wise multimodal robustness, AECV-Bench is distinguished by its task focus (object counting, drawing-grounded QA), annotation fidelity, and emphasis on operational AEC deliverables.
The collected results indicate that contemporary models function well as document assistants but lack robust drawing literacy, motivating research into domain-specific representations, hybrid tool-augmented workflows, and human-in-the-loop protocols for safe and efficient AEC automation.