DrafterBench: LLM Benchmark for Drawing Revisions

Updated 20 July 2025
  • DrafterBench is a benchmark evaluating LLM agents on technical drawing revision tasks with a focus on civil engineering.
  • It organizes 1920 tasks across 12 types using custom PDF and computer vision tools to assess structured data understanding and function execution.
  • The framework rigorously measures code executability and multi-step workflow completeness using a detailed scoring methodology.

DrafterBench is an open-source benchmark designed for the rigorous evaluation of LLM agents in the automation of technical drawing revision tasks, with a particular emphasis on civil engineering applications. It addresses the need for systematic and industrially relevant testing of LLM-based automation agents, focusing on the challenging domain of drawing revision—a critical, monotonous, and high-stakes activity in construction and design workflows. The benchmark provides a structured suite of tasks, tools, and evaluation methodologies to quantify an agent's proficiency in interpreting complex instructions, manipulating structured graphical and textual data, and applying policies in dynamic, ambiguous scenarios (Li et al., 15 Jul 2025).

1. Motivation and Context

DrafterBench was motivated by the demand in civil engineering to automate labor-intensive and error-prone tasks related to technical drawing revision. In practice, revising texts, tables, and vector entities within engineering drawings is a frequent activity, with even minor mistakes potentially causing significant project setbacks. Traditional manual approaches are inefficient, and the potential of LLMs for task automation in this domain is of growing industrial interest. DrafterBench aims to provide a robust benchmark for evaluating LLM agents on realistic, high-complexity revisions that mirror actual workflows encountered by civil engineers and construction professionals.

2. Benchmark Organization and Task Taxonomy

DrafterBench is structured around tasks distilled from more than 100 real-world drawing revision files. The benchmark comprises:

  • 12 task types, each defined over combinations of three target elements (text, table, vector entity) and four categories of operations (adding, content modification, mapping, and format updating).
  • A total of 1920 tasks, each annotated with six parameters controlling difficulty, such as instruction structure (structured vs. unstructured), completeness (complete vs. incomplete), value precision (precise vs. vague), the number of objects and operations (single/multiple), and more.
  • 46 customized functions/tools built for PDF-based drawing revision. These functions utilize established libraries like PyMuPDF, ReportLab, OpenCV, and pytesseract, enabling the simulation of real-world drawing manipulation, including extraction, modification, and saving of graphical and textual elements.
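To make the tool layer concrete, the following is a minimal sketch of what one such PDF revision function could look like when built on PyMuPDF; the function name, signature, and replacement strategy are illustrative assumptions rather than the benchmark's actual API.

```python
# Hypothetical text-replacement tool in the spirit of DrafterBench's PDF toolkit.
# The name and signature are illustrative; the benchmark's 46 functions may differ.
import fitz  # PyMuPDF

def replace_text(pdf_path: str, page_index: int, old: str, new: str, out_path: str) -> int:
    """Replace occurrences of `old` on one page with `new` and return the match count."""
    doc = fitz.open(pdf_path)
    page = doc[page_index]
    hits = page.search_for(old)                      # bounding boxes of matching text
    for rect in hits:
        page.add_redact_annot(rect)                  # mark the old text for removal
    page.apply_redactions()                          # erase the marked regions
    for rect in hits:
        page.insert_text(rect.bl, new, fontsize=8)   # write the replacement near the old position
    doc.save(out_path)
    doc.close()
    return len(hits)
```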

Task levels are designed to reflect varying industrial complexity, and difficulty is carefully controlled to assess both basic and advanced capabilities of automation agents.

3. Evaluation Criteria and Methodology

DrafterBench rigorously examines LLM agent performance across four key capabilities:

  1. Structured Data Understanding: The agent's ability to accurately parse and comprehend instructions across varying verbosity and formats.
  2. Function Execution: Accurate calling of drawing revision functions, which entails appropriate argument definition and adherence to data type requirements.
  3. Instruction Following: Execution of complete workflows (e.g., file operations, element deletions/modifications) in accordance with explicit user commands and implicit operational policies.
  4. Critical Reasoning: Handling of vague or incomplete instructions; the agent must log or infer missing information, avoiding unwarranted changes.

The benchmark employs a detailed two-level evaluation:

  • Level 1: Assesses the executability of agent-generated code, which must run without errors against dual functions that record every revision.
  • Level 2: Compares the logged operation path of the agent’s code against ground-truth operation paths for target completeness. Subtasks in this layer include argument definition, variable transfer, function/tool selection, and multi-step plan execution.
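The dual-function mechanism can be pictured as a thin recording wrapper around each tool, so that every call made by agent-generated code is appended to an operation log that is later compared against the ground-truth path. The sketch below is an assumption about how such logging might be structured, not the benchmark's own implementation.

```python
# Minimal sketch of the "dual function" recording idea: a mirrored version of each
# tool logs its invocation before delegating. All names here are illustrative.
import functools

OPERATION_LOG: list[dict] = []

def recording(tool):
    """Wrap a drawing-revision tool so every invocation is appended to the log."""
    @functools.wraps(tool)
    def dual(*args, **kwargs):
        OPERATION_LOG.append({"tool": tool.__name__, "args": args, "kwargs": kwargs})
        return tool(*args, **kwargs)
    return dual

@recording
def modify_table_cell(table_id: str, row: int, col: int, value: str) -> None:
    ...  # the actual PDF manipulation would happen here
```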

The scoring framework allocates 30 points for executability and 70 points distributed across the six subtasks measuring target completeness. The principal equation for scoring subtasks applies:

$$\text{Score} = \frac{\mathrm{TP} - \mathrm{FP}}{\mathrm{TP} + \mathrm{FN}} \times \frac{70}{6}$$

where $\mathrm{TP}$ denotes true positives, $\mathrm{FP}$ false positives, and $\mathrm{FN}$ false negatives. Additional measures, such as Intersection over Union for complex plan execution, further refine the completeness assessment.
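As a worked illustration of this arithmetic, the snippet below computes a single subtask score from raw counts, alongside an Intersection-over-Union measure for plan execution; it mirrors the formulas described above but is not the benchmark's own scoring code.

```python
# Sketch of the scoring arithmetic: each of the six completeness subtasks
# contributes up to 70/6 points; executability contributes the remaining 30.

def subtask_score(tp: int, fp: int, fn: int) -> float:
    """(TP - FP) / (TP + FN), scaled to one subtask's 70/6-point budget."""
    if tp + fn == 0:
        return 0.0
    return (tp - fp) / (tp + fn) * (70 / 6)

def plan_iou(predicted_steps: set[str], gold_steps: set[str]) -> float:
    """Intersection over Union between predicted and ground-truth operation steps."""
    union = predicted_steps | gold_steps
    return len(predicted_steps & gold_steps) / len(union) if union else 1.0

# Example: 5 correct operations, 1 spurious, 1 missing -> (5 - 1) / (5 + 1) * 70/6 ≈ 7.78
print(round(subtask_score(tp=5, fp=1, fn=1), 2))
```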

4. Technical Implementation

DrafterBench provides both the benchmark data and a suite of dual (mirrored and recording) functions for each tool, enabling granular logging and analysis of agent behavior. Subtasks including argument specification, function invocation, and workflow orchestration require generation of executable Python code that interfaces with the provided toolkits.
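The kind of multi-step, executable Python an agent is expected to emit can be illustrated as follows; the tool stubs are hypothetical placeholders standing in for the benchmark's customized functions, and the workflow simply shows how argument specification, variable transfer, and step ordering must line up.

```python
# Self-contained illustration of the workflow-orchestration subtask: the agent must
# chain toolkit calls in the right order with correctly typed arguments.
# The stub tools below are hypothetical placeholders, not DrafterBench's API.

def open_drawing(path: str) -> dict:
    return {"path": path, "ops": []}

def revise_table_cell(doc: dict, table_id: str, row: int, col: int, value: str) -> None:
    doc["ops"].append(("revise_table_cell", table_id, row, col, value))

def update_revision_tag(doc: dict, old: str, new: str) -> None:
    doc["ops"].append(("update_revision_tag", old, new))

def save_drawing(doc: dict, out_path: str) -> None:
    doc["ops"].append(("save", out_path))

# Agent-style plan: a content modification followed by a format update, then save.
doc = open_drawing("bridge_plan.pdf")
revise_table_cell(doc, table_id="rebar_schedule", row=3, col=2, value="Ø16 @ 150 mm")
update_revision_tag(doc, old="Rev. A", new="Rev. B")
save_drawing(doc, "bridge_plan_revB.pdf")
```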

Tools are largely constructed atop open-source PDF and computer vision libraries, maximizing accessibility and facilitating real-world adoption. All benchmark resources—including explicit test sets and agent prompt templates—are made available on GitHub and Hugging Face, supporting open experimentation and extensions.

5. Agent Performance and Insights

Comprehensive testing covered LLM agents from the OpenAI, Anthropic, DeepSeek, Qwen, and Llama model families. Key empirical observations include:

  • The best models (e.g., OpenAI's o1) achieved average total task scores of ~80/100, suggesting current LLMs can automate many subtasks but encounter difficulties under industrial-grade complexity.
  • Agents performed well in structured data interpretation and tool selection, but errors increased notably in plan execution, especially where multi-step, coordinated operations were necessary.
  • An observed performance gap of ~20 percentage points between simple argument definition tasks and the more demanding plan execution highlights limitations in current LLMs’ critical reasoning and policy adherence in real-world workflows.
  • Agents struggled with ambiguous instructions, often failing to log missing data or infer parameter values, capabilities that are essential in professional contexts.

6. Design Significance and Open Challenges

DrafterBench’s contribution lies in its multidimensional, industry-inspired approach to LLM benchmarking:

  • It spans a realistic set of task types and difficulty levels, providing fine-grained diagnostics on automation agent proficiency in an industrial setting.
  • The benchmark's explicit separation of code executability and workflow completeness supports targeted error analysis and improvement prioritization.
  • Its open-source release, with reproducible resources for both evaluation and agent deployment, ensures extensibility for community-driven improvements.

Persisting challenges identified by DrafterBench include LLMs’ handling of incomplete or vague instructions, capacity for inferring missing information, and flexibility in adopting custom policies without model-level resistance. Addressing these areas is critical for the reliable industrial adoption of LLM-driven automation.

7. Future Directions

Several avenues for further research and benchmark evolution are informed by initial evaluation outcomes:

  • Enhanced Interactive Agents: Improving LLMs’ capacity to recognize incomplete tasks and apply context-appropriate policies rather than defaulting to clarification or placeholder output.
  • Instruction Contextualization: Fostering agent abilities to infer user intent and operationalize vague instructions through logical reasoning and policy enforcement.
  • Custom Policy Integration: Enabling LLMs to reconcile intrinsic model behaviors with evolving, domain-specific best practices and regulatory protocols in engineering applications.

The open-access nature of DrafterBench supports continuous benchmark evolution and the emergence of new evaluation paradigms as the automation landscape matures in civil engineering and related disciplines.


DrafterBench thus provides a robust, multi-dimensional framework for benchmarking LLM-based automation in civil engineering drawing revision, setting a foundation for transferable evaluation methodologies in other industrial domains (Li et al., 15 Jul 2025).
