Benchmarking and Evaluating VLMs for Software Architecture Diagram Understanding

Published 5 Apr 2026 in cs.SE | (2604.04009v1)

Abstract: Software architecture diagrams are important design artifacts for communicating system structure, behavior, and data organization throughout the software development lifecycle. Although recent progress in LLMs has substantially advanced code-centric software engineering tasks such as code generation, testing, and maintenance, the ability of modern vision-LLMs (VLMs) to understand software architecture diagrams remains underexplored. To address this gap, we present SADU, a benchmark for Software Architecture Diagram Understanding that evaluates VLMs on architecture diagrams as structured software engineering artifacts rather than generic images. SADU contains 154 carefully curated diagrams spanning behavioral, structural, and ER diagrams, paired with structured annotations and 2,431 question-answer tasks covering counting and retrieval reasoning. We evaluate 11 state-of-the-art VLMs from the Gemini, Claude, GPT, and Qwen families. Our results show that software architecture diagram understanding remains challenging for current models: the best-performing model gemini-3-flash-preview achieves only 70.18\% accuracy, while gpt-4o-mini only achieves 17.77\% accuracy. The results further reveal the weaknesses in diagram reasoning and visual relation grounding, highlighting a gap between current VLMs and the needs of design-stage software engineering. SADU provides a foundation for future research on diagram-aware AI systems and more faithful AI-assisted software engineering workflows.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper presents SADU, a benchmark constructed from 154 curated diagrams with 2,431 QA pairs to assess VLMs on diagram recognition and reasoning tasks.
The study finds that state-of-the-art VLMs achieve best accuracy around 70%, revealing significant challenges in handling diagram complexity and structured spatial information.
Results highlight critical failure modes like counting errors and retrieval inconsistencies, underscoring the need for improved domain-specific prompts and spatial reasoning.

Benchmarking Vision-LLMs for Software Architecture Diagram Understanding

Motivation and Problem Setting

Software architecture diagrams serve as essential design artifacts in the software engineering lifecycle, encoding structural, behavioral, and data-centric information about system components and their interactions. Despite the proficiency of contemporary LLMs in code-generative tasks, the diagram-centric stages of software engineering remain vastly underexplored, particularly in the context of machine comprehension of visual artifacts. The paper presents SADU ("Software Architecture Diagram Understanding"), a benchmark specifically constructed to quantitatively evaluate VLMs (Vision-LLMs) on diagram-aware tasks, probing both recognition and reasoning in a modality closely tied to design semantics.

Figure 1: Overview of the SADU construction pipeline.

SADU Benchmark: Construction and Scope

SADU comprises 154 diagrams spanning behavioral, structural, and ER (Entity–Relationship) families collected from varied sources: Azure Architecture documentation, PyUNML, and several diagram-design platforms. Careful curation ensured high visual quality and coverage diversity, discarding low-resolution, unreadable, or non-architectural artifacts. Each diagram is annotated in a machine-readable JSON schema capturing entities, relations (with directionality and labels), clusters, attributes, and methods, establishing deterministic ground-truth for evaluation.

A set of 2,431 question-answer pairs targets 24 subtype templates, split between counting (e.g., entity cardinality) and retrieval (e.g., correct relation extraction). The QA pairs are designed to probe the models' ability to perform both perceptual recognition and higher-order reasoning.

Figure 2: Distribution and definitions for counting questions across software architecture diagrams in SADU.

Experimental Design and VLM Evaluation

The study evaluates 11 VLMs across the Gemini, Claude, GPT, and Qwen families, encompassing both proprietary (e.g., gemini-3-flash-preview, claude-sonnet-4.5, gpt-5-nano) and open-weight models (e.g., qwen-2.5-VL-32B/7B/3B). Each QA task is prompted with explicit domain definitions and answering constraints, ensuring standardized input and facilitating robust comparative analysis. Model outputs are assessed by both rule-based deterministic scoring (exact match, tolerance, F1, subset/superset behaviors) and LLM-as-a-judge evaluation, the latter providing a more flexible semantic assessment.

Empirical Results and Analysis

Aggregate Model Accuracy

Empirical results indicate that software architecture diagram understanding remains an unsolved challenge for current VLMs. The strongest model, gemini-3-flash-preview, attains an overall accuracy of only 70.18%, followed by Gemini variants and Claude/gpt-5-nano in the mid-50% range. Open-source models (Qwen family) and lighter OpenAI variants perform substantially worse. Notably, higher model capacity does not guarantee higher efficiency—gpt-5-nano uses nearly 1,500 completion tokens per request, yet fails to outperform Gemini in accuracy.

Accuracy varies strongly by diagram type: ER diagrams are comparatively easier (e.g., gemini-2.5-flash reaches 82.54%), while behavioral diagrams induce greater model variance. Diagram complexity negatively correlates with accuracy, emphasizing the difficulty of visually dense or structurally intricate artifacts.

Figure 3: Model accuracy as a function of diagram complexity (entity and relation count).

Counting and Retrieval Failure Modes

Counting errors are predominantly near-miss (off-by-one/two) for stronger models (MAE $\leq 1$ ), but degrade sharply for weaker models (MAE $\geq 3$ ). Stronger VLMs maintain balanced bias in counting, while weaker models systematically under-count elements in complex diagrams, often missing entire clusters or relations. Tolerance-based metrics reveal that even when exact match fails, most errors for the strongest models remain within two counts.

Retrieval tasks are notably harder; even the best models frequently omit relevant elements (high subset rates) or introduce extraneous ones (superset errors are more prevalent in weaker models). Precision and recall metrics confirm partial lexical similarity is commonplace, but semantic grounding is inconsistent.

Prompting and Reasoning Configuration

The inclusion of domain-specific definitions in prompts yields substantial accuracy gains; explicit answering rules exert smaller but measurable effects. Absent definitions or rules, accuracy declines significantly. Interestingly, reasoning configuration (thinking levels) does not monotonically improve accuracy—Gemini's low thinking setting coupled with full prompt achieves the best result (72.85%). Overthinking phenomena (task confusion, scope inflation) induce more errors at higher reasoning levels, observed as both over-deliberated and erroneously substituted answers.

Evaluation Protocol Consistency

LLM-as-a-judge and rule-based scoring are highly correlated ( $\approx$ 90%+ agreement for most models), but LLM judges are more permissive for numerically or lexically “close” answers, and can occasionally accept semantically inconsistent responses that diverge from gold annotation.

Case Study: Visual Relation Reasoning

Hard cases such as long, multi, non-rightward, and overlapping arrows introduce significant visual grounding challenges in SADU. Strongest models (Gemini-3.1-flash-lite-preview) achieve up to 93.33% on Long Arrow but drop drastically (20–33%) on Overlap Arrow and Non-Rightward Arrow cases. These failures highlight brittleness in relational parsing and spatial inference. Models seldom generalize robustly to non-standard connector layouts, suggesting that visual relational semantics are a critical bottleneck in VLM architecture diagram understanding.

Limitations

The SADU benchmark omits fine-grained spatial annotation, restricting the granularity of failure-mode analysis, especially for visual ambiguity caused by layout geometry. Some diagrams may have appeared in VLM training corpora, but diversity and task focus are argued to mitigate memorization risk.

Implications and Future Directions

The findings indicate that current VLMs lack the structured reasoning and spatial abstraction required for robust diagram understanding, with practical deployment in high-stakes engineering settings not yet viable. Accurate comprehension of architecture diagrams is necessary for consistent upstream–downstream alignment in software development; current models remain unreliable.

Future developments should target architecture-aware visual reasoning, spatially grounded multimodal parsing, and domain-adaptive prompting strategies. Integration of geometric features, relational embeddings, and intent-level schema extraction will be crucial. SADU establishes a rigorous foundation for subsequent benchmarks and architectural AI workflows, enabling systematic measurement of progress.

Conclusion

Software architecture diagram understanding is a fundamentally different challenge from generic VQA, requiring invariance under layout complexity, visual ambiguity, and domain abstractions. SADU reveals a substantial accuracy gap and numerous structured failure modes in current VLMs. The benchmark and analyses provide actionable baselines and guidance for accelerating progress toward reliable diagram-aware AI systems in software engineering.

Markdown Report Issue