Paper2SysArch: Automated Diagram Extraction
- The paper introduces a novel benchmark and multi-agent pipeline that extracts structured system architecture diagrams, ensuring semantic fidelity and hierarchical precision.
- It leverages a three-phase annotation and evaluation process, integrating semantic, layout, and visual quality metrics for comprehensive diagram assessment.
- The system achieves state-of-the-art performance with editable, structure-aware outputs, setting new standards for reproducibility and visualization in research.
Paper2SysArch is both a large-scale benchmark and an agent-driven system for automated extraction of structured system architecture diagrams from scientific papers. It addresses the fundamental challenge that manually creating such diagrams is time-consuming, subjective, and lacks standardization, while existing generative methods have lacked the structural fidelity and semantic precision required for reproducible research. The approach establishes a new evaluation benchmark and proposes a comprehensive pipeline for turning a paper’s full text into structured, editable architecture diagrams with explicit structural constraints (Guo et al., 22 Nov 2025).
1. Benchmark Foundation: Dataset Construction and Structure
The Paper2SysArch benchmark consists of 3,000 research papers paired with carefully curated "ground-truth" system architecture diagrams. Source papers are drawn from leading venues in artificial intelligence, systems, and machine learning (CVPR, NeurIPS, OSDI, etc.) over the preceding five years. Images and text are parsed using PyMuPDF, and vision-language models (VLMs) classify and filter for diagrams that illustrate the main methodological system of each paper (confidence threshold ≥ 0.75). Each selected diagram is validated, both manually and automatically, for centrality to the paper, explicit structural clarity, and a minimum complexity of at least three components.
Annotation leverages a three-phase process:
- Phase I: Principle distillation by experts to define semantic and structural quality.
- Phase II: Multi-agent annotation leveraging extracted principles for automated labeling.
- Phase III: Cross-validation and harmonization by expert annotators.
Each diagram is annotated using a three-layer hierarchical "graphJSON" schema:
- Module Layer: High-level methodological or functional stages.
- Tool/Data Layer: Execution units and data objects.
- Component Layer: Atomic visual elements such as UI elements, images, text, and icons.
Edges are permitted only between sibling nodes under a common parent, which enforces modular encapsulation and rules out cross-hierarchy connections. The test subset (108 papers) covers AI systems, computer vision, NLP, and core ML, and is documented to be diverse in both structure and domain (Guo et al., 22 Nov 2025).
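The three-layer schema and the sibling-only edge rule can be illustrated with a small validator. This is a minimal sketch, not the benchmark's actual graphJSON format: the field names (`id`, `layer`, `children`, `edges`), the layer identifiers, and the choice to store edges on the parent container are all assumptions made for illustration.

```python
# Assumed identifiers for the module, tool/data, and component layers.
LAYERS = ["module", "tool_data", "component"]

def validate(container, depth=0, path="root"):
    """Collect violations of the layer-depth and sibling-only edge rules."""
    problems = []
    children = container.get("children", [])
    child_ids = {c["id"] for c in children}
    # Edges stored on a container may only join that container's direct children.
    for src, dst in container.get("edges", []):
        if src not in child_ids or dst not in child_ids:
            problems.append(f"{path}: edge {src}->{dst} is not between siblings")
    for child in children:
        if depth >= len(LAYERS) or child["layer"] != LAYERS[depth]:
            problems.append(f"{child['id']}: layer {child['layer']!r} invalid at depth {depth}")
        problems.extend(validate(child, depth + 1, child["id"]))
    return problems

# Hypothetical diagram: the second module contains an edge that reaches
# into the first module, which the sibling-only rule forbids.
diagram = {
    "children": [
        {"id": "encoder", "layer": "module",
         "children": [{"id": "tokenizer", "layer": "tool_data"},
                      {"id": "embeddings", "layer": "tool_data"}],
         "edges": [["tokenizer", "embeddings"]]},
        {"id": "decoder", "layer": "module",
         "children": [{"id": "sampler", "layer": "tool_data"}],
         "edges": [["tokenizer", "sampler"]]},
    ],
    "edges": [["encoder", "decoder"]],
}

print(validate(diagram))
```

Storing edges on the parent makes the sibling constraint a local check: each container only needs to know its own children.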
2. Multi-Dimensional Evaluation Metric
Evaluation of system-architecture extraction incorporates three principal axes:
Semantic Accuracy:
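Hierarchical graph matching is performed using a composite node similarity score that combines four terms:
- Text similarity: label/description similarity (BERT-based).
- Degree match: a penalty on mismatched out-/in-degree.
- Ancestor similarity: text similarity of the ancestor chains.
- Neighborhood agreement: the proportion of neighboring nodes that are matched.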
Metrics report node, edge, and hierarchy consistency after optimal matching.
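The matching step can be sketched as follows. This is an illustration under stated assumptions, not the paper's scoring code: token-level Jaccard overlap stands in for the BERT-based similarity, the terms are blended with equal weights, the degree penalty takes an assumed `1 / (1 + gap)` form, and the neighborhood term is omitted because it depends on the matching itself. The optimal assignment is found by exhaustive search, which is only practical for a handful of nodes.

```python
from itertools import permutations

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for the BERT-based text score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def node_similarity(p: dict, g: dict) -> float:
    """Equal-weight blend of label, degree, and ancestor terms (weights assumed)."""
    s_text = jaccard(p["label"], g["label"])
    deg_gap = abs(p["in"] - g["in"]) + abs(p["out"] - g["out"])
    s_deg = 1.0 / (1.0 + deg_gap)  # assumed form of the degree penalty
    s_anc = jaccard(" ".join(p["ancestors"]), " ".join(g["ancestors"]))
    return (s_text + s_deg + s_anc) / 3.0

def best_matching(pred: list, gold: list) -> list:
    """Exhaustive one-to-one matching that maximizes total similarity."""
    n, m = len(pred), len(gold)
    sim = [[node_similarity(p, g) for g in gold] for p in pred]
    if n <= m:
        candidates = (list(zip(range(n), perm)) for perm in permutations(range(m), n))
    else:
        candidates = (list(zip(perm, range(m))) for perm in permutations(range(n), m))
    best = max(candidates, key=lambda pairs: sum(sim[i][j] for i, j in pairs))
    return [(pred[i]["id"], gold[j]["id"]) for i, j in best]

# Hypothetical predicted and ground-truth nodes.
pred = [
    {"id": "p1", "label": "image encoder", "in": 0, "out": 1, "ancestors": ["perception"]},
    {"id": "p2", "label": "text decoder", "in": 1, "out": 0, "ancestors": ["generation"]},
]
gold = [
    {"id": "g1", "label": "text decoder", "in": 1, "out": 0, "ancestors": ["generation"]},
    {"id": "g2", "label": "image encoder", "in": 0, "out": 1, "ancestors": ["perception"]},
]

print(best_matching(pred, gold))  # [('p1', 'g2'), ('p2', 'g1')]
```

Node, edge, and hierarchy consistency would then be computed over the matched pairs; for larger graphs a polynomial-time assignment solver would replace the exhaustive search.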
Layout Coherence:
A specialized agent counts edge crossings, element overlaps, and text overflows. The layout score is derived from these counts using a fixed penalty δ.
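A geometric version of these counts can be sketched as below. The scoring form `max(0, 100 - delta * violations)`, the default `delta`, and the straight-line edge model are assumptions for illustration only; the source describes an agent that counts violations and applies a fixed penalty δ, and does not give this exact formula here.

```python
from itertools import combinations

def boxes_overlap(a, b):
    """Axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p)."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)

def segments_cross(s, t):
    """True when two straight edges properly intersect; shared endpoints are ignored."""
    (p1, p2), (p3, p4) = s, t
    if {p1, p2} & {p3, p4}:
        return False
    return (_orient(p1, p2, p3) * _orient(p1, p2, p4) < 0
            and _orient(p3, p4, p1) * _orient(p3, p4, p2) < 0)

def layout_score(boxes, edges, text_overflows=0, delta=5.0):
    """Assumed scoring form: subtract a fixed penalty per violation from 100."""
    overlaps = sum(boxes_overlap(a, b) for a, b in combinations(boxes, 2))
    crossings = sum(segments_cross(s, t) for s, t in combinations(edges, 2))
    return max(0.0, 100.0 - delta * (overlaps + crossings + text_overflows))

# Hypothetical layout: the first two boxes overlap and the two edges cross.
boxes = [(0, 0, 10, 10), (5, 5, 10, 10), (30, 0, 10, 10)]
edges = [((0, 0), (10, 10)), ((0, 10), (10, 0))]
print(layout_score(boxes, edges))  # 90.0
```

Text overflow is passed in as a precomputed count because detecting it needs font metrics that this sketch does not model.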
Visual Quality:
Icon relevance, global system understandability, and text legibility are appraised by multi-modal agents, using semantic alignment (VLM), simulated summaries, and detection of blur/ambiguity, respectively.
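Scores are aggregated as a weighted sum. Weights of 0.3 (semantic), 0.3 (layout), and 0.4 (visual) reproduce every Overall value in the results table of Section 4:
$$S_{\text{overall}} = 0.3\,S_{\text{semantic}} + 0.3\,S_{\text{layout}} + 0.4\,S_{\text{visual}}$$
This weighting favors visual fidelity, reflecting the pragmatic function of the diagrams (Guo et al., 22 Nov 2025).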
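The aggregation can be checked against the results table in Section 4. The sketch below assumes weights of 0.3 (semantic), 0.3 (layout), and 0.4 (visual); these are inferred from the reported numbers rather than quoted from the paper, and they reproduce each reported Overall score to one decimal place.

```python
# Inferred weights (an assumption): they match the reported Overall column.
WEIGHTS = {"semantic": 0.3, "layout": 0.3, "visual": 0.4}

def overall(semantic: float, layout: float, visual: float) -> float:
    """Weighted sum of the three axis scores, rounded to one decimal."""
    total = (WEIGHTS["semantic"] * semantic
             + WEIGHTS["layout"] * layout
             + WEIGHTS["visual"] * visual)
    return round(total, 1)

# (semantic, layout, visual, reported overall) from the results table.
REPORTED = {
    "DALL-E 3": (29.9, 20.7, 65.2, 41.3),
    "Nanobanana": (31.2, 80.3, 72.8, 62.6),
    "GPT-4o + GraphViz": (42.0, 85.5, 71.2, 66.7),
    "Paper2SysArch (Qwen2)": (38.7, 76.8, 84.6, 68.5),
    "Paper2SysArch (GPT-4o)": (29.8, 83.9, 87.3, 69.0),
}

for name, (sem, lay, vis, reported) in REPORTED.items():
    print(f"{name}: computed {overall(sem, lay, vis)}, reported {reported}")
```

Because the visual term carries the largest weight, a method with weak semantic accuracy can still rank first overall, which is what the table shows for the GPT-4o variant.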
3. Paper2SysArch System Pipeline and Agent Architecture
The Paper2SysArch system is an end-to-end, multi-agent framework employing a collaborative LLM-based workflow to generate editable diagrams:
Phase I: Semantic Parsing & Macro-Planning
- AnalystAgents parse paper texts, extracting task goals, major modules, principal data flows, algorithms, and constraints.
- ArchitectAgents instantiate a coarse top-level (L₁) module graph.
Phase II: Distributed Drafting & Global Alignment
- DesignerAgents independently instantiate L₂, L₃ subgraphs for each module, capturing tool/data and component layers.
- All graphs are resolved under GlobalContext for ID unification, interface consistency, and naming conventions.
Phase III: Topological Regularization
- Edge constraints (sibling-only, hierarchical) are enforced through LLM-based semantic filtering and programmatic pruning, guaranteeing structurally legal outputs.
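The programmatic half of this phase can be sketched as a filter over a flat edge list. The `parent_of` map and the `prune_edges` helper are hypothetical names introduced for illustration; the LLM-based semantic filtering that the pipeline also applies is not modeled here.

```python
def prune_edges(parent_of, edges):
    """Split edges into legal (same parent) and rejected (cross-hierarchy or unknown)."""
    kept, dropped = [], []
    for src, dst in edges:
        same_parent = (src in parent_of and dst in parent_of
                       and parent_of[src] == parent_of[dst])
        (kept if same_parent else dropped).append((src, dst))
    return kept, dropped

# Hypothetical hierarchy: top-level modules share the parent None.
parent_of = {
    "encoder": None, "decoder": None,
    "tokenizer": "encoder", "embeddings": "encoder",
    "sampler": "decoder",
}
edges = [
    ("encoder", "decoder"),       # siblings at the module layer
    ("tokenizer", "embeddings"),  # siblings inside "encoder"
    ("tokenizer", "sampler"),     # crosses a module boundary
    ("embeddings", "decoder"),    # crosses layers
    ("tokenizer", "ghost"),       # references an unknown node
]

kept, dropped = prune_edges(parent_of, edges)
print(kept)
print(dropped)
```

Returning the rejected edges alongside the kept ones leaves room for a later step to inspect or reroute them instead of discarding them silently.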
Phase IV: Adaptive Layout and Multi-modal Rendering
- Node dimensions scale with text length; global placement is computed using the Eclipse Layout Kernel (ELK) for overlap-free visualization.
- Text-to-image models synthesize icons; the result is compiled into both high-res PNG and editable PowerPoint (pptx) outputs.
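The layout input for this phase can be sketched as a conversion from a nested graph to ELK's JSON graph format, with leaf sizes derived from label length. The pixel metrics, the wrapping rule, and the input field names are assumptions; the output is only the JSON that a layout engine such as ELK or elkjs would consume, and no layout is computed here.

```python
import json

CHAR_W, LINE_H, PAD = 7, 18, 12  # assumed pixel metrics, not taken from the paper

def node_size(label, max_chars=24):
    """Grow width with text length up to a cap, then wrap onto more lines."""
    lines = max(1, -(-len(label) // max_chars))  # ceiling division
    width = min(len(label), max_chars) * CHAR_W + 2 * PAD
    return width, lines * LINE_H + 2 * PAD

def to_elk(node):
    """Convert a nested {'id', 'label', 'children', 'edges'} dict to ELK JSON."""
    out = {"id": node["id"], "labels": [{"text": node["label"]}]}
    children = node.get("children", [])
    if children:
        # Containers are sized by the layout engine; edges stay with their parent.
        out["children"] = [to_elk(c) for c in children]
        out["edges"] = [
            {"id": f"{node['id']}:{s}->{t}", "sources": [s], "targets": [t]}
            for s, t in node.get("edges", [])
        ]
    else:
        out["width"], out["height"] = node_size(node["label"])
    return out

# Hypothetical one-module diagram.
root = {
    "id": "root", "label": "",
    "children": [
        {"id": "encoder", "label": "Encoder",
         "children": [{"id": "tokenizer", "label": "Tokenizer"},
                      {"id": "embeddings", "label": "Embedding lookup table"}],
         "edges": [("tokenizer", "embeddings")]},
    ],
    "edges": [],
}

graph = to_elk(root)
graph["layoutOptions"] = {
    "elk.algorithm": "layered",
    "elk.direction": "RIGHT",
    "elk.hierarchyHandling": "INCLUDE_CHILDREN",
}
print(json.dumps(graph, indent=2))
```

Only leaves receive explicit dimensions, so the engine is free to size each container around its laid-out children.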
LLMs (e.g., GPT-4o, Qwen2.5-VL) act as the primary semantic agents, CLIP and Sentence-BERT compute graph-element similarity, and ELK handles layout optimization. Because the structural constraints are enforced within the pipeline, outputs conform to the required structure and visual rules by design (Guo et al., 22 Nov 2025).
4. Empirical Results and Baseline Comparison
Experimental evaluation on the manually labeled test subset (108 papers across four domains) compared Paper2SysArch against both modern text-to-image and code-to-image baselines. Results demonstrate:
| Method | Semantic | Layout | Visual | Overall | Editable |
|---|---|---|---|---|---|
| DALL-E 3 | 29.9 | 20.7 | 65.2 | 41.3 | ✗ |
| Nanobanana | 31.2 | 80.3 | 72.8 | 62.6 | ✗ |
| GPT-4o + GraphViz | 42.0 | 85.5 | 71.2 | 66.7 | ✔ |
| Paper2SysArch (Qwen2) | 38.7 | 76.8 | 84.6 | 68.5 | ✔ |
| Paper2SysArch (GPT-4o) | 29.8 | 83.9 | 87.3 | 69.0 | ✔ |
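Paper2SysArch (GPT-4o) achieved the highest overall score (69.0), driven mainly by the best visual quality (87.3) and a strong layout score (83.9, second only to GPT-4o + GraphViz at 85.5). Its semantic accuracy, however, is the lowest in the table (29.8) and trails the code-driven GPT-4o + GraphViz baseline (42.0); the Qwen2 variant scores higher on semantics (38.7) but lower on layout and visual quality. The system's editable, structure-aware pptx output offers significant practical advantages over pixel-based diagrams (Guo et al., 22 Nov 2025).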
Relative to related paper-to-artifact benchmarks (e.g., Paper2Poster, Paper2Video), Paper2SysArch is distinguished by its scale (3,000 pairs, multi-domain), its strict structural constraints, and the granularity of its multi-layer annotation and evaluation protocol.
5. Structural Constraints and Agentic Enforcement
A defining feature is the enforcement of structure via graphJSON constraints:
- Three explicit layers: module, tool/data, component.
- Edges strictly allowed only among siblings, reflecting modular encapsulation.
- The constraints are implemented through prompt engineering for the LLM agents and post-processing logic.
All agents conform to this inductive hierarchy, which prevents cross-level or loose interconnections and ensures the resulting diagrams can be both directly interpreted and semantically grounded for subsequent toolchain use.
6. Limitations and Prospects for Future Research
Paper2SysArch exposes persistent challenges:
- Semantic matching of complex, long-range structural dependencies in free-text remains a bottleneck.
- Aesthetic layout is limited by current deterministic, open-source algorithms (ELK), restricting style diversity.
- The evaluation relies on VLM-based metrics, inheriting possible biases and stability issues inherent to foundation models.
Identified avenues for advancement include:
- Incorporation of task-specific graph neural modules to improve semantic extraction and node/edge alignment.
- Learnable layout modules trained end-to-end on expert-drawn diagrams.
- Integration of user-in-the-loop correction for ambiguous or incomplete architectural information.
- Extending methods to other visual artifact types (flows, UML, multilingual diagrams) (Guo et al., 22 Nov 2025).
These developments would further bridge the gap between unstructured scientific text and high-fidelity, machine-usable architectural knowledge.
7. Broader Implications for Automated Scientific Visualization
Paper2SysArch establishes a robust, reproducible, and extensible framework for transitioning from unstructured academic paper text to structured, editable, system-architecture artifacts. By providing the first comprehensive, large-scale evaluation benchmark and a baseline agent pipeline, it catalyzes progress in (1) automating labor-intensive diagram creation, (2) standardizing scientific visualization outputs, and (3) enabling deep meta-analysis and mining of system architecture trends across scientific corpora. The approach enables new applications in intelligent authoring environments, augmented reading interfaces, and large-scale architectural mapping in research analytics (Guo et al., 22 Nov 2025).