Diagram Parse Graphs (DPGs)

Updated 10 April 2026
  • DPGs are structured symbolic representations that encode diagram elements like blobs, text, and geometric primitives along with their interrelations.
  • They are constructed using detection networks and memory-based models that iteratively predict probabilistic relations among identified nodes.
  • DPGs enable multi-modal reasoning for applications such as question answering, knowledge extraction, and geometric theorem proving.

A Diagram Parse Graph (DPG) is a structured, typed graph representation of the constituent elements and relationships present within a diagram. DPGs serve as an explicit, symbolic abstraction bridging the gap between raw diagram imagery and higher-level reasoning tasks such as question answering, knowledge extraction, or automated theorem proving. By encoding both the entities (e.g., blobs, geometric primitives, text) and their interrelations (e.g., labeling, geometric predicates, causality), DPGs provide a modality-agnostic interface suitable for multi-modal diagram understanding in both generic and domain-specific contexts (Kembhavi et al., 2016, Kim et al., 2017, Hao et al., 2022, Zhang et al., 2022).

1. Formal Definition and Structural Variants

A DPG is formally defined as a (typed, attributed) graph:

$G = (V, E)$

where $V$ denotes the set of nodes (diagram constituents) and $E$ the set of typed edges (relations among constituents).

Node Taxonomy:

  • Generic diagrams: Nodes correspond to blobs (illustrative objects), text boxes (OCR regions), arrow heads/tails, and domain-specific entities (Kembhavi et al., 2016, Kim et al., 2017).
  • Plane geometry diagrams: Nodes represent geometric primitives (points—including intersection, tangent, endpoint, and independent types; lines—solid, dashed, mixed; circles and arcs) and non-geometric primitives (symbols—perpendicular marks, bars, angle marks; text—labels, measures, etc.) (Hao et al., 2022, Zhang et al., 2022).

Edge Taxonomy:

Edges are directed and typed, encoding semantic or geometric relations:

  • Generic diagram types: Labeling (text→blob), linkage (arrow, connector), title/caption association, region labeling, etc. (Kembhavi et al., 2016).
  • Plane geometry types: Incidence (on-line, on-circle), center-of-circle, parallelism, perpendicularity, symbol→geometry (e.g., bar→equal-length), text→geometry (e.g., angle value→angle), text→symbol (Hao et al., 2022, Zhang et al., 2022).

Each edge $e \in E$ is typed and, when appropriate, may be represented as a tuple $(\text{subj},\, r,\, [\text{obj}])$, supporting relations of arity $> 2$ in certain implementations (Hao et al., 2022).

Attributes ($A_V$, $A_E$) supplement node and edge sets with class, bounding-box/pixel mask, class confidence, OCR string, and geometric parameters (position, radius, etc.) (Kembhavi et al., 2016, Zhang et al., 2022).
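The definition above can be illustrated with a small data structure. This is a minimal sketch, not the representation used by any of the cited systems: the class names (`Node`, `Edge`, `DiagramParseGraph`), the field names, and the relation name `"labels"` are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    node_id: str
    node_type: str                    # e.g. "blob", "text", "point", "line"
    bbox: Optional[tuple] = None      # (x_min, y_min, x_max, y_max) in pixels
    confidence: float = 1.0           # class confidence from the detector
    attrs: dict = field(default_factory=dict)  # e.g. OCR string, radius


@dataclass
class Edge:
    subj: str
    relation: str                     # edge type, e.g. "labels", "on_line"
    obj: Optional[str] = None         # None models a relation without an object
    confidence: float = 1.0


@dataclass
class DiagramParseGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        # Reject edges whose endpoints are not nodes of the graph.
        for end in (edge.subj, edge.obj):
            if end is not None and end not in self.nodes:
                raise KeyError(f"unknown node: {end}")
        self.edges.append(edge)

    def relations_of(self, node_id: str) -> list:
        # All edges in which the node appears as subject or object.
        return [e for e in self.edges if node_id in (e.subj, e.obj)]


# Hypothetical two-node graph: a text box labeling a blob.
dpg = DiagramParseGraph()
dpg.add_node(Node("b0", "blob", bbox=(10, 10, 60, 60)))
dpg.add_node(Node("t0", "text", bbox=(70, 20, 120, 35), attrs={"ocr": "Pulley"}))
dpg.add_edge(Edge("t0", "labels", "b0", confidence=0.93))
```

Keeping node attributes in a free-form dictionary is one way to let the same container serve both generic and geometry diagrams, whose attribute sets differ.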

2. Syntactic Parsing and DPG Construction Algorithms

Inferring a DPG from a raw diagram is a structured prediction problem: the parser must identify the diagram constituents, propose candidate relationships, and select the relational structure that best matches the ground-truth parse.

Generic Diagrams:

  • Object detection: SSD-style networks detect constituents, each with class, bounding box, and confidence (Kim et al., 2017).
  • Relationship proposal: Candidate (typically all) pairs or higher-arity groups are generated for edge classification (Kembhavi et al., 2016).
  • DPG Construction:
    • Dynamic Graph Generation Network (DGGN) processes candidate pairs using a gated recurrent unit (GRU) architecture with global context and a dynamic adjacency tensor memory (Kim et al., 2017). Each pair $(i, j)$ is updated iteratively, allowing probabilistic edge formation.
    • DSDP-Net (LSTM-based) consumes a sequence of relationship proposals, maintaining memory across previously accepted or rejected edges and producing a structured parse of viable edges (Kembhavi et al., 2016).
  • Training: Losses integrate object detection and edge prediction (softmax/CE); proposals are matched to ground truth via IoU thresholds and sampling for positive and negative relationships (Kim et al., 2017).
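The training step that matches proposals to ground truth can be sketched as follows. This is an illustrative sketch only: the 0.5 IoU threshold, the function names, and the choice to enumerate all ordered pairs are assumptions, and the positive/negative sampling mentioned above is omitted.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, thresh=0.5):
    """Map each detection index to the ground-truth index with the best IoU >= thresh."""
    matches = {}
    for i, det in enumerate(detections):
        best_j, best_v = None, thresh
        for j, gt in enumerate(gt_boxes):
            v = iou(det, gt)
            if v >= best_v:
                best_j, best_v = j, v
        if best_j is not None:
            matches[i] = best_j
    return matches


def label_pairs(detections, gt_boxes, gt_edges, thresh=0.5):
    """Label every ordered detection pair 1 if it maps onto a ground-truth edge, else 0."""
    matches = match_detections(detections, gt_boxes, thresh)
    labeled = []
    for i, j in permutations(range(len(detections)), 2):
        positive = (
            i in matches and j in matches and (matches[i], matches[j]) in gt_edges
        )
        labeled.append(((i, j), int(positive)))
    return labeled


# Hypothetical toy input: three detections, two ground-truth boxes, one true edge.
detections = [(0, 0, 10, 10), (20, 0, 30, 10), (50, 50, 60, 60)]
gt_boxes = [(1, 0, 11, 10), (20, 1, 30, 11)]
gt_edges = {(1, 0)}  # ground-truth edge from gt box 1 to gt box 0
pair_labels = dict(label_pairs(detections, gt_boxes, gt_edges))
```

In this toy input only the ordered pair `(1, 0)` is labeled positive; the third detection matches no ground-truth box, so every pair involving it becomes a negative example.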

Geometry Diagrams:

  • Primitive extraction: Segmentation and detection modules (e.g., instance segmentation for points, lines, circles; FCOS/Mask-RCNN for symbols/text) localize and classify primitives (Hao et al., 2022, Zhang et al., 2022).
  • Feature extraction: Primitive-level features include visual (FPN/ROI-aligned), location (coordinate/radius), and class semantic embeddings (Zhang et al., 2022).
  • Relation parsing: Edge proposals are filtered by geometric priors (e.g., allowing only point–line incidence and line–line parallelism/perpendicularity). GNNs then apply attention-based message passing to produce relation existence probabilities (Zhang et al., 2022).
  • Multi-task end-to-end training: Detection, segmentation, and relational reasoning losses are balanced to improve joint performance and minimize error propagation (Zhang et al., 2022).
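The prior-based filtering of edge proposals in the relation parsing step can be sketched as a lookup over primitive types. The table `ALLOWED_RELATIONS`, the relation names, and the function `propose_edges` are hypothetical; the sketch only illustrates how type constraints shrink the candidate set before a GNN scores it, and is not the PGDPNet implementation.

```python
from itertools import combinations

# Hypothetical prior: which relation types are admissible for a (subject, object) type pair.
ALLOWED_RELATIONS = {
    ("point", "line"): ("on_line",),
    ("point", "circle"): ("on_circle", "center_of"),
    ("line", "line"): ("parallel", "perpendicular"),
}


def propose_edges(primitives):
    """primitives maps primitive id -> type; returns admissible (subj, relation, obj) triples."""
    candidates = []
    for a, b in combinations(sorted(primitives), 2):
        for subj, obj in ((a, b), (b, a)):
            # Same-type relations here are symmetric, so emit them in one direction only.
            if primitives[subj] == primitives[obj] and subj > obj:
                continue
            key = (primitives[subj], primitives[obj])
            for rel in ALLOWED_RELATIONS.get(key, ()):
                candidates.append((subj, rel, obj))
    return candidates


# Hypothetical primitives: two points, two lines, one circle.
primitives = {"A": "point", "B": "point", "l1": "line", "l2": "line", "c1": "circle"}
candidates = propose_edges(primitives)
```

For these five primitives the prior leaves 10 candidate triples, and pairs such as point–point or circle–line produce none.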

3. Edge and Node Typologies Across Domains

The permitted primitive and relation types—and their compositional rules—are domain-dependent.

| Domain | Primitives (Nodes) | Relations (Edges) |
|---|---|---|
| Generic/Science | Blob, text box, arrow tail, arrow head | Labeling, intra/inter-object linkage, arrow assignment, region, caption, title (Kembhavi et al., 2016) |
| Plane Geometry | Points (subtypes), lines (subtypes), circles/arcs, symbols (16+), text (6+) | On-line, on-circle, center, parallel, perpendicular, bar, angle, text/label-to-primitive, symbol-to-primitive (Hao et al., 2022, Zhang et al., 2022) |

In geometric DPGs, edge types are curated to align with formal geometric relations, supporting downstream automated reasoning and proposition generation in languages such as the Geometric Description Language (GDL) (Hao et al., 2022).

4. Annotation, Datasets, and Evaluation

Several major datasets support DPG research, providing annotated diagrams with primitive-level and relation-level ground truth.

AI2 Diagrams dataset (Kembhavi et al., 2016):

  • 5,000 diagrams (train/val/test split), 118k nodes, 53k relation instances, 15k associated multiple-choice QA pairs. Comprehensive annotation covers blobs, text regions, arrow heads/tails, and ten relation types.

PGDP5K (Hao et al., 2022):

  • 5,000 plane geometry diagrams, 16 shape types, 5 positional relations, 22 symbol types, 6 text types. Fine-grained primitive and relation annotation at the pixel/instance level. Annotation pipeline combines semi-automatic extraction and manual correction for robust labeling.

IMP-Geometry3K (used in benchmarking) (Hao et al., 2022):

  • Existing geometry parsing dataset; for comparison, PGDP5K is larger and more finely annotated.

Metrics: Precision, recall, and F1 for primitives (with a 15 px matching tolerance), and string matching for generated propositions. Reported results indicate F1 ≈ 86% for circles, ≈ 76–77% for points and lines, and an overall GDL-proposition generation F1 of 66.07% on PGDP5K, substantially lower than on less challenging testbeds (Hao et al., 2022).
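A tolerance-based primitive metric of this kind can be sketched for point primitives as follows. The matching rule is an assumption: the sketch uses greedy one-to-one matching on Euclidean distance within 15 px, which may differ from the benchmark's exact protocol, and the coordinates are made up for illustration.

```python
import math


def prf1(pred_points, gt_points, tol=15.0):
    """Precision, recall, and F1 with greedy one-to-one matching within tol pixels."""
    unmatched = list(gt_points)
    tp = 0
    for p in pred_points:
        best, best_d = None, tol
        for g in unmatched:
            d = math.dist(p, g)
            if d <= best_d:
                best, best_d = g, d
        if best is not None:
            unmatched.remove(best)  # each ground-truth point is matched at most once
            tp += 1
    precision = tp / len(pred_points) if pred_points else 0.0
    recall = tp / len(gt_points) if gt_points else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical predictions and ground truth (pixel coordinates).
pred = [(10, 10), (100, 100), (300, 300)]
gt = [(12, 14), (105, 95), (200, 200), (400, 400)]
precision, recall, f1 = prf1(pred, gt)
```

Here two of the three predictions fall within tolerance of a distinct ground-truth point, giving precision 2/3, recall 1/2, and F1 = 4/7.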

Bottlenecks and Challenges:

  • Long-tailed distributions for rare classes/relations.
  • Complex layouts, overlapping or low-contrast primitives.
  • Subtle variations in symbol rendering.
  • Hidden or ambiguous relationships (e.g., tangent points, multi-arc geometries) (Hao et al., 2022).

5. Reasoning Applications and Semantic Leveraging

DPGs are foundational for diagram-centric reasoning in both the vision and mathematical domains.

  • QA over science diagrams: DPG-based attention enables alignment of textual questions/answers with diagram fact strings via embedding similarity, with question–answer pairs mapped to DPG relations (Kembhavi et al., 2016).
  • Plane geometry theorem proving: The explicit, fine-grained nature of geometric DPGs permits conversion to proposition templates (PointLiesOnLine, Perpendicular, etc.). These can be fed into downstream provers or symbolic reasoning modules (Hao et al., 2022).
  • Semantic integration: DPGs unify multi-modal evidence (visual, textual, symbolic) within a coherent symbolic intermediate, facilitating downstream applications and enabling modularity in complex reasoning pipelines (Kim et al., 2017).
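The conversion from geometric DPG edges to proposition templates can be sketched as a simple template lookup. Only the predicate names `PointLiesOnLine` and `Perpendicular` come from the text above; the relation keys, the argument syntax, the confidence cutoff, and the function `to_propositions` are assumptions for illustration.

```python
# Hypothetical mapping from edge types to proposition templates.
TEMPLATES = {
    "on_line": "PointLiesOnLine({subj}, {obj})",
    "perpendicular": "Perpendicular({subj}, {obj})",
}


def to_propositions(edges, min_conf=0.5):
    """Render (subj, relation, obj, confidence) edges as sorted proposition strings."""
    props = []
    for subj, rel, obj, conf in edges:
        template = TEMPLATES.get(rel)
        # Skip relations without a template and edges below the confidence cutoff.
        if template is None or conf < min_conf:
            continue
        props.append(template.format(subj=subj, obj=obj))
    return sorted(props)


# Hypothetical edges with made-up confidences.
edges = [
    ("A", "on_line", "Line(B, C)", 0.97),
    ("Line(A, B)", "perpendicular", "Line(C, D)", 0.88),
    ("A", "on_line", "Line(C, D)", 0.12),   # dropped: below cutoff
    ("t1", "text_to_symbol", "s1", 0.90),   # dropped: no template
]
propositions = to_propositions(edges)
```

Sorting the output makes the proposition set order-independent, which is convenient when generated propositions are compared against references by string matching.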

6. Model Families, Implementation Details, and End-to-End Examples

Generic DPG Parsing:

  • UDPNet: SSD backbone for detection; DGGN (GRU+dynamic memory) for edge prediction; joint loss over detection and relation existence (Kim et al., 2017).
  • DSDP-Net: Two-layer LSTM parses sequential candidate relations; end-to-end cross-entropy optimization (Kembhavi et al., 2016).

Plane Geometry Diagram Parsing:

  • PGDPNet: MobileNetV2+FPN backbone, FCOS detection for text/symbols, instance segmentation for geometry, EGAT-GNN for relation parsing, multi-branch loss aggregation for robust end-to-end supervision (Zhang et al., 2022).

Step-wise Example (summarized from Kim et al., 2017):

A pulley diagram is processed by SSD to detect its constituent blobs and text regions. The detected nodes are paired, features are extracted for each pair, and DGGN iteratively predicts edge existence. The resulting DPG contains an edge such as (blob@pulley → text@Pulley) when its predicted probability exceeds the threshold, and likewise for the other constituent links.
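The final thresholding step of this example can be sketched as follows. The edge probabilities are made-up values standing in for the output of an edge predictor, not results from the cited model, and the 0.5 threshold, the extra rope nodes, and the function `build_dpg` are likewise assumptions.

```python
def build_dpg(nodes, edge_probs, threshold=0.5):
    """Keep only the candidate edges whose predicted probability exceeds the threshold."""
    edges = [
        (subj, obj, prob)
        for (subj, obj), prob in edge_probs.items()
        if prob > threshold and subj in nodes and obj in nodes
    ]
    return {"nodes": dict(nodes), "edges": sorted(edges)}


# Hypothetical detections for a pulley diagram.
nodes = {
    "blob@pulley": "blob",
    "blob@rope": "blob",
    "text@Pulley": "text",
    "text@Rope": "text",
}

# Made-up edge probabilities for four candidate pairs.
edge_probs = {
    ("blob@pulley", "text@Pulley"): 0.94,
    ("blob@rope", "text@Rope"): 0.81,
    ("blob@pulley", "text@Rope"): 0.07,
    ("blob@rope", "text@Pulley"): 0.11,
}

pulley_dpg = build_dpg(nodes, edge_probs)
```

With these values the two matching blob–text pairs survive and the two mismatched pairs are discarded.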

7. Impact, Limitations, and Future Directions

DPGs formalize diagram understanding as a symbolic structure prediction task, supporting advances in vision-language reasoning, scientific illustration parsing, and geometric theorem-proving. The move from brittle rule-based systems to deep-learning-based, jointly optimized DPG extractors enables better generalization, scalability, and applicability in complex real-world diagrams (Kim et al., 2017, Hao et al., 2022).

Future research directions include:

  • Improving geometric primitive recognition, especially under occlusion and low contrast.
  • Relation parsing under long-tail class distributions and symbol variability.
  • Seamless integration of DPGs with downstream reasoning (e.g., theorem provers, multi-hop QA).
  • Development of larger and even more diverse annotated datasets.
  • Cross-domain DPG transfer and adaptation.

DPGs provide a critical structured interface for diagram-centric scientific AI, forming the backbone of multi-modal visual reasoning frameworks (Kembhavi et al., 2016, Kim et al., 2017, Hao et al., 2022, Zhang et al., 2022).
