VFig-Data: Diagram-to-SVG Corpus

Updated 4 July 2026

VFig-Data is a paired raster–vector dataset of 66 500 high-quality image/SVG pairs for complex diagram vectorization.
It combines a two-stage curation of real-world figures with a procedural synthetic generation pipeline to obtain clean, structurally rich diagrams.
The corpus supports supervised training and benchmarking of figure-to-SVG conversion with structure-oriented metrics.

VFIG-DATA, more precisely VFig-Data, is a large-scale corpus of 66 500 high-quality image/SVG pairs introduced with “VFIG: Vectorizing Complex Figures in SVG with Vision-LLMs” for figure-to-SVG conversion of complex, high-fidelity diagrams (He et al., 25 Mar 2026). It was constructed to address the mismatch between the data demands of figure vectorization and the limitations of earlier datasets, which are described as typically small-scale and lacking the complexity of professional diagrams. The corpus combines approximately 6 500 real-world paper figures with approximately 60 000 programmatically generated diagrams, and it is situated within a broader framework that also includes a coarse-to-fine training curriculum and the VFIG-BENCH evaluation suite (He et al., 25 Mar 2026).

1. Corpus definition and source composition

VFig-Data is defined as a paired raster–vector dataset in which each example associates a rasterized figure image with its corresponding clean SVG representation. The paper positions it as the data foundation for recovering editable vector structure from “flat” raster figures such as PNG or JPEG files (He et al., 25 Mar 2026).

Its core composition is bimodal. The real-world subset contains approximately 6 500 examples harvested from arXiv (post-2025) and the Paper2Fig corpus. The programmatically generated subset contains approximately 60 000 examples produced by rendering synthetic SVGs to PNG. The real-world component targets flowcharts, AI-architecture diagrams, multi-panel scientific illustrations, process schematics, and block-diagram pipelines. The synthetic component covers block graphs, pipelines, network motifs, 3D primitives (cylinders, prisms), “blob” and “cloud” shapes, and abstract data-flow layouts (He et al., 25 Mar 2026).

Source	Approximate count	Content
Real-world paper figures	6 500	arXiv and Paper2Fig diagrams
Programmatically generated diagrams	60 000	Synthetic SVGs rendered to PNG
Separate training-only SVG mix	78 K	StarVector and Molmo2-Diagram

The ≈ 78 K academic SVGs from StarVector and Molmo2-Diagram are explicitly described as mixed into model training but separate from the 66 K VFig-Data core. This distinction is important because it separates the dataset proper from auxiliary training resources.

A central design feature is the emphasis on diagram classes with explicit geometric and topological structure. This is reflected in the real-world filtering policy, the synthetic generation process, and the later use of structure-oriented evaluation metrics. A plausible implication is that VFig-Data is intended less as a generic vectorization corpus than as a specialized resource for diagrammatic figures with recoverable compositional structure.

2. Curation methodology and synthetic generation

The real-world subset is curated through a two-stage procedure. First, input figures are classified by the VLM Gemini-3-Flash into four categories: KEEP, IMAGE, MATH, and PLOT. Only KEEP images are retained. Second, SVG outputs are code-filtered by grouping geometric elements into basic primitives $B=\{\text{rect},\text{circle},\text{ellipse}\}$, connectors $K=\{\text{line},\text{polyline}\}$, and complex shapes $C=\{\text{path},\text{polygon}\}$. With $B$, $K$, and $C$ read as the element counts of these groups and $N$ as their total, the enforced constraints are:

$(B+K)/N \ge 0.40$
$C \le 50$
$N = B+K+C$

Light cleaning then normalizes coordinate precision, removes redundant metadata, and standardizes viewBox/canvas settings (He et al., 25 Mar 2026).
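
The code-filtering stage described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `count_elements` and `passes_code_filter` are hypothetical, elements are counted anywhere in the SVG tree, and rejecting SVGs with no geometric elements at all is an assumption the source does not state.

```python
import xml.etree.ElementTree as ET

BASIC = {"rect", "circle", "ellipse"}      # group B
CONNECTORS = {"line", "polyline"}          # group K
COMPLEX = {"path", "polygon"}              # group C


def count_elements(svg_text):
    """Count basic primitives (B), connectors (K), and complex shapes (C)."""
    root = ET.fromstring(svg_text)
    counts = {"B": 0, "K": 0, "C": 0}
    for elem in root.iter():
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag in BASIC:
            counts["B"] += 1
        elif tag in CONNECTORS:
            counts["K"] += 1
        elif tag in COMPLEX:
            counts["C"] += 1
    return counts


def passes_code_filter(svg_text, min_clean=0.40, max_complex=50):
    """Apply (B + K) / N >= 0.40 and C <= 50, with N = B + K + C."""
    c = count_elements(svg_text)
    n = c["B"] + c["K"] + c["C"]
    if n == 0:
        return False  # assumption: an SVG without geometry is rejected
    return (c["B"] + c["K"]) / n >= min_clean and c["C"] <= max_complex
```

For example, an SVG containing one `rect`, one `line`, and one `path` has $(B+K)/N = 2/3$ and $C = 1$, so it passes both constraints.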

The synthetic subset is generated through a more elaborate procedural pipeline. Layouts are selected from 19 human-inspired templates; with probability 0.3, two templates are fused into left/right halves with random cross-links. Shape placement uses 18 shape types (12 flat shapes + 6 pseudo-3D), with each diagram drawing from a pool of 2–3 types, and placement is randomized subject to AABB collision avoidance. With probability 0.15, shapes are stacked 2–4 deep to simulate a 3D effect (He et al., 25 Mar 2026).
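
Randomized placement with AABB collision avoidance can be illustrated by simple rejection sampling. The sketch below is an assumption about how such a step might look, not the paper's procedure: `overlaps` and `place_shapes` are hypothetical names, and the canvas size, shape size range, margin, and retry budget are illustrative values.

```python
import random


def overlaps(a, b, margin=0.0):
    """True if two axis-aligned boxes (x, y, w, h) intersect, padded by margin."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return (ax < bx + bw + margin and bx < ax + aw + margin and
            ay < by + bh + margin and by < ay + ah + margin)


def place_shapes(n, canvas=(512, 512), size_range=(40, 100),
                 margin=8.0, max_tries=200, seed=None):
    """Place up to n boxes at random, rejecting candidates that collide."""
    rng = random.Random(seed)
    placed = []
    for _ in range(n):
        for _ in range(max_tries):
            w = rng.uniform(*size_range)
            h = rng.uniform(*size_range)
            x = rng.uniform(0, canvas[0] - w)
            y = rng.uniform(0, canvas[1] - h)
            box = (x, y, w, h)
            if not any(overlaps(box, other, margin) for other in placed):
                placed.append(box)
                break  # this shape is placed; move on to the next one
    return placed
```

A shape that cannot be placed within `max_tries` attempts is skipped, so the function may return fewer than `n` boxes on a crowded canvas.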

Styling is also randomized. Each shape samples one of 7 fill patterns—solid, hatching, dots, crosshatch, horizontal lines, linear gradient, radial gradient—together with one of 4 border dash styles—solid, dashed, dotted, dash–dot. Stroke widths, corner radii, and color palettes are varied, with $p=0.6$ for rounding. Text labels are drawn from a domain-specific lexicon and rendered in 1–2 fonts drawn from 8 families (He et al., 25 Mar 2026).
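
The per-shape style sampling can be sketched as follows. The option lists come from the description above; everything else is an assumption made for illustration: the function name `sample_style`, the identifier spellings, uniform sampling over the options (the source does not give the distribution), and the stroke-width range.

```python
import random

FILL_PATTERNS = ["solid", "hatching", "dots", "crosshatch",
                 "horizontal_lines", "linear_gradient", "radial_gradient"]
DASH_STYLES = ["solid", "dashed", "dotted", "dash-dot"]


def sample_style(rng, p_round=0.6):
    """Draw one style record for a shape."""
    return {
        "fill_pattern": rng.choice(FILL_PATTERNS),   # 1 of 7 (uniform: assumption)
        "border_dash": rng.choice(DASH_STYLES),      # 1 of 4 (uniform: assumption)
        "rounded": rng.random() < p_round,           # rounding with p = 0.6
        "stroke_width": rng.uniform(1.0, 3.0),       # hypothetical range
    }


style = sample_style(random.Random(0))
```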

Connection routing is parameterized by the number of shapes $n$ and a sampled number of directed edges

$c \sim \mathrm{Uniform}(\lfloor n r_l \rfloor,\lfloor n r_h \rfloor), \qquad r_l \sim \mathrm{Uniform}(0.4,0.6), \qquad r_h \sim \mathrm{Uniform}(0.6,0.8).$

Template hints supply the first links, and the remainder are random unique pairs. Arrows are straight (60%) or quadratic Bézier (40%), with sampled stroke width, head size, and dash style, and they attach at shape boundaries via analytical ray-casting or parametric intersection (He et al., 25 Mar 2026).
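
The edge-count sampling and boundary attachment can be sketched in Python. Several points are assumptions rather than details from the paper: the Uniform draw for $c$ is read as a discrete uniform over the inclusive integer range, template hints are omitted so that all links are random unique directed pairs, the attachment is shown only for rectangles, and all function names are hypothetical.

```python
import math
import random


def sample_edge_count(n, rng):
    """Draw c between floor(n * r_l) and floor(n * r_h), inclusive."""
    r_l = rng.uniform(0.4, 0.6)
    r_h = rng.uniform(0.6, 0.8)
    return rng.randint(math.floor(n * r_l), math.floor(n * r_h))


def sample_edges(n, rng):
    """Draw unique directed (src, dst) pairs with src != dst."""
    c = min(sample_edge_count(n, rng), n * (n - 1))  # cap at available pairs
    edges = set()
    while len(edges) < c:
        src, dst = rng.sample(range(n), 2)
        edges.add((src, dst))
    return sorted(edges)


def sample_arrow_kind(rng):
    """Straight with probability 0.6, otherwise quadratic Bezier."""
    return "straight" if rng.random() < 0.6 else "quadratic_bezier"


def rect_boundary_point(box, target):
    """Point where the ray from the box center toward target exits the box."""
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    dx, dy = target[0] - cx, target[1] - cy
    if dx == 0 and dy == 0:
        return (cx, cy)  # degenerate case: target is the center itself
    # Scale factor at which the ray reaches a vertical or horizontal edge.
    tx = (w / 2) / abs(dx) if dx else math.inf
    ty = (h / 2) / abs(dy) if dy else math.inf
    t = min(tx, ty)
    return (cx + t * dx, cy + t * dy)
```

Because $r_l \le 0.6 \le r_h$, the lower bound never exceeds the upper bound, so the integer range is always valid. Other shape types would need their own intersection routine, for instance a parametric solve for ellipses.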

This generation procedure makes explicit that the dataset is not merely a collection of paired files; it is a controlled distribution over layout, geometry, styling, and connectivity. That, in turn, explains why the dataset supports both supervised fitting of local SVG primitives and later optimization of global structural fidelity in the associated VFIG training pipeline (He et al., 25 Mar 2026).

3. File organization and annotation schema

Each example consists of a raster input and a vector target. The technical summary specifies the following file-level organization:

image.png: raster input, 512×512 or similar
ground-truth.svg: cleaned SVG code
metadata.json: synthetic subset only

A suggested directory layout places real and synthetic subsets in separate subtrees, with images, SVGs, and synthetic metadata stored in distinct folders. Because the summary explicitly labels this layout as “suggested,” it is best understood as a conventional organization rather than a mandatory on-disk specification.

The synthetic subset includes structured metadata per shape and per arrow. The schema includes:

shape attributes: shape_id, type, fill_color, stroke_color, fill_style, border_style, font_class, bbox, aspect_ratio
arrow attributes: arrow_id, src_shape_id, dst_shape_id, head_present, head_size, is_curved, overlap_penalty

Allowed values are also partially specified. For example, fill_style ranges over solid, dots, hatching, crosshatch, gradient_…, border_style over solid, dashed, dotted, dash-dot, and font_class over serif, sans-serif, monospace (He et al., 25 Mar 2026).
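
One way to work with this schema is to load each metadata record into typed containers and check the enumerated fields. The sketch below is illustrative only: the field names follow the schema above, but the Python types, the class names `ShapeMeta` and `ArrowMeta`, and the `validate` helper are assumptions. Because the gradient values are elided in the source, any `fill_style` starting with `gradient_` is accepted rather than guessing the full names.

```python
from dataclasses import dataclass
from typing import List, Tuple

FILL_STYLES = {"solid", "dots", "hatching", "crosshatch"}
BORDER_STYLES = {"solid", "dashed", "dotted", "dash-dot"}
FONT_CLASSES = {"serif", "sans-serif", "monospace"}


@dataclass
class ShapeMeta:
    shape_id: str
    type: str
    fill_color: str
    stroke_color: str
    fill_style: str
    border_style: str
    font_class: str
    bbox: Tuple[float, float, float, float]  # layout convention is an assumption
    aspect_ratio: float


@dataclass
class ArrowMeta:
    arrow_id: str
    src_shape_id: str
    dst_shape_id: str
    head_present: bool
    head_size: float
    is_curved: bool
    overlap_penalty: float


def validate(shapes: List[ShapeMeta], arrows: List[ArrowMeta]) -> List[str]:
    """Return human-readable problems; an empty list means the record is consistent."""
    problems = []
    ids = {s.shape_id for s in shapes}
    for s in shapes:
        if s.fill_style not in FILL_STYLES and not s.fill_style.startswith("gradient_"):
            problems.append(f"{s.shape_id}: unknown fill_style {s.fill_style!r}")
        if s.border_style not in BORDER_STYLES:
            problems.append(f"{s.shape_id}: unknown border_style {s.border_style!r}")
        if s.font_class not in FONT_CLASSES:
            problems.append(f"{s.shape_id}: unknown font_class {s.font_class!r}")
    for a in arrows:
        for end in (a.src_shape_id, a.dst_shape_id):
            if end not in ids:
                problems.append(f"{a.arrow_id}: references missing shape {end!r}")
    return problems
```

Checking that every arrow endpoint refers to an existing `shape_id` is a natural consistency test for this schema, since arrows are defined by the shapes they connect.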

The presence of per-shape and per-arrow metadata only for the synthetic subset is methodologically significant. It preserves paired supervision across the full corpus while providing richer structural annotations where exact ground-truth generation histories are available. This suggests a natural division between direct paired supervision on all samples and fine-grained structural supervision or analysis on the synthetic portion.

4. Statistical characterization and quality metrics

VFig-Data is statistically characterized at both the corpus and element levels. Averaged over the 66 K examples, the mean primitive counts per SVG are:

Basic primitives $B=(\text{rect}+\text{circle}+\text{ellipse})$: mean ≈ 30
Connectors $K=(\text{line}+\text{polyline})$: mean ≈ 10
Complex shapes $C=(\text{path}+\text{polygon})$: mean ≈ 5
Text elements: mean ≈ 12 (He et al., 25 Mar 2026)

The all-corpus histogram summary covers both geometric elements and text elements. The SVG file-size statistics are 18 KB for real-world examples and 9 KB for synthetic examples (He et al., 25 Mar 2026).

For the real-world subset only (6.5 K examples), the figure-type distribution is:

Figure type	Share
Flowcharts & pipelines	32%
Neural-network architectures	28%
Multi-panel illustrations	20%
Process schematics & block diagrams	12%
Other	8%

The corpus also defines explicit quality-oriented metrics: Element Complexity (EC), which is reported on the log scale, together with Semantic Cleanliness (Clean) and Path Dominance (PD). The cleanliness distribution indicates that roughly 80% of elements are primitives or connectors, which is the same share $(B+K)/N$ that the curation filter bounds from below at 0.40 (He et al., 25 Mar 2026).

These metrics are directly aligned with the curation policy. High cleanliness and controlled path dominance favor diagrams whose structure remains editable and semantically decomposable in SVG. The summary also lists a Diversity Score but labels it “optional; not in paper but commonly used.” This suggests that EC, Clean, and PD are the dataset’s canonical metrics, whereas the diversity score should not be treated as a core paper-defined measure.

5. Accessibility, licensing, uses, and limitations

The technical summary describes VFig-Data as licensed under Creative Commons Attribution 4.0 (CC BY 4.0), with dataset, code, and evaluation scripts publicly available at https://vfig-proj.github.io/ and direct download via GitHub/GCP bucket links. It further states that users must cite He et al. (2026) when using VFig-Data in publications, that there is no non-commercial restriction, and that redistribution is permitted under CC BY 4.0 terms.

The enumerated use cases include:

training and evaluation of vision–LLMs for figure-to-SVG conversion
pretraining vector-graphic synthesis systems
data augmentation for scientific-figure editing tools
benchmarking multi-modal program synthesis models on long-horizon code generation (He et al., 25 Mar 2026)

The dataset is introduced together with a model family and evaluation suite. In that broader setting, VFIG is reported to achieve state-of-the-art performance among open-source models and to perform on par with GPT-5.2, with a VLM-Judge score of 0.829 on VFIG-BENCH (He et al., 25 Mar 2026). Although that number is a model result rather than a dataset statistic, it clarifies the empirical role of VFig-Data within the full system.

The stated limitations are equally important. The corpus is described as heavily skewed toward scientific/engineering diagrams, with poor coverage of iconography, natural-scene vectorization, or highly textured SVG art. It also has limited font variety (8 families) and fill styles (7 types), and the real-world subset filtered out mathematical-equation-only figures and plots (He et al., 25 Mar 2026). A plausible implication is that models trained primarily on VFig-Data may generalize best to diagrammatic scientific figures and less well to plot-centric, equation-centric, or stylistically idiosyncratic vector graphics.

6. Terminological ambiguity and other uses of “VFIG-DATA”

In the supplied arXiv-derived materials, the label “VFIG-DATA” is not unique to the 2026 figure–SVG corpus. It is also attached to three unrelated technical objects.

First, it is used for a technical summary of the Faint Infrared Grism Survey (FIGS), a deep HST WFC3/IR G102 slitless spectroscopic survey of four deep fields with continuous $0.8$–$1.15\,\mu\mathrm{m}$ coverage, $\sim 100$ ks net on-grism integration per field, and a $3\sigma$ continuum limit of approximately 26 AB mag (Pirzkal et al., 2017). That usage concerns astronomy data products, extraction pipelines, and FITS delivery conventions rather than SVG vectorization.

Second, the label appears in a summary of FSVVD, a point-cloud-based volumetric-video dataset with 26 sequences, 4 categories, and capture using six Microsoft Azure Kinect sensors arranged perimeter-wise (Hu et al., 2023). That usage belongs to immersive multimedia and 3D scene capture.

Third, it is used for a data-valuation framework derived from “What is my data worth? From data properties to data value”, where data value is expressed as a multi-attribute utility over dataset properties and computed over 16 high-level facets (Kannan et al., 2018). That usage concerns data valuation, not figure datasets.

This multiplicity of senses indicates that “VFIG-DATA” is not a semantically stable cross-domain term in secondary summaries. In current research usage tied to the named 2026 VFIG system, however, VFig-Data denotes the 66 500-pair figure–SVG corpus for complex diagram vectorization (He et al., 25 Mar 2026).