VFIG-BENCH: Complex Figure-to-SVG Benchmark
- VFIG-BENCH is a unified benchmark for evaluating the quality of complex figure-to-SVG conversion with emphasis on fidelity, structure, and editability.
- It employs multi-granularity metrics including pixel-level similarity, component-level accuracy, holistic VLM judgment, and code-level cleanliness.
- The benchmark enables direct performance comparisons of state-of-the-art models using a diverse dataset of real-world scientific figures and structured evaluation protocols.
VFIG-BENCH is the primary evaluation component of the VFIG framework for complex figure-to-SVG conversion, introduced in "VFIG: Vectorizing Complex Figures in SVG with Vision-LLMs" (He et al., 25 Mar 2026). Its stated purpose is to provide a unified, multi-granularity benchmark for measuring the fidelity, structural integrity, and editability of generated SVG programs for complex scientific figures. Built on the held-out split of VFIG-DATA, which contains 66K high-quality figure-SVG pairs curated from a diverse mix of real-world paper figures and procedurally generated diagrams, the benchmark targets settings in which rasterized scientific figures must be reconstructed as semantically meaningful SVG rather than merely reproduced as images.
1. Position within the VFIG framework
VFIG-BENCH is presented as the primary evaluation suite of the broader VFIG system (He et al., 25 Mar 2026). The surrounding framework addresses figure-to-SVG conversion with a family of Vision-LLMs trained for complex and high-fidelity reconstruction. Within that framework, VFIG-DATA supplies training and held-out evaluation material, the model training follows a coarse-to-fine curriculum, and VFIG-BENCH provides the mechanism for assessing outputs at multiple granularities.
The benchmark is explicitly motivated by the structure of SVG programs and the limitations of prior evaluation setups. The paper states that existing datasets are typically small-scale and lack the complexity of professional diagrams, whereas VFIG-BENCH is designed for complex scientific figures involving flowcharts, neural-network architectures, multi-panel layouts, dense annotations, and precise connectivity. This suggests that the benchmark is intended not only to measure pixel resemblance but also to evaluate whether a generated SVG preserves diagrammatic organization and editability under realistic document-figure conditions.
2. Benchmark corpus and complexity regime
VFIG-BENCH consists of 392 real-world paper figures drawn from the held-out split of VFIG-DATA’s real-world subset (He et al., 25 Mar 2026). The figures cover flowcharts, neural-network architectures, multi-panel layouts, dense annotations, and precise connectivity. The benchmark therefore emphasizes heterogeneous structural phenomena rather than a single diagram family.
The evaluation section also lists two additional benchmarks for cross-evaluation: Molmo2-Diagram with 500 samples and SVG-Diagrams with 474 samples. In the benchmark description, figure diversity spans single-panel to multi-panel layouts, flat and pseudo-3D shapes, variable connection density, and dense text annotations. Complexity is controlled along two axes: element count, and structural composition, the latter described as SVG hierarchical depth.
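The two complexity axes named above, element count and hierarchical depth, can be read directly off an SVG's XML tree. The following is a minimal sketch using only the Python standard library; the function name `svg_complexity` and the choice to exclude the root `<svg>` element from the count are illustrative assumptions, not the benchmark's own definitions.

```python
import xml.etree.ElementTree as ET


def svg_complexity(svg_text: str) -> dict:
    """Return element count and maximum nesting depth of an SVG document.

    Hypothetical helper: the benchmark's exact counting rules are not
    specified here, so this simply walks the parsed XML tree.
    """
    root = ET.fromstring(svg_text)

    def depth(node: ET.Element) -> int:
        # A leaf has depth 1; otherwise one more than its deepest child.
        return 1 + max((depth(child) for child in node), default=0)

    # Count every element below the root <svg> tag (the root is excluded).
    element_count = sum(1 for _ in root.iter()) - 1
    return {"element_count": element_count, "max_depth": depth(root)}


example = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><rect width="10" height="10"/><g><text>A</text></g></g>'
    '<line x1="0" y1="0" x2="5" y2="5"/>'
    "</svg>"
)
stats = svg_complexity(example)
```

Here the nested `<g>` groups drive the depth, while the flat `<line>` only adds to the element count, which is one way to see why the two axes are tracked separately.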
These design choices indicate that VFIG-BENCH treats complexity as both geometric and compositional. A plausible implication is that models are evaluated not only on local primitive recovery but also on whether they can preserve nested layout structure and connector semantics as the number of elements and annotations increases.
3. Metric hierarchy
VFIG-BENCH evaluates generated SVGs through pixel-level, component-level, image-level, and code-level metrics (He et al., 25 Mar 2026). The benchmark description emphasizes that these metrics are complementary and jointly characterize visual fidelity, component correctness, and holistic diagram quality.
| Category | Metrics | Primary target |
|---|---|---|
| Pixel-level | SSIM, LPIPS, VisualSim | Rendered image fidelity |
| Component-level | Shape Composite, Arrow Composite | Rule-based structural correctness |
| Image-level | VLM-Judge | Holistic semantic and structural quality |
| Code-level | Cleanliness, Render Success Rate | SVG validity and semantic primitiveness |
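At the pixel level, the Structural Similarity Index (SSIM) measures local luminance, contrast, and structure agreement between a reference image $I$ and the rendered prediction $\hat{I}$. In its standard form,

$$\mathrm{SSIM}(I, \hat{I}) = \frac{(2\mu_I \mu_{\hat{I}} + c_1)(2\sigma_{I\hat{I}} + c_2)}{(\mu_I^2 + \mu_{\hat{I}}^2 + c_1)(\sigma_I^2 + \sigma_{\hat{I}}^2 + c_2)},$$

where $\mu$ and $\sigma^2$ denote local means and variances, $\sigma_{I\hat{I}}$ is the local covariance, and $c_1$, $c_2$ are small stabilizing constants.

Learned Perceptual Image Patch Similarity (LPIPS) computes a learned distance between deep-feature activations of $I$ and $\hat{I}$. In its standard form,

$$\mathrm{LPIPS}(I, \hat{I}) = \sum_{l} \frac{1}{H_l W_l} \sum_{h,w} \left\lVert w_l \odot \left(\phi_l(I)_{hw} - \phi_l(\hat{I})_{hw}\right) \right\rVert_2^2,$$

where $\phi_l$ denotes the channel-normalized activations of layer $l$, $H_l \times W_l$ is that layer's spatial size, and $w_l$ are learned channel weights.

VisualSim is defined as the average cosine similarity over three image encoders, DINO, CLIP, and SigLIP:

$$\mathrm{VisualSim}(I, \hat{I}) = \frac{1}{3} \sum_{e \in \{\mathrm{DINO},\, \mathrm{CLIP},\, \mathrm{SigLIP}\}} \cos\!\left(f_e(I), f_e(\hat{I})\right),$$

where $f_e$ is the image embedding produced by encoder $e$.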
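The VisualSim averaging step can be sketched as follows, assuming the three encoder embeddings have already been computed elsewhere. The dictionary keys (`"dino"`, `"clip"`, `"siglip"`) and the function names are hypothetical; no encoder is loaded here, and the toy vectors stand in for real embeddings.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def visual_sim(ref: Dict[str, Sequence[float]],
               pred: Dict[str, Sequence[float]]) -> float:
    """Unweighted mean of per-encoder cosine similarities.

    `ref` and `pred` map an encoder name to a precomputed embedding of the
    reference image and of the rendered prediction, respectively.
    """
    encoders = ("dino", "clip", "siglip")  # assumed key names
    return sum(cosine(ref[e], pred[e]) for e in encoders) / len(encoders)


# Toy embeddings: two encoders agree exactly, one sees orthogonal features.
ref_emb = {"dino": [1.0, 0.0], "clip": [0.0, 1.0], "siglip": [1.0, 1.0]}
pred_emb = {"dino": [1.0, 0.0], "clip": [0.0, 1.0], "siglip": [1.0, -1.0]}
score = visual_sim(ref_emb, pred_emb)
```

Averaging across encoders means a disagreement in a single feature space lowers the score without zeroing it, which is the apparent motivation for using three encoders rather than one.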
At the component level, the benchmark applies rule-based metrics when ground-truth structural metadata is available, for example on the programmatic subset. The Shape Composite matches each ground-truth shape to a generated element and scores nine attributes: Label (Lbl), Type (Typ), Fill Color (FC), Fill Style (FS), Stroke Color (SC), Border Style (BS), Position (Pos), Font (Fnt), and Aspect Ratio (AR). The per-attribute scores over this nine-attribute set are combined into a single composite score.
The Arrow Composite matches each ground-truth directed edge to a generated connector and scores seven attributes: Source (Src), Destination (Dst), Arrowhead presence (Hd), Head size (Sz), Curvature (Cv), Color (Col), and Overlap penalty (Ovl). These per-attribute scores are likewise combined into a single composite over the seven-attribute set.
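How per-attribute scores might roll up into a composite can be illustrated with a short sketch. The aggregation used here, an unweighted mean over attributes followed by an unweighted mean over ground-truth elements, is an assumption made for illustration; the benchmark's actual weighting and matching procedure may differ. Matching is taken as already done, and an unmatched ground-truth element is represented by an empty score dictionary.

```python
from typing import Dict, List, Sequence

# Attribute abbreviations as listed in the text.
SHAPE_ATTRS = ("Lbl", "Typ", "FC", "FS", "SC", "BS", "Pos", "Fnt", "AR")
ARROW_ATTRS = ("Src", "Dst", "Hd", "Sz", "Cv", "Col", "Ovl")


def composite(matches: List[Dict[str, float]], attrs: Sequence[str]) -> float:
    """Assumed aggregation: mean over elements of the mean attribute score.

    Each entry of `matches` holds scores in [0, 1] for one ground-truth
    element; missing attributes count as 0.
    """
    if not matches:
        return 0.0
    per_element = [
        sum(m.get(a, 0.0) for a in attrs) / len(attrs) for m in matches
    ]
    return sum(per_element) / len(per_element)


perfect = {a: 1.0 for a in SHAPE_ATTRS}
misplaced = {**perfect, "Pos": 0.0}  # everything right except position
shape_score = composite([perfect, misplaced], SHAPE_ATTRS)
arrow_score = composite([{a: 1.0 for a in ARROW_ATTRS}, {}], ARROW_ATTRS)
```

Under this assumed scheme a single wrong attribute costs one ninth of a shape's score, while a ground-truth arrow with no matching connector contributes zero.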
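At the image level, VFIG-BENCH employs a rubric-based vision-language judge, specified as Gemini-3-Flash or GPT-5.2, to assess holistic semantic and structural correctness on complex real-world figures. For each input image $I$ and rendered output $\hat{I}$, the judge returns four rubric scores: Presence ($s_{\text{pres}}$), Layout ($s_{\text{lay}}$), Connectivity ($s_{\text{conn}}$), and Details ($s_{\text{det}}$). The final VLM-Judge score is their unweighted average:

$$\text{VLM-Judge} = \frac{1}{4}\left(s_{\text{pres}} + s_{\text{lay}} + s_{\text{conn}} + s_{\text{det}}\right).$$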
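The unweighted average of the four rubric scores is simple enough to state in code. The sketch below assumes the judge's reply has already been parsed into a dictionary with the hypothetical keys shown; the prompt, the rubric text, and the score scale are not reproduced here.

```python
from typing import Dict

RUBRIC_KEYS = ("presence", "layout", "connectivity", "details")  # assumed names


def vlm_judge_score(scores: Dict[str, float]) -> float:
    """Unweighted mean of the four rubric scores returned by the judge."""
    missing = [k for k in RUBRIC_KEYS if k not in scores]
    if missing:
        raise KeyError(f"judge response lacks rubric fields: {missing}")
    return sum(scores[k] for k in RUBRIC_KEYS) / len(RUBRIC_KEYS)


judged = vlm_judge_score(
    {"presence": 1.0, "layout": 0.5, "connectivity": 0.75, "details": 0.75}
)
```

Raising on a missing field, rather than silently treating it as zero, keeps a malformed judge reply from being mistaken for a poor reconstruction; that choice is a design suggestion, not something the source specifies.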
At the code level, SVG Cleanliness measures the fraction of semantic primitives versus free-form paths in the generated SVG code. Render Success Rate is defined as the fraction of generated SVG programs that pass XML parsing and render without error under CairoSVG.
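As a rough illustration of the primitives-versus-paths idea, the sketch below counts drawing elements in a parsed SVG. It is a simplified stand-in and not the benchmark's Cleanliness formula: the set of tags treated as semantic primitives and the plain ratio are both assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed set of "semantic primitive" tags; the benchmark's list may differ.
PRIMITIVES = {"rect", "circle", "ellipse", "line", "polyline", "polygon", "text"}


def local_name(tag: str) -> str:
    """Strip an XML namespace prefix such as '{http://www.w3.org/2000/svg}'."""
    return tag.rsplit("}", 1)[-1]


def primitive_fraction(svg_text: str) -> float:
    """Share of drawing elements that are primitives rather than <path>."""
    root = ET.fromstring(svg_text)
    primitives = paths = 0
    for element in root.iter():
        name = local_name(element.tag)
        if name in PRIMITIVES:
            primitives += 1
        elif name == "path":
            paths += 1
    total = primitives + paths
    return primitives / total if total else 0.0


sample_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<rect width="4" height="4"/><circle r="2"/><text>x</text>'
    '<path d="M0 0 L4 4"/>'
    "</svg>"
)
fraction = primitive_fraction(sample_svg)
```

A figure traced entirely as `<path>` outlines would score 0 under this ratio even if it rendered perfectly, which is the editability concern the metric is meant to surface.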
4. Evaluation protocol
The evaluation pipeline begins with a raster input $I$, from which a model generates SVG code $\hat{S}$ (He et al., 25 Mar 2026). The benchmark then extracts the `<svg>…</svg>` block; if the block is missing or unparseable, the sample is marked as a render failure. If parsing succeeds, the generated SVG is rendered with CairoSVG at the original resolution to produce the rendered image $\hat{I}$.
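The extraction-and-parse step can be sketched with the standard library alone. The regular expression and the function name `extract_svg` are illustrative assumptions; the rendering step (for instance with CairoSVG) is deliberately left out so the example has no third-party dependency.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Non-greedy match of the first <svg ...>...</svg> block in raw model output.
# Note: a nested <svg> inside the block would be cut at the first </svg>.
SVG_BLOCK = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)


def extract_svg(model_output: str) -> Optional[str]:
    """Return the SVG block if present and well-formed XML, else None."""
    match = SVG_BLOCK.search(model_output)
    if match is None:
        return None  # no block found: treated as a render failure
    block = match.group(0)
    try:
        ET.fromstring(block)
    except ET.ParseError:
        return None  # malformed XML: also a render failure
    return block


good = extract_svg(
    'Here is the figure:\n<svg xmlns="http://www.w3.org/2000/svg">'
    '<rect width="1" height="1"/></svg>\nDone.'
)
missing = extract_svg("I could not produce an SVG.")
broken = extract_svg("<svg><rect></svg>")
```

Returning `None` for both the missing and the malformed case mirrors the text's policy of marking either outcome as a render failure.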
After rendering, the pipeline computes the pixel-level metrics (SSIM, LPIPS, and VisualSim), followed by a VLM-Judge query that returns the Presence, Layout, Connectivity, and Details scores, from which the final VLM-Judge score is computed. The pipeline also extracts the code-level metrics, namely Cleanliness and Render Success Rate. On programmatic subsets, it additionally computes the component-level Shape Composite and Arrow Composite.
The pseudocode in the benchmark description makes the failure policy explicit. Invalid SVG predictions are recorded with render_ok = False, SSIM = 0, LPIPS = ∞, VisualSim = 0, VLM_Judge = 0, and Clean = 0, after which processing continues to the next sample. All recorded metrics are then aggregated by mean and standard deviation. This makes renderability a first-class property of evaluation rather than a separate preprocessing check.
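The failure policy and the aggregation step described above can be sketched as follows. The record layout follows the field names quoted in the text; the use of population (rather than sample) standard deviation is an assumption, since the source only says "mean and standard deviation".

```python
import math
from typing import Dict, Iterable, List, Tuple

# Values recorded for a prediction that fails to parse or render.
FAILURE_RECORD = {
    "render_ok": False, "SSIM": 0.0, "LPIPS": math.inf,
    "VisualSim": 0.0, "VLM_Judge": 0.0, "Clean": 0.0,
}


def aggregate(records: List[Dict[str, float]],
              keys: Iterable[str]) -> Dict[str, Tuple[float, float]]:
    """Per-metric (mean, population std) over all samples, failures included."""
    if not records:
        raise ValueError("no records to aggregate")
    summary = {}
    for key in keys:
        values = [float(r[key]) for r in records]  # booleans become 0.0 / 1.0
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        summary[key] = (mean, math.sqrt(variance))
    return summary


records = [
    {"render_ok": True, "SSIM": 0.8, "LPIPS": 0.2,
     "VisualSim": 0.9, "VLM_Judge": 0.9, "Clean": 1.0},
    dict(FAILURE_RECORD),  # one invalid SVG prediction
]
summary = aggregate(records, ["render_ok", "SSIM", "LPIPS"])
```

Because failures stay in the sample set, the mean of `render_ok` is the render rate and a failed sample pulls SSIM down directly. One consequence of the stated policy worth noting: a single LPIPS value of infinity makes the plain LPIPS mean infinite in this sketch, so an implementation would need some convention for reporting that column.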
5. Reported results on VFIG-BENCH
The benchmark description reports the following example results on VFIG-BENCH for the Qwen3-VL-4B model after supervised fine-tuning followed by rendering-aware reinforcement learning, denoted SFT+RL (He et al., 25 Mar 2026).
| Metric | Reported value |
|---|---|
| SSIM $\uparrow$ | 0.778 |
| LPIPS $\downarrow$ | 0.212 |
| VisualSim $\uparrow$ | 0.957 |
| VLM-Judge $\uparrow$ | 0.829 |
| Cleanliness $\uparrow$ | 0.853 |
| Render Rate $\uparrow$ | 0.960 |
The same passage states that these numbers demonstrate state-of-the-art performance among open-source models and competitive parity with GPT-5.2, whose VLM-Judge score is given as 0.858, and Gemini-3-Flash, whose VLM-Judge score is given as 0.913. The abstract separately states that VFIG performs on par with GPT-5.2 and achieves a VLM-Judge score of 0.829 on VFIG-BENCH. Taken together, the reported comparison emphasizes that the benchmark is used to position open-source figure-to-SVG systems against both open and proprietary VLM baselines using a common metric suite.
6. Interpretation and scope of assessment
The defining feature of VFIG-BENCH is its combination of rendered-image fidelity, explicit structural matching, holistic judge-based assessment, and code-level validity checks (He et al., 25 Mar 2026). The benchmark summary describes this as a reproducible, fine-grained evaluation suite that holistically assesses the quality of complex figure-to-SVG generation systems.
This structure is important because the benchmark is not limited to a single notion of correctness. Pixel-level agreement can capture local visual resemblance, but VLM-Judge targets semantic and structural properties such as presence, layout, connectivity, and details, while the rule-based composites quantify shape and arrow attributes when structural metadata is available. The Cleanliness metric adds a distinct concern: whether the SVG is expressed through semantic primitives rather than dominated by free-form paths. This suggests that VFIG-BENCH operationalizes editability indirectly through code structure as well as through rendered correctness.
A plausible misconception is that figure-to-SVG evaluation can be reduced to raster reconstruction alone. VFIG-BENCH is explicitly constructed against that reduction: it evaluates whether a model produces parseable SVG, whether the SVG renders successfully under CairoSVG, whether the rendering is visually faithful, whether components and connectors match structured ground truth when available, and whether a VLM judge finds the output semantically and structurally correct. In that sense, the benchmark is tailored to complex scientific figures rather than generic image vectorization.