VFIG: Vision-Language Figure-to-SVG Conversion
- VFIG is a vision-language model family that converts raster figures into editable, semantically rich SVG code for scientific and technical diagrams.
- It leverages a coarse-to-fine supervised curriculum along with reinforcement learning to optimize structural fidelity and render quality.
- Supported by VFIG-DATA with 66K figure–SVG pairs and benchmarked on VFIG-BENCH, VFIG-4B reaches a VLM-Judge score of 0.829, the best reported result among open-source models.
VFIG is a family of vision-language models for figure-to-SVG conversion: given a rasterized figure such as a PNG or JPEG, it generates editable Scalable Vector Graphics code intended to reconstruct the figure while preserving structure, layout, text, and connector semantics. The system is designed for complex scientific and technical figures rather than generic raster tracing, and is introduced together with VFIG-DATA, a dataset of 66K curated figure–SVG pairs, and VFIG-BENCH, an evaluation suite for structural fidelity. Its training recipe combines a coarse-to-fine supervised curriculum with reinforcement learning refinement. The reported VFIG-4B system reaches a VLM-Judge score of 0.829 on VFIG-BENCH, which is state-of-the-art among open-source models and close to GPT-5.2 on the same benchmark (He et al., 25 Mar 2026).
1. Task definition and representational scope
VFIG addresses the problem of recovering structured vector graphics from flattened raster figures. The paper motivates this problem by noting that SVG is resolution-independent, semantically editable, text-based, and widely supported, whereas original vector source files are often unavailable and only raster exports remain (He et al., 25 Mar 2026). The intended figures include scientific diagrams, architecture diagrams, flowcharts, process diagrams, and multi-panel technical illustrations.
The task is formulated as conditional generation. Given an input raster figure image $I$, the model generates an SVG program $S$, using paired training data $\mathcal{D} = \{(I_i, S_i)\}_{i=1}^{N}$.
The output is treated as structured code and generated autoregressively. The model is prompted with instructions such as “Generate the SVG for this figure” or “Convert this figure into valid SVG code.” In deployment, the generated program is expected to contain a valid <svg>...</svg> block.
A central design choice is that VFIG is not aimed at path-heavy raster tracing. The paper emphasizes semantic primitives and structured editability. In both curation and evaluation, SVG elements are grouped into categories including basic primitives (rect, circle, ellipse), connectors (line, polyline), complex shapes (path, polygon), and text. This places the task closer to structured program synthesis than to contour approximation alone.
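The element grouping above can be made concrete with a small counter. This is a minimal sketch, not the paper's code: the category names (`basic`, `connector`, `complex`, `text`) and the function `count_categories` are illustrative assumptions, and the tag list covers only the elements named in the text, which may not be exhaustive.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tag-to-category map following the grouping described in the text.
# The short category labels are hypothetical names chosen for this sketch.
CATEGORIES = {
    "rect": "basic", "circle": "basic", "ellipse": "basic",
    "line": "connector", "polyline": "connector",
    "path": "complex", "polygon": "complex",
    "text": "text",
}


def count_categories(svg_source: str) -> Counter:
    """Count SVG elements per category, ignoring XML namespaces."""
    root = ET.fromstring(svg_source)
    counts = Counter()
    for element in root.iter():
        tag = element.tag.rsplit("}", 1)[-1]  # drop a "{namespace}" prefix if present
        category = CATEGORIES.get(tag)
        if category is not None:
            counts[category] += 1
    return counts


example = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="200" height="80">'
    '<rect x="10" y="10" width="60" height="30"/>'
    '<rect x="120" y="10" width="60" height="30"/>'
    '<line x1="70" y1="25" x2="120" y2="25"/>'
    '<text x="20" y="30">A</text>'
    '</svg>'
)
print(count_categories(example))
```

A two-box diagram with one connector and one label yields counts dominated by basic primitives, whereas a traced SVG would typically consist almost entirely of `path` elements.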
2. VFIG-DATA and data curation
VFIG-DATA combines real-world scientific figures with procedurally generated diagrams, reflecting the paper’s claim that the task is inherently data-driven but poorly served by prior small-scale datasets (He et al., 25 Mar 2026). The dataset contains 66K curated figure–SVG pairs, and the overall training mixture additionally includes 78K filtered examples from existing academic SVG datasets.
The real-world subset is built from Paper2Fig figures and a large-scale crawl of recent arXiv papers. The appendix states that 259,073 arXiv documents after January 2025 were queried, with figures extracted from LaTeX \includegraphics references and embedded PDFs rasterized using PyMuPDF. After image filtering, 45K high-quality figures were retained. Because these figures are often only available as raster assets, the paper uses a two-stage VLM pipeline for SVG creation: a description stage that produces a detailed structured account of shapes, colors, positions, text, styles, and relationships, followed by an SVG generation stage conditioned on both the original image and that description. The selected pipeline is described as Gemini-3-Pro → Gemini-3-Pro, that is, the same model for both stages. In an internal human evaluation over 332 tasks, 1088 responses, and 5 annotators, it was preferred in 88.7% of decisive comparisons over Gemini-3-Pro → GPT-5.1.
The procedural subset is intended to supply precise supervision for attributes that are difficult to recover from raster figures alone. It uses 19 layout templates; with probability 0.3, two templates are combined by splitting the canvas and adding cross-connections. The shape vocabulary contains 18 total shapes: 12 flat shapes and 6 pseudo-3D shapes. Styling includes 7 fill styles, several border styles, randomized stroke widths, rounded corners with probability 0.6, and 1–2 font families sampled from a set of 8. Directed connections are sampled between shapes. Arrows are straight 60% of the time and quadratic Bézier 40% of the time, and are attached to true shape boundaries via geometric intersection or raycasting.
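The connector rules in the procedural subset can be illustrated as follows. This is a sketch under stated assumptions rather than the paper's generator: the function names, the `bend` parameter, and the choice of Bézier control point are hypothetical, and only axis-aligned rectangles are handled. What comes from the text is the 0.6/0.4 split between straight and quadratic Bézier arrows and the idea of attaching endpoints to the shape boundary by geometric intersection.

```python
import random


def sample_arrow_style(rng: random.Random) -> str:
    """Return "straight" with probability 0.6, otherwise "quadratic_bezier"."""
    return "straight" if rng.random() < 0.6 else "quadratic_bezier"


def rect_boundary_point(cx, cy, w, h, tx, ty):
    """Point where the ray from the rectangle centre (cx, cy) toward (tx, ty)
    crosses the boundary of an axis-aligned rectangle of size w x h."""
    dx, dy = tx - cx, ty - cy
    if dx == 0 and dy == 0:
        raise ValueError("target coincides with the rectangle centre")
    # Scale the direction vector until it touches the nearest side.
    scale_x = (w / 2) / abs(dx) if dx != 0 else float("inf")
    scale_y = (h / 2) / abs(dy) if dy != 0 else float("inf")
    t = min(scale_x, scale_y)
    return cx + t * dx, cy + t * dy


def arrow_path(p0, p1, style, bend=0.2):
    """SVG path data for a connector from p0 to p1.

    For the Bezier case, the control point is the segment midpoint pushed
    sideways by `bend` times the segment length (an assumption of this sketch).
    """
    (x0, y0), (x1, y1) = p0, p1
    if style == "straight":
        return f"M {x0:.1f} {y0:.1f} L {x1:.1f} {y1:.1f}"
    mx, my = (x0 + x1) / 2, (y0 + y1) / 2
    qx, qy = mx - bend * (y1 - y0), my + bend * (x1 - x0)
    return f"M {x0:.1f} {y0:.1f} Q {qx:.1f} {qy:.1f} {x1:.1f} {y1:.1f}"


rng = random.Random(0)
start = rect_boundary_point(50, 50, 40, 20, 200, 50)    # leaves the right edge
end = rect_boundary_point(200, 50, 40, 20, 50, 50)      # enters the left edge
print(arrow_path(start, end, sample_arrow_style(rng)))
```

Anchoring endpoints on the boundary rather than at shape centres is what lets the ground-truth SVG express a connection between two shapes exactly, which a raster trace cannot guarantee.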
Filtering is used to suppress unsuitable figures and non-semantic SVG. Image filtering classifies figures into KEEP, IMAGE, MATH, and PLOT, retaining only KEEP. Code filtering computes structural descriptors of each SVG, including the number of text elements, and keeps only SVGs whose descriptors satisfy the filtering criterion. These quantities are later reused in evaluation.
3. Model family and SVG generation pipeline
VFIG is presented as a family of fine-tuned VLMs rather than as a new architecture built from scratch (He et al., 25 Mar 2026). The evaluated backbones include Qwen3-VL-4B, Qwen3-VL-8B, Qwen2.5-VL-3B, and InternVL3.5-4B. The main open-source system is built on Qwen3-VL-4B-Instruct.
The adaptation strategy is deliberately narrow: the paper freezes the vision encoder and multimodal projector, and fine-tunes only the LLM parameters using LoRA. For the main Qwen3-VL-4B configuration, the reported SFT settings include LoRA rank 64, maximum sequence length 8192, image maximum pixels 262,144, learning rate 2e-5, per-device batch size 1, gradient accumulation 16, 3 epochs, cosine scheduling, warmup ratio 0.1, and bf16 precision.
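The reported SFT settings can be collected in one place to make two derived quantities explicit. The container below is a hypothetical convenience class for this article, not part of any training framework; only the field values come from the text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class VfigSftSettings:
    """Reported SFT hyperparameters for the Qwen3-VL-4B configuration.

    The class and method names are hypothetical; the values are those
    listed in the text.
    """
    lora_rank: int = 64
    max_seq_len: int = 8192
    image_max_pixels: int = 262_144
    learning_rate: float = 2e-5
    per_device_batch_size: int = 1
    grad_accum_steps: int = 16
    epochs: int = 3
    lr_scheduler: str = "cosine"
    warmup_ratio: float = 0.1
    precision: str = "bf16"

    def sequences_per_update_per_device(self) -> int:
        """Examples accumulated on one device before each optimizer step."""
        return self.per_device_batch_size * self.grad_accum_steps

    def max_square_side(self) -> int:
        """Side length of the largest square image within the pixel budget."""
        return math.isqrt(self.image_max_pixels)


settings = VfigSftSettings()
print(settings.sequences_per_update_per_device())  # 1 * 16
print(settings.max_square_side())                  # isqrt(262144)
```

With a per-device batch size of 1 and 16 accumulation steps, each device contributes 16 sequences per optimizer step, and the pixel cap corresponds to a 512 × 512 image. The global batch size additionally depends on the number of devices, which is not stated here.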
Inference is autoregressive code generation. The maximum generation length is 8192 tokens, and greedy decoding (do_sample=False) is used by default because SVG is a structured program-synthesis task in which minor token errors can break syntax or renderability. After generation, the <svg>...</svg> content is extracted. The paper does not define a custom SVG grammar or tokenizer; a plausible implication is that task adaptation is achieved primarily through dataset design and objective shaping rather than through a specialized decoder formalism.
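The post-generation extraction step might look like the following. This is an illustrative sketch using only the Python standard library; `extract_svg` is a hypothetical helper, and XML well-formedness is used here as a cheap proxy that does not guarantee the SVG will actually render.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Non-greedy match of the first <svg ...> ... </svg> span in the model output.
SVG_BLOCK = re.compile(r"<svg\b.*?</svg>", re.DOTALL | re.IGNORECASE)


def extract_svg(generated_text: str) -> Optional[str]:
    """Return the first well-formed <svg>...</svg> block, or None.

    None covers both a missing block (for example, output truncated at the
    token limit) and a block that is not well-formed XML.
    """
    match = SVG_BLOCK.search(generated_text)
    if match is None:
        return None
    candidate = match.group(0)
    try:
        ET.fromstring(candidate)
    except ET.ParseError:
        return None
    return candidate


raw = (
    "Here is the converted figure.\n"
    '<svg xmlns="http://www.w3.org/2000/svg" width="10" height="10">'
    '<rect width="5" height="5"/></svg>\n'
    "Let me know if you need changes."
)
print(extract_svg(raw))
```

Because the match is non-greedy, an output containing nested `<svg>` elements would be cut at the first closing tag and rejected by the parser; a production pipeline would need a more careful extractor.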
4. Coarse-to-fine supervised learning and RL refinement
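The supervised stage optimizes the conditional likelihood of the target SVG given the image, written here as the token-level negative log-likelihood

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = -\,\mathbb{E}_{(I, S) \sim \mathcal{D}} \sum_{t=1}^{|S|} \log p_\theta\!\left(s_t \mid s_{<t}, I\right),$$

where $I$ is the input raster figure, $S = (s_1, \dots, s_{|S|})$ is the target SVG token sequence, and $\mathcal{D}$ is the set of paired training examples.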
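The autoregressive likelihood objective can be checked numerically on toy values. The sketch below is pure Python and framework-free; `log_softmax` and `sequence_nll` are illustrative helpers, and the logits are made up rather than produced by any model.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def sequence_nll(step_logits: Sequence[Sequence[float]],
                 target_ids: Sequence[int]) -> float:
    """Sum over positions t of -log p(s_t | s_<t, I).

    step_logits[t] holds the logits the model emits at position t after
    conditioning on the image and the previous target tokens.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("one logit vector is required per target token")
    return -sum(log_softmax(logits)[token]
                for logits, token in zip(step_logits, target_ids))


# A model that is uniform over a 4-token vocabulary pays log(4) per token.
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(sequence_nll(uniform, [0, 1, 2]), 3 * math.log(4))
```

Because the loss sums over every token, a single wrong token costs little during training even though it can break the SVG at inference time, which is one motivation for a later stage that scores the rendered result.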
The distinctive aspect is the curriculum. Training begins on simpler diagrams and primitive-heavy data, then transitions to complex real scientific figures. The paper characterizes this as first learning “atomic primitives” and then refining global diagram composition. It explicitly states that direct training on complex figures from the start often causes unstable convergence, degenerate outputs, and difficulty in learning local primitives and global structure jointly (He et al., 25 Mar 2026).
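A coarse-to-fine split of this kind can be sketched as a simple partition of the training pool. Everything specific here is an assumption made for illustration: the `source` labels, the use of element count as a complexity proxy, and the threshold of 40 are not taken from the paper, which describes the curriculum only qualitatively.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class FigureExample:
    name: str
    source: str        # "procedural" or "real" (hypothetical labels)
    num_elements: int  # hypothetical complexity proxy


def build_curriculum(
    examples: Iterable[FigureExample],
    stage1_max_elements: int = 40,  # assumed threshold, not from the paper
) -> Tuple[List[FigureExample], List[FigureExample]]:
    """Split examples into a simple first stage and a complex second stage.

    Stage 1 holds procedural diagrams and any figure with few elements;
    stage 2 holds the remaining, more complex real figures. Both stages
    are ordered from simple to complex.
    """
    def is_simple(example: FigureExample) -> bool:
        return (example.source == "procedural"
                or example.num_elements <= stage1_max_elements)

    ordered = sorted(examples, key=lambda e: e.num_elements)
    stage1 = [e for e in ordered if is_simple(e)]
    stage2 = [e for e in ordered if not is_simple(e)]
    return stage1, stage2


pool = [
    FigureExample("architecture_overview", "real", 120),
    FigureExample("two_box_flow", "procedural", 15),
    FigureExample("small_pipeline", "real", 25),
]
stage1, stage2 = build_curriculum(pool)
print([e.name for e in stage1], [e.name for e in stage2])
```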
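Reinforcement learning is then applied using Group Relative Policy Optimization (GRPO). For each input $I$, a group of $G$ candidate programs $S_1, \dots, S_G$ is sampled, and the objective takes the standard GRPO form

$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\epsilon, 1+\epsilon)\,\hat{A}_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right), \qquad \rho_i = \frac{\pi_\theta(S_i \mid I)}{\pi_{\theta_{\mathrm{old}}}(S_i \mid I)}.$$

Here $\pi_\theta$ is the current policy, $\pi_{\mathrm{ref}}$ is the frozen SFT policy, $\hat{A}_i$ is the advantage of candidate $i$ relative to its group, and the KL coefficient is $\beta = 0.01$. RL rollouts render sampled SVG candidates with CairoSVG and score them using Gemini-3-Flash as a judge. The reward is the unweighted average of four rubric components,

$$R = \tfrac{1}{4}\left(r_{\mathrm{presence}} + r_{\mathrm{layout}} + r_{\mathrm{connectivity}} + r_{\mathrm{details}}\right),$$

where the terms correspond to presence, layout, connectivity, and details. Malformed or unrenderable SVG receives zero reward.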
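The reward rule and the group-relative normalization that GRPO applies to it can be sketched together. This is a minimal illustration, not the paper's training code: the function names are hypothetical, passing `None` to signal a failed render is a convention of this sketch, and the small `eps` added to the standard deviation is an assumed numerical safeguard.

```python
from statistics import mean, pstdev
from typing import List, Mapping, Optional, Sequence

RUBRIC = ("presence", "layout", "connectivity", "details")


def rubric_reward(scores: Optional[Mapping[str, float]]) -> float:
    """Unweighted mean of the four rubric scores.

    `scores` is None when the candidate SVG was malformed or failed to
    render, in which case the reward is zero.
    """
    if scores is None:
        return 0.0
    return sum(scores[key] for key in RUBRIC) / len(RUBRIC)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy judge outputs for three samples of one prompt; the second failed to render.
group = [
    {"presence": 1.0, "layout": 0.5, "connectivity": 0.5, "details": 0.0},
    None,
    {"presence": 1.0, "layout": 1.0, "connectivity": 1.0, "details": 1.0},
]
rewards = [rubric_reward(s) for s in group]
print(rewards)
print(group_relative_advantages(rewards))
```

Since advantages are computed relative to the group mean, a prompt whose samples all receive the same reward contributes no learning signal, and an unrenderable candidate is pushed down only to the extent that its siblings rendered and scored better.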
The reported RL configuration uses 8 samples per prompt, learning rate 9e-6, entropy bonus 0.001, train batch size 64, validation batch size 64, optimization minibatch size 16, micro-batch per GPU 1, maximum prompt length 9000, maximum response length 8500, maximum model length 17,500, bf16 precision, 4× L40S hardware, and approximately 30 hours of training. On 100 annotated examples, the judge’s overall Pearson correlation with human judgments is reported as 0.89, with category-level correlations of 0.79 for presence, 0.63 for layout, 0.83 for connectivity, and 0.87 for details. This suggests that RL is being driven by a reward that is structurally aligned with the target task rather than by pixel similarity alone.
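For reference, the agreement statistic quoted above is the ordinary Pearson correlation between judge scores and human scores. The sketch below shows the computation on made-up numbers; the toy scores are purely illustrative and are not the paper's annotations.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Illustrative scores only: judge vs. human ratings for five figures.
judge_scores = [0.9, 0.4, 0.7, 0.2, 0.8]
human_scores = [1.0, 0.5, 0.6, 0.1, 0.9]
print(round(pearson(judge_scores, human_scores), 3))
```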
5. VFIG-BENCH and empirical performance
VFIG-BENCH is introduced because existing SVG benchmarks are described as being dominated by icons, emojis, and simple graphics rather than structurally complex scientific figures (He et al., 25 Mar 2026). The benchmark contains 392 realistic scientific figures held out from VFIG-DATA, and the evaluation protocol is complemented by 500 held-out samples from Molmo2-Diagram and 474 official test examples from SVG-Diagrams.
Evaluation is explicitly multi-granular. Rasterized predictions are scored with SSIM and LPIPS. Image-level feature similarity is summarized as VisualSim, defined as the average cosine similarity of DINO, CLIP, and SigLIP image embeddings. Holistic structural quality is measured by VLM-Judge, which is the mean score of Gemini and GPT judges over presence, layout, connectivity, and details. SVG code quality is measured using Clean and Render, the latter being the successful rendering rate.
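The two aggregate scores can be written down directly from their definitions. The sketch below assumes the embeddings and judge scores have already been computed elsewhere; the dictionary keys (`dino`, `clip`, `siglip`, `gemini`, `gpt`) and function names are hypothetical labels, and the toy vectors stand in for real model embeddings.

```python
import math
from typing import Mapping, Sequence

RUBRIC = ("presence", "layout", "connectivity", "details")


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def visual_sim(pred: Mapping[str, Sequence[float]],
               ref: Mapping[str, Sequence[float]]) -> float:
    """Average cosine similarity over the embedding models present in both maps."""
    models = sorted(pred.keys() & ref.keys())
    if not models:
        raise ValueError("no shared embedding models")
    return sum(cosine(pred[m], ref[m]) for m in models) / len(models)


def vlm_judge(scores_by_judge: Mapping[str, Mapping[str, float]]) -> float:
    """Mean over judges of each judge's mean rubric score."""
    per_judge = [sum(s[k] for k in RUBRIC) / len(RUBRIC)
                 for s in scores_by_judge.values()]
    return sum(per_judge) / len(per_judge)


pred = {"dino": [1.0, 0.0], "clip": [0.0, 2.0], "siglip": [1.0, 0.0]}
ref = {"dino": [2.0, 0.0], "clip": [0.0, 1.0], "siglip": [0.0, 1.0]}
judges = {
    "gemini": dict.fromkeys(RUBRIC, 1.0),
    "gpt": dict.fromkeys(RUBRIC, 0.5),
}
print(visual_sim(pred, ref), vlm_judge(judges))
```

Both scores operate on the rendered image, so neither distinguishes a compact semantic SVG from a path-heavy trace; that distinction is left to the code-level Clean metric.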
The main VFIG-BENCH results are summarized below.
| Model | VLM-Judge | Note |
|---|---|---|
| OmniSVG-4B | 0.039 | Open-source baseline |
| Starvector-8B | 0.548 | Open-source baseline |
| Qwen3-VL-4B | 0.466 | Backbone without VFIG training |
| VFIG-4B (SFT) | 0.781 | Two-stage supervised model |
| VFIG-4B (SFT+RL) | 0.829 | Best open-source result |
| GPT-5.2 | 0.858 | Closed-source comparison |
For VFIG-4B (SFT+RL) on VFIG-BENCH, the paper reports SSIM 0.778, LPIPS 0.212, VisualSim 0.957, VLM-Judge 0.829, Clean 0.853, and Render 0.960. On Molmo2-Diagram, the same model achieves VLM-Judge 0.834; on SVG-Diagrams, 0.705. A larger Qwen3-VL-8B variant with two-stage SFT and RL reaches VLM-Judge 0.845 on VFIG-BENCH, with VisualSim 0.960.
The ablations attribute a substantial share of the gain to curriculum and RL. For Qwen3-VL-4B, one-stage SFT yields VLM-Judge 0.712, while two-stage SFT reaches 0.737, and render rate improves from 0.749 to 0.933. On VFIG-BENCH, VFIG-4B (SFT) scores 0.781, and VFIG-4B (SFT+RL) rises to 0.829. Reward ablations show that removing any of the four rubric components harms performance; among the reported variants, omitting layout or details produces some of the largest degradations.
The benchmark also highlights a methodological contrast with classical tracing. VTracer attains SSIM 0.950 and VLM-Judge 0.838 on VFIG-BENCH, but Clean 0.000, because it generates path-heavy non-semantic SVG. This illustrates the paper’s central distinction between visual resemblance and semantic editability.
6. Limitations, scope, and terminological ambiguities
VFIG is most directly targeted at diagram-centric figure reconstruction. The paper identifies scientific diagrams, architecture figures, flowcharts, pipeline diagrams, process illustrations, and technical block diagrams as strong application domains (He et al., 25 Mar 2026). It also states that the training and benchmark distributions are biased toward structured scientific diagrams rather than toward arbitrary vector graphics.
The reported weaknesses are concentrated in fine-grained fidelity. The paper explicitly notes difficulties with thin lines, arrowheads, small text, exact fonts, subtle styling, local geometric details, and 3D-like or perspective-like shapes. Font fidelity is described as consistently weak, and generalization to icons, logos, sketches, object drawings, and sparse geometric art is limited. The reward design is also acknowledged to be imperfect: equal weighting of presence, layout, connectivity, and details may not optimally reflect human preference, and the exploration of hybrid structural-plus-pixel rewards remains limited.
A further source of confusion is nomenclature. In current arXiv literature, VFIG refers to figure-to-SVG conversion with vision-language models (He et al., 25 Mar 2026). It is distinct from GVIF, the “Generative Visual Information Fidelity” metric for generative semantic communications (Huang et al., 15 May 2025); from VFG, the “Variational Flow Graphical Model” (Ren et al., 2022); from VFI systems such as “TLB-VFI” for video frame interpolation (Lyu et al., 7 Jul 2025); and from FIGS, the “Faint Infrared Grism Survey” in astronomy (Pirzkal et al., 2017). Despite the superficial acronym overlap, VFIG occupies a separate research line centered on structured visual program generation and editable vector reconstruction.