CraftEditor: Raster-to-SVG Figure Reconstruction

Updated 4 July 2026

CraftEditor is a system that transforms complex scientific raster figures into structured, coordinate-faithful SVGs for detailed local editing.
It employs a three-phase pipeline—Extraction, Processing, and Composition—to recover semantic components and ensure layout fidelity.
Empirical evaluations show that CraftEditor outperforms two raster-to-editable baselines on all seven judged axes, including position, text, and style, a result the paper attributes to iterative refinement and structured verification.

CraftEditor is a raster-to-vector companion system that converts a raster scientific figure into a coordinate-faithful, editable SVG. It is presented as the “editing half” of the Crafter framework: Crafter generates publication-quality raster figures from diverse inputs, while CraftEditor reconstructs those rasters as structured SVG compositions that can be revised locally. The system is motivated by the observation that scientific figures are structured compositions of discrete semantic components (such as icons, labels, arrows, boxes, panels, and annotations) and that raster outputs, even when visually strong, “cannot be locally revised” without burdensome manual rework (Zhao et al., 28 May 2026).

1. Origins and problem setting

CraftEditor was proposed for a specific scientific-illustration workflow: a generated figure is often mostly correct, but still requires localized edits. The paper identifies recurring needs to fix or replace a label, swap an icon, adjust color schemes, move or resize a component, preserve part of an existing figure while filling in another part, and complete a rough or partial layout. These needs are especially acute for scientific figures because such figures are not undifferentiated images; they are structured layouts whose errors are often local rather than global (Zhao et al., 28 May 2026).

The system is positioned against two limitations of prior scientific figure systems. First, such systems usually target a single figure type under text-only input. Second, even when they produce strong visual outputs, those outputs are rasters, which are hard to revise locally. CraftEditor addresses the second limitation by transforming static figure rasters into editable SVGs, thereby enabling element-level editing instead of total regeneration. The paper frames this not as a cosmetic conversion problem but as a reconstruction problem over semantic parts and layout structure (Zhao et al., 28 May 2026).

This framing distinguishes CraftEditor from conventional vectorization. The goal is not merely to trace edges or produce scalable paths. The goal is to recover a figure as a structured composition whose components can be edited, rearranged, recolored, relabeled, or completed while remaining faithful to the original raster. That emphasis on coordinate-faithful editability is the defining characteristic of the system (Zhao et al., 28 May 2026).

2. Harness architecture and formalization

CraftEditor is implemented as a specialized instance of the same multi-agent harness abstraction used by Crafter. The shared loop is defined over an evolving specification $\mathcal{S}$:

$p_t = \mathcal{D}(\text{input},\;\mathcal{S}_{t-1}), \qquad a_t = \mathcal{E}(p_t),$

$d_t = \mathcal{V}(a_t,\;\text{input},\;\mathcal{S}_{t-1}), \qquad \mathcal{S}_t = \mathcal{R}(d_t,\;\mathcal{S}_{t-1}),$

with final selection

$a^{*}\!=\!\arg\max_\tau\;\mathrm{score}(d_\tau).$

Within this abstraction, $\mathcal{D}$ is the Designer, $\mathcal{E}$ the Executor, $\mathcal{V}$ the Verifier, and $\mathcal{R}$ the Reviser. Both Crafter and CraftEditor therefore use iterative plan–execute–verify–revise behavior, typed corrections rather than unstructured free-text accumulation, and best-so-far reversion to handle non-monotonic refinement (Zhao et al., 28 May 2026).
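The loop and its final selection rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Diagnostic` class, the `run_harness` function, the callable signatures, and the toy stand-ins for the four roles are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Diagnostic:
    """Hypothetical verifier output: a scalar score plus typed corrections."""
    score: float
    corrections: List[str] = field(default_factory=list)


def run_harness(task_input, spec, designer, executor, verifier, reviser, max_rounds):
    """Plan-execute-verify-revise loop that keeps the best-scoring artifact."""
    best_artifact, best_score = None, float("-inf")
    for _ in range(max_rounds):
        plan = designer(task_input, spec)            # p_t = D(input, S_{t-1})
        artifact = executor(plan)                    # a_t = E(p_t)
        diag = verifier(artifact, task_input, spec)  # d_t = V(a_t, input, S_{t-1})
        if diag.score > best_score:                  # a* = argmax_tau score(d_tau)
            best_artifact, best_score = artifact, diag.score
        spec = reviser(diag, spec)                   # S_t = R(d_t, S_{t-1})
    return best_artifact, best_score


# Toy stand-ins (assumptions): the spec is a list of accumulated corrections,
# and the verifier scores are scripted to be non-monotonic across rounds.
scripted_scores = {0: 5.0, 1: 8.0, 2: 6.5}
designer = lambda task_input, spec: len(spec)
executor = lambda plan: f"artifact-{plan}"
verifier = lambda artifact, task_input, spec: Diagnostic(
    scripted_scores[len(spec)], [f"fix-{len(spec)}"]
)
reviser = lambda diag, spec: spec + diag.corrections

best, score = run_harness("figure.png", [], designer, executor, verifier, reviser, 3)
print(best, score)
```

In this toy run the third round scores lower than the second, so the loop returns the second-round artifact. That is the behavior best-so-far reversion is meant to guarantee when refinement is non-monotonic.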

CraftEditor differs from Crafter in the identity of the executor and the artifact type. In Crafter, the executor is an image-generation backend and the artifact is raster imagery. In CraftEditor, the executor is code-generation or element-injection logic and the artifact is SVG code. The paper gives the concrete role mapping for CraftEditor as follows: the Designer is an SVG skeleton generator, the Executor is element-injection code, the Verifier is a hybrid critic, and the Reviser is an SVG editor. CraftEditor is therefore not simply Crafter with an export option; it is a separate harness instantiation specialized for raster-to-SVG reconstruction (Zhao et al., 28 May 2026).

The paper presents CraftEditor as effectively training-free and prompting-based orchestration. It does not describe any dedicated model training, loss function, fine-tuning, or learned optimization specific to CraftEditor. Instead, it relies on prompting, external VLMs and LLMs, image editing services, segmentation and background removal tools, and code generation and revision. A plausible implication is that its main research contribution lies in decomposition, coordination, and verification rather than in a new learned editor backbone (Zhao et al., 28 May 2026).

3. Three-phase raster-to-SVG pipeline

CraftEditor converts a raster figure $a^*$ into an editable SVG $\mathbf{v}$ through three phases: Extraction, Processing, and Composition. The first and third phases instantiate the harness loop directly (Zhao et al., 28 May 2026).

The Extraction phase is an instruction-driven canvas cleaning stage. Its purpose is to separate meaningful visual elements from clutter so that downstream SVG construction begins from clean assets rather than noisy crops. The paper emphasizes that scientific figures, especially posters with 25 to 50 visual assets, contain overlapping elements, text overlays, heterogeneous backgrounds, and semantically mixed regions, making off-the-shelf segmentation unreliable. Instead of one-shot segmentation, CraftEditor uses a loop in which a vision-language Designer inspects the raster and writes a keep/delete plan, an instructable image editor applies the plan at the pixel level, a Verifier inspects the cleaned result, and a Reviser updates the plan from directive diagnostics. The loop is capped at a fixed number of iterations. The appendix provides an example verifier diagnostic: “the bottom-row icons were over-deleted; restore them; remove the page number instead.” After cleaning, per-element assets are cropped, and a hallucination filter removes blank extractions, mismatched extractions, and text-only extractions. On dense posters, extraction convergence is reported as 47% in round 1, 46% in round 2, and 7% in round 3 (Zhao et al., 28 May 2026).
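The hallucination filter at the end of this phase can be illustrated with a small sketch. The paper does not specify how blank, mismatched, or text-only extractions are detected, so everything below is an assumption: the `Crop` record, the near-white blankness test, its thresholds, and the two boolean flags (imagined as outputs of an upstream VLM or OCR check) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Crop:
    """Hypothetical per-element extraction record."""
    name: str
    pixels: List[List[int]]   # grayscale values 0-255, row-major
    caption_matches: bool     # assumed upstream check: crop matches its description
    text_only: bool           # assumed upstream check: crop contains only text


def is_blank(pixels, background=255, tolerance=8, min_foreground_ratio=0.01):
    """Treat a crop as blank when almost every pixel is close to the background."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        return True
    foreground = sum(
        1 for row in pixels for value in row if abs(value - background) > tolerance
    )
    return foreground / total < min_foreground_ratio


def hallucination_filter(crops):
    """Drop blank, mismatched, and text-only extractions; keep the rest in order."""
    return [
        crop for crop in crops
        if not is_blank(crop.pixels) and crop.caption_matches and not crop.text_only
    ]


crops = [
    Crop("icon", [[255, 0], [0, 255]], caption_matches=True, text_only=False),
    Crop("blank", [[255, 255], [254, 255]], caption_matches=True, text_only=False),
    Crop("mismatch", [[0, 0], [0, 0]], caption_matches=False, text_only=False),
    Crop("label", [[0, 255], [255, 0]], caption_matches=True, text_only=True),
]
print([crop.name for crop in hallucination_filter(crops)])
```

Only the first crop survives in this example. A real filter would presumably operate on color images and model-based judgments rather than fixed thresholds.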

The Processing phase performs captioning, grounding or referring, and vector-versus-raster classification over each extracted element. Although this stage is described more briefly than the other two, it is structurally important because it constructs the semantic inventory used in composition. The paper implies intermediate representations that include a cleaned canvas, per-element asset inventory, element captions or semantic descriptions, element grounding information, vector/raster type labels, and an SVG skeleton with placeholders. This means that CraftEditor is not merely tracing geometry; it is assembling a component-level representation of the figure (Zhao et al., 28 May 2026).
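One way to picture the semantic inventory is as a list of typed records from which skeleton placeholders are derived. The sketch below is illustrative only: the field names, the `ElementKind` enum, and the choice of a `<rect>` with a `data-kind` attribute as the placeholder are assumptions, since the paper only implies these intermediate representations.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class ElementKind(Enum):
    VECTOR = "vector"
    RASTER = "raster"


@dataclass
class Element:
    """Hypothetical inventory entry produced by the Processing phase."""
    element_id: str
    caption: str                              # semantic description
    bbox: Tuple[float, float, float, float]   # grounding: x, y, width, height
    kind: ElementKind                         # vector-versus-raster label
    asset_path: str                           # cropped asset from Extraction


def skeleton_placeholders(elements: List[Element]) -> List[str]:
    """Emit one placeholder rectangle per element at its grounded position."""
    lines = []
    for element in elements:
        x, y, w, h = element.bbox
        lines.append(
            f'<rect id="{element.element_id}" data-kind="{element.kind.value}" '
            f'x="{x:g}" y="{y:g}" width="{w:g}" height="{h:g}" fill="none"/>'
        )
    return lines


inventory = [
    Element("icon-1", "brain icon", (10.0, 20.0, 64.0, 64.0),
            ElementKind.RASTER, "assets/icon-1.png"),
]
print(skeleton_placeholders(inventory)[0])
```

Keeping the grounding box and the type label on the same record is what lets a later stage decide, per element, whether to redraw it as vector geometry or embed the cropped asset.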

The Composition phase is an iterative SVG assembly process. It is introduced because a one-shot LLM call to write an SVG from an element inventory tends to fail in predictable ways, including wrong grid topology, wrong arrow endpoints, text labels that disagree with the input raster, and general layout mismatch. CraftEditor addresses this with a second harness loop. The Designer generates two candidate SVG skeletons at decoding temperatures 0.20 and 0.45. A convergence judge selects the better skeleton using rapid visual comparison. The Executor then splices extracted assets into placeholders. The Verifier is a hybrid critic with two channels: a VLM channel for global layout fidelity and semantic correspondence to the original raster, and programmatic checkers for structural properties that VLMs often miss, including text overflow, arrow-endpoint accuracy, element overlap, and missing components. The appendix reports per-axis critic signals including text presence, arrow endpoints, layout consistency, and color drift. The Reviser then modifies the SVG source code according to the diagnostic. The loop runs for a bounded number of rounds and uses best-so-far reversion because refinement is non-monotonic; the appendix states that without best-so-far reversion, about 30% of refinement iterations score lower than the immediately preceding iteration (Zhao et al., 28 May 2026).
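The programmatic channel of the hybrid critic lends itself to a short sketch. The four checks below correspond to the structural properties named above, but their concrete form is an assumption: the bounding-box representation, the 4-unit endpoint tolerance, the character-width heuristic for text overflow, and all function names are hypothetical rather than taken from the paper.

```python
from itertools import combinations
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height) in SVG user units


def boxes_overlap(a: Box, b: Box) -> bool:
    """Axis-aligned boxes overlap when they intersect on both axes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def point_to_box_distance(point, box: Box) -> float:
    """Euclidean distance from a point to a box (zero if the point is inside)."""
    px, py = point
    x, y, w, h = box
    dx = max(x - px, 0.0, px - (x + w))
    dy = max(y - py, 0.0, py - (y + h))
    return (dx * dx + dy * dy) ** 0.5


def text_overflows(text, font_size, box_width, char_width_ratio=0.6) -> bool:
    """Rough width estimate: characters * font size * an assumed average ratio."""
    return len(text) * font_size * char_width_ratio > box_width


def structural_diagnostics(boxes, arrows, labels, expected_ids, tolerance=4.0):
    """Return a list of human-readable issues for the Reviser to act on."""
    issues = []
    for missing in sorted(set(expected_ids) - set(boxes)):
        issues.append(f"missing component: {missing}")
    for (id_a, box_a), (id_b, box_b) in combinations(sorted(boxes.items()), 2):
        if boxes_overlap(box_a, box_b):
            issues.append(f"overlap: {id_a} and {id_b}")
    for endpoint, target_id in arrows:
        if target_id in boxes and point_to_box_distance(endpoint, boxes[target_id]) > tolerance:
            issues.append(f"arrow endpoint misses {target_id}")
    for box_id, (text, font_size) in sorted(labels.items()):
        if box_id in boxes and text_overflows(text, font_size, boxes[box_id][2]):
            issues.append(f"text overflow in {box_id}")
    return issues


boxes = {"a": (0, 0, 10, 10), "b": (5, 5, 10, 10), "c": (30, 0, 40, 10)}
arrows = [((25, 5), "c"), ((28, 5), "c")]   # (endpoint, intended target id)
labels = {"c": ("Encoder", 12)}             # box id -> (text, font size)
for issue in structural_diagnostics(boxes, arrows, labels, ["a", "b", "c", "d"]):
    print(issue)
```

Checks of this kind are deterministic and cheap, which is presumably why they complement a VLM channel that judges global layout but can miss small geometric errors.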

4. Editability, semantic components, and SVG output

CraftEditor’s notion of editability depends on recovering semantic components rather than flattening the figure into a single embedded image. The paper states that semantic parts are identified through the combination of instruction-driven cleaning, per-element cropping, hallucination filtering, captioning, grounding, and vector/raster classification. Those parts are then placed into an SVG skeleton and iteratively repaired (Zhao et al., 28 May 2026).

The resulting SVG is intended to support element-level editing such as swapping icons, adjusting labels, changing colors, rearranging components, and completing partial diagrams. The evaluation axes further imply explicit reconstruction of positions, colors, text, icons, arrows, and style. This makes the output suitable for local revision in a way that raster outputs are not (Zhao et al., 28 May 2026).

The paper is careful, however, not to overclaim the scope of editability. Because the processing stage classifies elements as vector or raster, some components may remain embedded raster assets inside the SVG rather than being decomposed into pure vector primitives. “Editable SVG” therefore denotes a structured SVG composition, not a guarantee that every visual element has been reauthored as native vector geometry. This is a crucial distinction: editability is structural and local, but not necessarily uniform at the primitive-path level (Zhao et al., 28 May 2026).
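The distinction can be made concrete with a small hybrid SVG built using Python's standard `xml.etree.ElementTree`. The sketch is illustrative and assumes a particular layout: the element ids, the asset path `assets/icon-brain.png`, and the helper names are hypothetical. One component is native vector geometry with an editable label; the other is an embedded raster asset that can be moved or swapped as a unit but not edited at the path level.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def tag(name: str) -> str:
    """Qualify a tag name with the SVG namespace."""
    return f"{{{SVG_NS}}}{name}"


def build_hybrid_svg(width: int, height: int) -> ET.Element:
    """Build an SVG with one native-vector component and one embedded raster."""
    ET.register_namespace("", SVG_NS)
    root = ET.Element(tag("svg"), {
        "width": str(width), "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    # Native vector component: geometry and text are individually editable.
    group = ET.SubElement(root, tag("g"), {"id": "box-encoder"})
    ET.SubElement(group, tag("rect"), {
        "x": "10", "y": "10", "width": "80", "height": "40",
        "fill": "#dbeafe", "stroke": "#1e3a8a",
    })
    label = ET.SubElement(group, tag("text"), {"x": "50", "y": "35", "text-anchor": "middle"})
    label.text = "Encoder"
    # Raster component: positioned and replaceable as a whole, not path-editable.
    ET.SubElement(root, tag("image"), {
        "id": "icon-brain", "href": "assets/icon-brain.png",
        "x": "120", "y": "10", "width": "40", "height": "40",
    })
    return root


def relabel(root: ET.Element, group_id: str, new_text: str) -> bool:
    """Local edit: change the label of one component and touch nothing else."""
    for group in root.iter(tag("g")):
        if group.get("id") == group_id:
            group.find(tag("text")).text = new_text
            return True
    return False


svg = build_hybrid_svg(200, 100)
relabel(svg, "box-encoder", "Decoder")
print(ET.tostring(svg, encoding="unicode"))
```

The relabel touches a single `<text>` node, which is the kind of local revision a flat raster cannot support, while the `<image>` element stays opaque below the component level.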

A plausible implication is that CraftEditor occupies a middle ground between raster preservation and full semantic redrawing. It preserves enough structure to enable local revision while tolerating hybrid outputs in which some elements remain raster. That compromise appears central to its practical treatment of complex scientific figures (Zhao et al., 28 May 2026).

5. Empirical evaluation and ablation results

CraftEditor is evaluated on a random held-out subset of 80 rasters sampled from Crafter’s outputs across PaperBanana-Bench and CraftBench, balanced across academic figures, posters, and infographics. Two raster-to-editable baselines are compared. Edit-Banana uses a SAM-based segmentation pipeline and writes DrawIO cells. AutoFigure-Edit detects icons with SAM and emits a full SVG in a single LLM call. All methods receive the same input raster and are run with public default settings (Zhao et al., 28 May 2026).

Evaluation uses a three-VLM ensemble judging rendered SVG outputs against the original input raster on a 0–10 rubric with seven axes: position, color, text, icon, arrow, style, and overall. The appendix names the judges as Gemini 3.1 Flash-Lite, GPT-5.4, and Doubao-Seed-2.0-Pro. Each judge receives the input raster and rendered SVG, returns JSON with per-axis scores and issues, uses temperature 0.15 and an output cap of 4,000 tokens, and is re-queried once if it assigns overall below 3.0, with the retry replacing the original if higher (Zhao et al., 28 May 2026).
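The retry rule and ensemble can be sketched as follows. Two points are assumptions rather than details from the paper: per-axis scores are averaged across judges with an arithmetic mean, and each judge is modeled as a plain callable returning a dictionary of axis scores. The scripted judges stand in for real model calls.

```python
from statistics import mean

AXES = ("position", "color", "text", "icon", "arrow", "style", "overall")


def judge_with_retry(judge, raster, svg_render, threshold=3.0):
    """Query once; if overall < threshold, re-query once and keep the higher result."""
    first = judge(raster, svg_render)
    if first["overall"] >= threshold:
        return first
    second = judge(raster, svg_render)
    return second if second["overall"] > first["overall"] else first


def ensemble_scores(judges, raster, svg_render):
    """Average per-axis scores across judges (the mean is an assumption)."""
    results = [judge_with_retry(judge, raster, svg_render) for judge in judges]
    return {axis: mean(result[axis] for result in results) for axis in AXES}


def scripted_judge(overalls):
    """Hypothetical stand-in for a VLM judge that returns scripted scores in order."""
    remaining = iter(overalls)

    def judge(raster, svg_render):
        value = next(remaining)
        return {axis: value for axis in AXES}

    return judge


judges = [scripted_judge([8.0]), scripted_judge([2.0, 6.0]), scripted_judge([7.0])]
print(ensemble_scores(judges, "input.png", "render.png")["overall"])
```

Here the second judge's initial 2.0 triggers one retry, the retry's 6.0 replaces it, and the ensemble averages 8.0, 6.0, and 7.0.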

Quantitatively, CraftEditor scores 8.10 on position, 8.34 on color, 7.61 on text, 8.07 on icon, 7.83 on arrow, 8.12 on style, and 8.04 overall. AutoFigure-Edit scores 6.92, 7.41, 6.04, 6.78, 6.39, 7.00, and 6.91 respectively. Edit-Banana scores 4.21, 4.93, 2.86, 4.97, 4.18, 4.32, and 3.69. CraftEditor is therefore reported as best on every axis, with the strongest gains on structural dimensions, especially text and arrows, which aligns with the explicit use of programmatic structural checkers (Zhao et al., 28 May 2026).
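The claim about where the gains concentrate can be checked directly from the reported numbers. The short script below computes, for each axis, CraftEditor's margin over the stronger of the two baselines and ranks the axes by that margin; the variable names are illustrative, and the scores are those quoted above.

```python
AXES = ["position", "color", "text", "icon", "arrow", "style", "overall"]

craft_editor = [8.10, 8.34, 7.61, 8.07, 7.83, 8.12, 8.04]
autofigure_edit = [6.92, 7.41, 6.04, 6.78, 6.39, 7.00, 6.91]
edit_banana = [4.21, 4.93, 2.86, 4.97, 4.18, 4.32, 3.69]

# Margin over the strongest baseline on each axis, rounded to two decimals.
gains = {
    axis: round(ours - max(auto, banana), 2)
    for axis, ours, auto, banana in zip(AXES, craft_editor, autofigure_edit, edit_banana)
}
ranked = sorted(gains, key=gains.get, reverse=True)

for axis in ranked:
    print(f"{axis:9s} +{gains[axis]:.2f}")
```

Measured against the strongest baseline per axis, text (+1.57) and arrow (+1.44) show the largest margins and color (+0.93) the smallest, consistent with the paper's reading.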

Two ablations are reported. Removing agentic cleaning yields 7.84 position, 8.12 color, 7.32 text, 7.69 icon, 7.55 arrow, 7.83 style, and 7.71 overall, an overall drop of 0.33 relative to the full system. Removing iterative composition yields 6.05 position, 6.41 color, 5.32 text, 5.94 icon, 5.71 arrow, 6.10 style, and 5.89 overall, an overall drop of 2.15. The paper concludes that both designs are jointly necessary, with iterative composition the more important contributor (Zhao et al., 28 May 2026).
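The per-axis effect of each ablation follows from the same arithmetic. The snippet below recomputes the drops from the reported scores and identifies the axis each ablation hurts most; the dictionary keys are illustrative labels, not the paper's terminology.

```python
AXES = ["position", "color", "text", "icon", "arrow", "style", "overall"]

full_system = [8.10, 8.34, 7.61, 8.07, 7.83, 8.12, 8.04]
ablations = {
    "no agentic cleaning": [7.84, 8.12, 7.32, 7.69, 7.55, 7.83, 7.71],
    "no iterative composition": [6.05, 6.41, 5.32, 5.94, 5.71, 6.10, 5.89],
}

# Drop relative to the full system, per axis, rounded to two decimals.
drops = {
    name: {axis: round(full - ablated, 2)
           for axis, full, ablated in zip(AXES, full_system, scores)}
    for name, scores in ablations.items()
}

for name, per_axis in drops.items():
    component_axes = [axis for axis in AXES if axis != "overall"]
    worst = max(component_axes, key=per_axis.get)
    print(f"{name}: overall -{per_axis['overall']:.2f}, "
          f"largest component drop on {worst} (-{per_axis[worst]:.2f})")
```

On these numbers, removing cleaning hurts the icon axis most (0.38), while removing iterative composition hurts text most (2.29), and the overall drop from the second ablation is roughly six and a half times that of the first.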

The appendix further reports that CraftEditor wins or ties the no-agentic-cleaning ablation in 11 of 12 source categories, with the only exception being a 3-sample text-to-image infographic subset where the difference is within noise. The inference setup also reports an average cost of \$0.85 per conversion, with most cost attributed to LLM output tokens during iterative SVG refinement (Zhao et al., 28 May 2026).

6. Scope, limitations, and position within editor research

The paper identifies several constraints. CraftEditor depends on strong proprietary models and services; its headline results rely on closed-source models and are exposed to the biases of the VLM judges. Cost and latency are nontrivial because the system performs multiple agentic rounds. Extraction remains difficult on cluttered figures. Evaluation is limited to a held-out subset of 80 Crafter-generated rasters rather than arbitrary internet figures at scale. The study evaluates fidelity of editable output with VLM judges rather than downstream human editing efficiency, and there is no explicit human editing usability study (Zhao et al., 28 May 2026).

Within the broader editor literature, CraftEditor is notable for making explicit intermediate structure the substrate of revision. Adjacent systems in other modalities adopt related patterns. SceneCraft makes the scene graph the canonical editable state for complex image editing (Phan et al., 15 Jun 2026). “Hybrid structured editing” presents tools with a structured interface while users retain a textual editing interface (Beckmann et al., 5 Mar 2026). “Deuce” overlays structure-aware selections on top of plain text and uses transformation menus driven by those selections (Hempel et al., 2017). “Crayotter” externalizes coverage reports, blueprints, tool calls, and intermediate renders so failed video segments can be selectively revised rather than restarted (Yan et al., 31 May 2026). This suggests a broader editor design pattern: recover or externalize structure that the underlying medium normally hides, then make revision act on that structure rather than on a monolithic artifact.

CraftEditor’s specific contribution within that pattern is to treat the scientific figure raster as a source of recoverable semantic layout. It does so through instruction-driven extraction, semantic processing, iterative SVG composition, hybrid visual and programmatic verification, and SVG-source revision. In the paper’s formulation, that is what turns a good-looking but frozen raster into a structurally faithful figure that can actually be worked on (Zhao et al., 28 May 2026).