Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Published 28 May 2026 in cs.CV, cs.AI, and cs.CL | (2605.30611v1)

Abstract: Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

Summary

  • The paper presents a structured multi-agent harness called Crafter to generate editable scientific figures from diverse inputs.
  • It combines iterative plan generation, structured critique, and specification refinement to achieve robust performance improvements.
  • The approach also introduces CraftEditor for converting raster figures to editable SVGs, ensuring semantic accuracy and layout fidelity.

Harnessing Multi-Agent Orchestration for Editable Scientific Figure Generation

Overview

"Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs" (2605.30611) introduces a unified, harness-based architecture for generating and subsequently editing scientific figures across diverse types and conditioning regimes. The paper addresses gaps in prior work: limited generalization across figure types and input conditions, and the typical absence of structural editability in outputs. It does so by combining two core systems (Crafter, a multi-agent orchestrator for generation, and CraftEditor, a similarly harnessed system for editable raster-to-SVG conversion), both underpinned by structured, iterative correction and verification strategies operating over an evolving specification.

Motivation and Problem Formulation

Automated scientific figure creation remains essential but underdeveloped relative to advances in text-to-image synthesis. The scientific illustration domain presents three distinctive challenges:

  1. Combinatorial diversity: Figures span diagrams, posters, and infographics, and nearly always require iterative, multimodal conditioning.
  2. Structural precision: Layouts are complex, semantically dense, and require high reliability in the rendering and placement of discrete components (e.g., labeled modules, arrows, icons).
  3. Editability: Real-world workflows necessitate local, element-wise revision, which raster-only outputs and most current pipelines cannot support.

Prior approaches have either focused narrowly on text-to-raster pipelines (e.g., [zhu2026paperbanana], [zhu2026autofigure]) or code-based generation (e.g., TikZ), each failing to generalize, integrate multi-modal inputs, or preserve downstream editability. Moreover, output images are static, and subsequent refinement via prompt or code revisions introduces inconsistency and prompt bloat, leading to semantic drift and sub-optimal results.

Crafter: Multi-Agent Harness Architecture

The Crafter harness establishes a structured, agentic orchestration loop around any black-box figure executor (e.g., advanced diffusion models). The system is built upon five loosely coupled agents sharing a structured, evolving memory specification. The pipeline comprises:

  • Intent Reasoner: Identifies communicative goals and seeds the shared specification $\mathcal{S}$, abstracting the user's high-level objective from the provided context and conditioning documents.
  • Plan Generator $\mathcal{D}$: Proposes $K$ intent-conditioned candidate layout plans to maximize initial diversity over semantically plausible renderings.
  • Image Generator $\mathcal{E}$: Renders all $K$ plans using an underlying image synthesis backend.
  • Critic $\mathcal{V}$: Evaluates each candidate with directive diagnostics along multiple axes (content accuracy, layout coherence, labeling, etc.) instead of scalar ratings.
  • Specification Refiner $\mathcal{R}$: Integrates the critic's structured edits (typed operations) into the master specification, maintaining prompt coherence and preventing the contradictory feedback found in free-text revision paradigms.
  • Convergence Judge: Mediates loop termination, accepting satisfactory candidates early or reverting to the best prior output to mitigate non-monotonic iterative refinement behavior.

    Figure 1: Crafter architecture as a multi-agent loop for iterative plan generation, plan execution, structured critique, and spec-driven revision, over a persistent shared specification.
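
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Spec`, `Critique`, and `run_harness`, the `accept_at` threshold, and the toy agents are hypothetical stand-ins for the LLM/VLM agents and the image backend. The sketch also reduces the critic's output to a scalar `score` purely to rank candidates, whereas the paper's critic is described as issuing directive diagnostics rather than scalar ratings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Spec:
    """Shared specification: the intent plus an ordered list of constraints."""
    intent: str
    constraints: List[str] = field(default_factory=list)


@dataclass
class Critique:
    score: float       # assumption: a scalar used only to rank candidates
    edits: List[str]   # directive fixes proposed by the critic


def run_harness(
    context: str,
    plan: Callable[[Spec, int], List[str]],      # Plan Generator
    render: Callable[[Spec, str], str],          # Image Generator (black box)
    critique: Callable[[Spec, str], Critique],   # Critic
    k: int = 3,
    max_rounds: int = 3,
    accept_at: float = 0.9,
) -> Tuple[str, float]:
    spec = Spec(intent=context)                  # Intent Reasoner (trivial here)
    best_image, best_score = "", float("-inf")
    for _ in range(max_rounds):
        # Branch over K candidate plans, render each one, and critique it.
        reviews = []
        for layout in plan(spec, k):
            image = render(spec, layout)
            reviews.append((image, critique(spec, image)))
        image, review = max(reviews, key=lambda r: r[1].score)
        # Convergence Judge: remember the best output seen so far, so a
        # worse round can never replace a better earlier one.
        if review.score > best_score:
            best_image, best_score = image, review.score
        if best_score >= accept_at:
            break
        # Specification Refiner: fold the critic's edits into the spec,
        # skipping duplicates so the spec does not bloat.
        for edit in review.edits:
            if edit not in spec.constraints:
                spec.constraints.append(edit)
    return best_image, best_score


# Toy agents (hypothetical): quality improves with the layout index and
# with the number of constraints already folded into the spec.
def toy_plan(spec: Spec, k: int) -> List[str]:
    return [f"layout-{i}" for i in range(k)]


def toy_render(spec: Spec, layout: str) -> str:
    return f"{layout}|{len(spec.constraints)}"


def toy_critique(spec: Spec, image: str) -> Critique:
    layout, fixes = image.split("|")
    score = min(1.0, 0.5 + 0.125 * int(layout.split("-")[1]) + 0.25 * int(fixes))
    return Critique(score=score, edits=[f"fix-{fixes}"])
```

With the toy agents, `run_harness("pipeline overview", toy_plan, toy_render, toy_critique)` needs two rounds: the first round's best candidate falls below the threshold, its edit is folded into the spec, and the second round's best candidate is accepted.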

Diversity-Driven Plan Exploration

Rather than rely on single-shot sampling, which often traps the process in compositional local minima, the system branches at the plan level, invoking the generator on parallel candidate framings. This not only exploits the variance inherent in advanced diffusion models but also allows escape from critical initial planning mistakes (e.g., a faulty choice between grid and columnar splitting, misassigned semantic zones), with $K$ selected dynamically per instance.

Structured Corrective Layer and Verify-then-Refine Loop

A key innovation is the exclusive use of typed, structured specification edits based on the critic's diagnostics, as opposed to unstructured, accumulating prompt addenda. The approach enforces internal spec consistency, localizes and corrects artifacts (e.g., garbled or missing labels, misroutes), and supports closed-loop, explainable iterative refinement for up to $T$ rounds or until convergence.
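
To make the contrast with free-text prompt addenda concrete, here is a small sketch of what typed edits over a specification could look like. The edit types (`SetField`, `RemoveElement`), the dictionary-based spec layout, and `apply_edits` are illustrative assumptions; the summary does not state the paper's actual edit vocabulary.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class SetField:
    """Set or overwrite one attribute of one element (hypothetical edit type)."""
    element: str
    key: str
    value: str


@dataclass(frozen=True)
class RemoveElement:
    """Delete an element from the specification (hypothetical edit type)."""
    element: str


Edit = Union[SetField, RemoveElement]
Spec = Dict[str, Dict[str, str]]  # element id -> attribute map


def apply_edits(spec: Spec, edits: List[Edit]) -> Spec:
    """Return a new spec with the edits applied in order; the input is untouched."""
    out = {name: dict(attrs) for name, attrs in spec.items()}
    for edit in edits:
        if isinstance(edit, SetField):
            # A later edit to the same field overwrites the earlier one,
            # so contradictory instructions cannot pile up.
            out.setdefault(edit.element, {})[edit.key] = edit.value
        elif isinstance(edit, RemoveElement):
            out.pop(edit.element, None)
        else:
            raise TypeError(f"unknown edit type: {edit!r}")
    return out
```

Because each edit targets a named field, two successive font-size corrections leave exactly one value in the spec, whereas appending both as prose would leave the generator with two conflicting instructions.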

CraftEditor: Raster-to-Editable SVG Conversion

For genuine post-hoc editability, raster outputs must be converted into vector (SVG) representations. CraftEditor instantiates the same harness abstraction in a three-phase pipeline:

  1. Extraction: Replaces segmentation with instruction-driven cleaning to isolate semantically salient elements, guided by a vision-language agent.
  2. Processing: Each asset is captioned, grounded, and classified as vector/raster, followed by per-element cropping.
  3. Composition: Candidate SVG skeletons are generated, assets are injected, and a hybrid critic (VLM plus programmatic checks) drives iterative, multi-round correction (e.g., arrow alignment, text bounding, element overlap), supporting robust and editable SVG assembly.

    Figure 2: CraftEditor: sequential extraction, classification, and composition harnesses for converting raster figures to coordinate-faithful, editable SVGs.
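
The programmatic half of the hybrid critic can be pictured as simple geometric audits over element bounding boxes. The sketch below shows two such checks, element overlap and arrow endpoints that land on no element. The `Box` type, the function names, and the pixel tolerance are assumptions made for illustration; the summary does not describe CraftEditor's actual checkers.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]
Arrow = Tuple[str, Point, Point]  # (name, start, end)


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one SVG element (hypothetical type)."""
    name: str
    x: float
    y: float
    w: float
    h: float

    def touches(self, p: Point, tol: float) -> bool:
        """True if point p lies inside the box, grown by tol on every side."""
        return (self.x - tol <= p[0] <= self.x + self.w + tol
                and self.y - tol <= p[1] <= self.y + self.h + tol)


def overlapping_pairs(boxes: List[Box]) -> List[Tuple[str, str]]:
    """Names of box pairs whose interiors intersect."""
    return [(a.name, b.name) for a, b in combinations(boxes, 2)
            if a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h]


def dangling_arrows(arrows: List[Arrow], boxes: List[Box],
                    tol: float = 2.0) -> List[str]:
    """Names of arrows with at least one endpoint that touches no box."""
    return [name for name, start, end in arrows
            if not all(any(b.touches(p, tol) for b in boxes)
                       for p in (start, end))]
```

Findings from checks like these could be handed back to the composition step as concrete corrections (for example, "move element C off element A"), which is the role the text assigns to the critic in the multi-round loop.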

Benchmarking with CraftBench

To quantify cross-type and cross-condition generalization, the paper introduces CraftBench. This 279-sample benchmark spans three figure types and four input conditions: text-to-image, mask-completion, key-element composition, and sketch-conditioned generation. Sourced from arXiv, conference posters, and research blogs, each sample is manually curated and passes a multi-stage quality pipeline with graduate-level annotation.

Figure 3: Representative tasks from CraftBench, illustrating diversity in both input conditioning and figure type.
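
The benchmark's two axes form a 3 x 4 grid of evaluation cells. The sketch below models that grid and tallies how many samples fall into each cell. The `Sample` record and `coverage` helper are hypothetical and do not reflect the released data format; the summary also does not say how the 279 samples are distributed across cells.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from itertools import product
from typing import Dict, Iterable, Tuple


class FigureType(Enum):
    DIAGRAM = "academic diagram"
    POSTER = "poster"
    INFOGRAPHIC = "infographic"


class InputCondition(Enum):
    TEXT_TO_IMAGE = "text-to-image"
    MASK_COMPLETION = "mask-completion"
    KEY_ELEMENT = "key-element composition"
    SKETCH = "sketch-conditioned"


@dataclass(frozen=True)
class Sample:
    """One benchmark item (hypothetical record layout)."""
    sample_id: str
    figure_type: FigureType
    condition: InputCondition


Cell = Tuple[FigureType, InputCondition]


def coverage(samples: Iterable[Sample]) -> Dict[Cell, int]:
    """Count samples per (figure type, input condition) cell, including empty cells."""
    counts = Counter((s.figure_type, s.condition) for s in samples)
    return {cell: counts.get(cell, 0) for cell in product(FigureType, InputCondition)}
```

Reporting scores per cell rather than as one aggregate makes it visible whether a method generalizes across the whole grid or only wins in a few cells.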

Experimental Evaluation

Crafter is evaluated against both vanilla generators (GLM-Image, Qwen-Image, GPT-Image-2, Nano Banana 2/Pro) and prominent agentic frameworks (PaperBanana, AutoFigure) on PaperBanana-Bench and CraftBench, utilizing referenced VLM-as-judge protocols. Key results:

  • Crafter achieves state-of-the-art performance, improving over its image-generation backbone by $27.6$ to […] points across benchmarks and tasks (e.g., […] vs. […] with Nano Banana 2 on PaperBanana-Bench).
  • Generalization: Crafter outperforms all baselines across all figure types and input conditions, not just in a narrow regime.
  • Ablation studies confirm the necessity of each harness mechanism, with the elimination of plan diversity or the corrective layer leading to degradations of […] to […] points.
  • Pluggability: Upgrading the underlying executor yields incremental rather than disruptive changes, confirming the harness's independence from model details.

    Figure 4: Qualitative comparison highlighting Crafter's consistency across input conditions, preserving structural fidelity and typographic detail.

For the editable-output component, CraftEditor is benchmarked against Edit-Banana and AutoFigure-Edit. It delivers superior seven-axis VLM-assessed output quality, especially for structural and typographic accuracy:

  • CraftEditor overall score: […], compared to […] (AutoFigure-Edit) and […] (Edit-Banana), with ablations indicating that iterative harness-driven composition is crucial ([…] drop without it).

    Figure 6: Editable-output system comparison; CraftEditor maintains semantic structure and editability, outperforming prior methods on text, arrow, and layout fidelity.

Failure Analysis

Despite significant improvements, residual failures remain, primarily:

  • Intent Reasoner errors: Multi-panel captions may be collapsed to a single layout if not correctly parsed, propagating a "dropped panels" error.
  • Boundary discontinuities in mask completion: The infill may visually clash with preserved regions at the mask boundary.
  • Under-determined sketch-conditioned outputs: The generator may reproduce the structural skeleton without populating essential illustrative content.

    Figure 8: Representative failure cases (missed panels, boundary artifacts, and undercompleted skeletons), isolating key remaining bottlenecks in the harness.

Implications and Future Directions

This work demonstrates that systematic, agentic orchestration and structured spec-driven correction can fundamentally strengthen robustness and generalizability in complex scientific image synthesis and editing tasks. Harness architectures efficiently decouple task planning, plan diversity, formal critique, and semantic correction from the idiosyncrasies of underlying generators, thus future-proofing pipelines as executors improve. The same methodology is extensible to structured-output domains, offering a practical path for scalable, verifiable, and editable content synthesis across scientific, educational, and technical documentation. Key research avenues include further coverage expansion in benchmarks (e.g., more infographics), refining intent reasoning and critic modules, and integrating more advanced execution and verification primitives.

Conclusion

Crafter and CraftEditor, leveraging multi-agent harnesses around black-box generators, set new standards for scientific figure synthesis and structural editability under diverse, realistic input conditions (2605.30611). The strict, compositional separation of planning, execution, and error correction yields robust performance gains over previous agentic and generative approaches and establishes a blueprint for next-generation structured visual content pipelines.

Explain it Like I'm 14

What this paper is about

This paper shows a new way to help researchers make clear, good-looking scientific figures (like diagrams, posters, and infographics) faster, and make them easy to edit afterward. The authors build two tools:

  • Crafter: a "team" of AI helpers that plans, draws, checks, and fixes figures from different kinds of inputs (not just text).
  • CraftEditor: a tool that turns a regular image (pixels) into an editable drawing (shapes and text) so you can tweak parts without redrawing everything.

They also create CraftBench, a new test set to fairly check if these tools work well across many figure types and input situations.

What questions the paper asks

In simple terms, the paper asks:

  • Can we build one flexible system that makes many kinds of scientific figures, starting from text, sketches, partial layouts, or references?
  • Can we keep improving a figure by finding and fixing small mistakes step by step, instead of just retrying from scratch?
  • Can we turn non-editable pictures into editable diagrams that keep the original layout?
  • Does this approach actually beat existing methods on a wide range of tasks?

How the approach works (in everyday language)

A "harness": like a coach and checklist around a drawing engine

The core idea is a "harness," which you can think of as a coach plus a shared checklist wrapped around an image generator. Instead of hoping one big model gets everything right in one go, the harness:

  • Plans what to draw,
  • Generates images,
  • Checks for specific problems,
  • Updates a structured checklist (not just messy notes),
  • Tries again with targeted fixes.

This is different from just "prompting harder." It's more like organizing a team with roles and a plan.

Crafter: a team of AI helpers with clear jobs

Crafter uses several specialized "agents," like teammates:

  • Intent reasoner: understands what the figure should communicate and drafts the first checklist.
  • Plan generator: proposes several different layout ideas at once (like different poster layouts).
  • Image generator: produces pictures for those plans.
  • Critic: reviews the result and points out specific issues (e.g., a label is unreadable, an arrow points wrong).
  • Specification refiner: updates a structured "spec" with precise edits (e.g., "increase title font size," "align the two boxes").
  • Convergence judge: decides if the figure is good enough or if they should refine more.

Three simple but powerful ideas make this work:

  • Explore multiple plans first: Try a few layout ideas up front so you don't waste time fixing a bad layout.
  • Structured fixes, not messy notes: Keep one clean, evolving checklist rather than piling on confusing instructions.
  • Verify, then refine: Use a critic that gives detailed feedback and recheck after each fix.

CraftEditor: make images truly editable

Most AI image tools give you pixel images (like photos), which are hard to edit piece by piece. CraftEditor converts these into SVGs (vector graphics), which are made of shapes and text you can move, recolor, or relabel easily.

It does this in three phases:

  1. Extraction: cleans the image and separates pieces (icons, arrows, boxes) without relying only on fragile segmentation.
  2. Processing: captions and classifies elements to understand what they are.
  3. Composition: assembles everything into an SVG, then iteratively checks and fixes layout problems (like overlapping elements or wrong arrow endpoints).

Quick glossary

  • Raster image: a pixel picture (like a smartphone photo). Hard to edit details precisely.
  • Vector image (SVG): shapes and text (like a diagram in a slide). Easy to move, resize, and edit.
  • Benchmark: a standardized test set to compare methods fairly.

What they built to test it: CraftBench

CraftBench contains 279 real examples across:

  • Three figure styles: academic diagrams, posters, and infographics.
  • Four input conditions: text-only, sketch-conditioned, key-element composition (you provide key parts to include), and mask-completion (fill in a specified area).

It includes quality checks by humans and uses an AI judge in a careful way to score results on things like faithfulness, readability, and how well the input instructions were followed.

What they found and why it matters

  • Crafter consistently beat both standalone image generators and other "agent" systems on two benchmarks (including their new CraftBench). It worked across all figure types and input conditions, not just one niche.
  • When they removed any one of Crafter's three key ideas (plan exploration, structured fixes, or the verify-then-refine loop), performance dropped noticeably. This shows each part is important.
  • CraftEditor produced higher-quality editable SVGs than baseline methods, especially on tricky structural details like text alignment and arrow connections. That means it's better at turning pictures into properly editable diagrams.

Why this matters:

  • Researchers often need to tweak small parts of figures: fix a label, swap an icon, move a box. Crafter plus CraftEditor makes this far easier.
  • The harness is "plug-and-play": as better image generators arrive, you can swap them in without changing the overall system.
  • The idea of a harness (plan, check, fix, repeat) could help in other tasks that need structured, precise results, not just pretty images.

Simple takeaway

Making scientific figures is hard and time-consuming. This paper shows a practical way to organize AI helpers so they can draft, check, and refine figures more reliably, and then turn those figures into easy-to-edit diagrams. The result saves time, reduces frustration, and gives researchers more control over their work.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following concrete gaps identify what remains uncertain, untested, or unexplored, and where future work can act.

  • Benchmark scope is narrow relative to real research needs: CraftBench covers only three figure types and four input conditions; it excludes charts/plots, data-driven visualizations, math-heavy figures, CAD/circuit/molecular diagrams, 3D/interactive/animated figures, and very large posters with 100+ elements.
  • Sample size and sampling bias: 279 curated samples may be too small to stress-test generalization; curation from arXiv/award posters/blogs may bias styles and domains.
  • Language and script diversity are untested: no evaluation on non-Latin scripts, bilingual figures, right-to-left layouts, or mathematical typesetting fidelity.
  • Accessibility and publishing compliance are unaddressed: no evaluation of color-blind palettes, font embedding, alt-text generation, reading order, or publisher style-guide conformance.
  • Real-world constraint handling is not evaluated: support for strict aspect ratios, page/grid systems, margins/bleeds, corporate branding, color palettes, and typographic hierarchies is unspecified.
  • No measurement of user impact: there is no user study on author time saved, number of manual edits avoided, or perceived usefulness in real paper-writing workflows.
  • VLM-as-judge reliability is under-validated: limited analysis of inter-judge agreement, sensitivity to judge choice/prompts/tie-band calibration, or correlation with expert human ratings across figure types.
  • Lack of standardized structural metrics for generation: beyond referenced VLM scoring, there are no objective geometric/layout metrics (e.g., IoU for bounding regions, arrow endpoint error, label placement/overflow rates).
  • Editable-output fidelity is insufficiently quantified: CraftEditor's "coordinate-faithful" claim lacks precise metrics (e.g., per-element alignment deviation, overlap violations, text overflow counts) reported against ground truth.
  • Evaluation data for CraftEditor is not independent: SVG conversion is assessed on 80 Crafter outputs rather than diverse in-the-wild rasters with ground-truth vectors, risking pipeline-overfitting.
  • Icons remain raster in the SVG: the method injects raster assets into a vector skeleton; true vectorization of icons, font glyphs, and shape boundaries (with acceptable simplification) is not addressed.
  • Text handling is under-specified: no OCR accuracy, font mapping, ligatures, math notation, non-English scripts, or fallback strategies are reported for SVG composition.
  • Layering, grouping, and semantics in SVG are not evaluated: object grouping, z-order consistency, reusable symbols, anchors, and semantic tags (for downstream edits or accessibility) are unmeasured.
  • Arrow routing and connector logic are heuristic: the programmatic checkers audit endpoints, but path routing algorithms, avoidance of occlusions, and connector reflow under edits are not described or benchmarked.
  • Harness convergence is empirical only: no analysis of convergence guarantees, regression frequency, or conditions under which verify-then-refine oscillates or collapses.
  • Sensitivity to hyperparameters is unclear: the adaptive selection rules for number of plans K, rounds T, and decoding temperatures are not specified or validated; robustness to these settings is unknown.
  • Compute and latency costs are opaque: per-sample wall-clock time, GPU-hours, token usage, and cost vs. quality trade-offs for K, T, and critic complexity are not reported.
  • Reproducibility risks from proprietary components: dependence on closed models (e.g., Gemini judges/backbones) and instability/safety refusals (e.g., GPT-Image-2) threatens replicability across environments.
  • Breadth of executor-agnostic claim is untested: only two backbones are used; compatibility with diffusion-based open models, layout-aware renderers, or code-generation executors for rendering is not demonstrated.
  • Prompt robustness is unexamined: sensitivity to prompt phrasing, prompt injection from user-provided context/docs, and cross-LLM portability of agent prompts are not studied.
  • Failure mode taxonomy is missing: while examples exist, there is no systematic categorization/quantification of failure types (e.g., label garbling, misaligned connectors, missed key elements) and their root causes.
  • Safety, copyright, and privacy are not discussed: handling of copyrighted icons/logos, extraction of sensitive text from figures, and compliance with dataset licensing are unspecified.
  • Input-constraint fidelity beyond included tasks is untested: e.g., strict layout preservation from sketches, color-palette locking, element "do-not-move" constraints, and brand assets with legal must-haves.
  • Post-editing interoperability is unverified: performance and stability when opening/editing generated SVGs in common tools (Illustrator, Inkscape, Figma, draw.io) are not evaluated.
  • Human-in-the-loop workflows are absent: no protocol for integrating user corrections into the structured specification S, conflict resolution for competing constraints, or guarantees that local edits remain local.
  • Generalization beyond scientific figures is unknown: applicability to UI mockups, technical schematics, or instructional diagrams with different conventions is not established.
  • Statistical significance is not reported: no uncertainty estimates or significance tests for headline improvements across benchmarks/tasks.
  • Robustness to adversarial or contradictory inputs is untested: how the harness behaves when the conditioning input conflicts with instructions or contains misleading elements remains open.
  • Long-horizon editing stability is unclear: cumulative degradation across many refinement rounds or large edit sequences is not measured.
  • Data-binding and chart generation are out of scope: converting data tables to plots or ensuring numerical fidelity within charts is not addressed, yet common in scientific figures.
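
Two of the gaps above name bounding-region IoU and arrow endpoint error as examples of objective geometric metrics. For reference, here is a minimal sketch of how these two quantities are commonly computed. It is not taken from the paper: the `(x, y, width, height)` box convention and the function names are assumptions.

```python
import math
from typing import Tuple

BoxT = Tuple[float, float, float, float]  # assumed convention: (x, y, width, height)
Point = Tuple[float, float]


def iou(a: BoxT, b: BoxT) -> float:
    """Intersection over union of two axis-aligned boxes; 0.0 if they are disjoint."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def endpoint_error(predicted: Point, reference: Point) -> float:
    """Euclidean distance between a predicted and a reference arrow endpoint."""
    return math.hypot(predicted[0] - reference[0], predicted[1] - reference[1])
```

For two 10 x 10 boxes offset horizontally by 5 units, the intersection is 50 and the union is 150, so the IoU is 1/3. Both metrics need ground-truth geometry to compare against, which is what the gaps above note is currently missing.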

Practical Applications

Immediate Applications

The paper's harness-based systems (Crafter for generation and CraftEditor for raster-to-SVG conversion) are deployable with today's image/VLM backends and can be integrated into existing authoring and design workflows. Below are concrete, sector-linked use cases.

  • Academia and Education
    • Scientific figure and poster authoring assistants (Overleaf/LaTeX, MS PowerPoint, Google Slides, Keynote)
    • Tools/workflows: "Crafter Assistant" plugin to generate figures from section text, sketches, or partial layouts; "CraftEditor" button to turn final rasters into editable SVGs for last-minute label fixes, color changes, or panel rearrangements.
    • Assumptions/dependencies: Access to an image-generation backend (e.g., Nano Banana, GPT-Image); VLM availability for the critic; compute budget for multi-plan exploration and iterative refinement; privacy controls for manuscript content.
    • Rapid classroom material creation (lesson diagrams, lab workflows, assessment infographics)
    • Tools/workflows: Sketch-to-diagram via tablet; library of curriculum-aligned icons; export to Slides/Canva/Figma as SVG.
    • Assumptions/dependencies: Teacher-provided scaffolds; domain-specific vocabularies; human review for accuracy.
  • Software/IT and Enterprise Documentation
    • Architecture/API/DevOps diagrams and runbooks
    • Tools/workflows: Generate drafts from README/specs; key-element composition using company icon packs; SVG handoff to draw.io, Figma, or Mermaid editors; iterative fix loop to enforce arrow endpoints, grid alignment, readable labels.
    • Assumptions/dependencies: Icon libraries and naming conventions; style-guide prompts; on-prem/intranet deployment for source confidentiality.
    • Knowledge base and SOP visualization (Confluence/Notion integrations)
    • Tools/workflows: Text-to-figure and sketch-conditioned refinements; CraftEditor to convert legacy rasters for ongoing maintenance.
    • Assumptions/dependencies: Authentication and content-safety filters; governance for edits.
  • Marketing and Communications (Cross-industry)
    • Research blog infographics and explainer visuals
    • Tools/workflows: Draft multiple layouts with diversity-driven plan exploration; choose the best framing; localize colors/labels via SVG edits.
    • Assumptions/dependencies: Brand color/style prompts; legal check for iconography licensing.
  • Healthcare and Life Sciences
    • Patient education leaflets and SOP flowcharts
    • Tools/workflows: Sketch-conditioned or key-element composition from clinical guidelines; CraftEditor for standardized icons and unambiguous arrows/text.
    • Assumptions/dependencies: Clinical review; controlled vocabulary and safety symbols; strict PHI handling.
  • Finance and Regulated Industries
    • Compliance workflows (KYC/AML), risk process diagrams
    • Tools/workflows: Encode house rules as typed edits (e.g., mandatory disclaimers, label size minima); SVG edits for audit trails.
    • Assumptions/dependencies: Human-in-the-loop signoff; policy rule packs; reproducibility logs.
  • Conferences and Events
    • Poster layout drafting and revision
    • Tools/workflows: Generate multi-column/poster layouts from abstracts; mask-completion to update a section without full re-render; SVG export for printer-ready tweaks.
    • Assumptions/dependencies: High-resolution rendering; printer specs; time budget for iterative verification.
  • Accessibility and Localization
    • Readability improvements and language swaps
    • Tools/workflows: Use CraftEditor to replace rasterized text with real vector text; enlarge fonts; adjust color palettes for contrast; swap labels to another language while preserving layout.
    • Assumptions/dependencies: Translation resources; accessibility guidelines (e.g., WCAG) encoded as checks; font availability.
  • Research and QA of Generative Systems
    • Benchmarking and regression testing with CraftBench
    • Tools/workflows: Evaluate new figure generators or agentic pipelines across types and input conditions using the provided VLM-based protocol.
    • Assumptions/dependencies: Consistent VLM judge; calibration to reduce bias; dataset license compliance.
  • Design Tool Extensions
    • Figma/Illustrator/Draw.io plugins for raster-to-editable conversion
    • Tools/workflows: One-click "Make Editable (SVG)" powered by CraftEditor; hybrid critic ensures arrow endpoints, text bounding, and non-overlap.
    • Assumptions/dependencies: Plugin APIs; local processing or secure cloud execution; possible content-safety refusals by backends.
  • Team-wide Consistency for Figure Families
    • Cohesive multi-figure sets within a paper/report
    • Tools/workflows: Carry a shared specification (colors, fonts, grid, iconography) across figures; typed edits enforce global constraints.
    • Assumptions/dependencies: Project-level style guide captured in prompts/spec memory; versioning of specifications.

Long-Term Applications

These require further research, scaling, domain checkers, or tighter systems integration but follow directly from the paper's harness pattern (executor-agnostic, structured spec as memory, directive critics, iterative refinement).

  • Generalized Harness for Structured Visual Outputs
    • UI wireframes, dashboard layouts, slide decks, and presentations
    • Potential products: "SlideCrafter" for deck auto-layout; "DashCrafter" for BI dashboards with chart/legend/label checks.
    • Dependencies: Domain-specific critics (e.g., chart semantics, responsive layouts); integrations with PowerPoint, Google Slides, or BI tools.
  • CAD/Engineering and Safety-Critical Diagramming
    • P&ID, electrical schematics, robotics cell layouts
    • Potential products: "EngCraftEditor" with programmatic checkers for standards (ISO/IEEE symbols, line types, clearances).
    • Dependencies: High-precision vector extraction; strict rule engines; certification and traceability; domain iconography libraries.
  • Automated Policy/Standards Compliance for Diagrams
    • Enforcing style guides and regulatory rules at authoring time
    • Potential products: Rule packs for ACM/IEEE formatting; enterprise branding rule modules; compliance dashboards.
    • Dependencies: Formalized rule sets; explainable failure reports; org-wide deployment and governance.
  • Accessibility-by-Design and Audited Explainability
    • Auto-generation of alt text, reading order, and colorblind-safe palettes
    • Potential products: Accessibility critic modules; auto-contrast fixers; export with embedded accessibility metadata.
    • Dependencies: Reliable text extraction and semantics; WCAG rule codification; multilingual alt-text generation and validation.
  • Cross-lingual and Cultural Localization with Reflow
    • Automatic re-layout for text expansion/contraction across languages
    • Potential products: "Localize & Reflow" services integrated with TMS (translation management systems).
    • Dependencies: Text measurement and reflow constraints; locale-specific typographic rules; QA workflows.
  • Knowledge-Graph-to-Diagram and Diagram-to-Spec Loops
    • Bidirectional mappings between structured specs and visuals
    • Potential products: Import/export to JSON/YAML diagram specs; automated consistency checks between docs and figures.
    • Dependencies: Robust inverse parsing; schema agreements; doc mining.
  • Co-creative Design Agents ("AI Pair Designer")
    • Real-time, mixed-initiative editing: humans sketch; agent proposes, critiques, and applies typed edits with live previews
    • Potential products: IDE-like design environment with critic linting for visuals.
    • Dependencies: Low-latency VLMs; streaming diff/preview pipelines; stronger non-monotonicity guards.
  • Synthetic Data Generation for Training
    • Large corpora of high-diversity, high-fidelity structured figures with annotations
    • Potential products: Data services for training diagram code generators or OCR/layout models.
    • Dependencies: Quality control at scale; licensing/attribution; diversity coverage across domains.
  • End-to-End Publication Automation
    • From manuscript sections to numbered, cross-referenced figures with consistent styles and auto-updated captions
    • Potential products: “PaperOps” pipelines integrated with Overleaf/CI; proofing bots for figures.
    • Dependencies: Reliable section summarization; citation and cross-ref syncing; editorial acceptance.
  • Interactive and Data-Linked Figures
    • Generate SVGs that bind to underlying datasets for live updates
    • Potential products: Data-bound figures in Jupyter/RMarkdown with auto-propagated layout constraints.
    • Dependencies: Stable data bindings; separation of content vs. presentation; critics for data semantics.
  • Policy Dashboards and Public Engagement
    • From long-form policy text to explainer visuals and interactive summaries
    • Potential products: Civic-tech hubs producing standardized, editable visuals for public portals.
    • Dependencies: Fact-checking critics; misinformation safeguards; human oversight in sensitive domains.
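
The accessibility item above mentions WCAG rule codification and auto-contrast fixers. As a minimal sketch of what one such codified rule could look like, the following computes the WCAG 2.x contrast ratio between a text color and its background. The function names are hypothetical and are not taken from the paper or its code; only the luminance formula and the 4.5:1 and 3:1 thresholds come from WCAG.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light (WCAG 2.x definition)."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio; the lighter color always goes in the numerator."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """Hypothetical critic rule: WCAG AA minimum contrast (4.5:1, or 3:1 for large text)."""
    return contrast_ratio(foreground, background) >= (3.0 if large_text else 4.5)
```

A critic built this way could flag every label in an SVG whose fill fails `passes_aa` against the fill of its containing shape, which gives the failure report a concrete element to point at.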

Notes on feasibility across applications:

  • All harness capabilities depend on the availability and stability of image/VLM backends and instructable image editors; content-safety filters may block certain inputs.
  • Iterative loops and plan exploration incur latency and cost; on-device or on-prem deployments may be needed for privacy or speed.
  • CraftEditor’s quality hinges on clean extraction and robust hybrid critics; highly cluttered or photorealistic imagery remains challenging.
  • Legal/IP for icon libraries and brand assets must be addressed; human review is essential in regulated sectors.
  • Encoding domain rules into directive critics and programmatic checkers is the main development effort when transferring to new verticals.
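
To make the last note more concrete, here is a minimal sketch of a programmatic checker that audits two of the structural properties named in the paper (element overlap and missing components) on axis-aligned bounding boxes. The `Element` type, the function names, and the issue strings are assumptions for illustration; the paper's actual checkers are not shown in this summary.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Iterable, List


@dataclass(frozen=True)
class Element:
    """Hypothetical layout element: a named axis-aligned bounding box."""
    name: str
    x: float
    y: float
    width: float
    height: float


def overlaps(a: Element, b: Element) -> bool:
    """True if the two boxes share interior area (touching edges do not count)."""
    return (
        a.x < b.x + b.width and b.x < a.x + a.width
        and a.y < b.y + b.height and b.y < a.y + a.height
    )


def check_layout(elements: List[Element], required: Iterable[str]) -> List[str]:
    """Return human-readable issues: pairwise overlaps, then missing components."""
    issues = []
    for a, b in combinations(elements, 2):
        if overlaps(a, b):
            issues.append(f"overlap: {a.name} / {b.name}")
    present = {e.name for e in elements}
    for name in sorted(set(required) - present):
        issues.append(f"missing: {name}")
    return issues


demo_elements = [
    Element("encoder", 0, 0, 100, 50),
    Element("decoder", 80, 20, 100, 50),   # intrudes into the encoder box
    Element("legend", 300, 0, 50, 50),
]
demo_issues = check_layout(demo_elements, required={"encoder", "decoder", "loss"})
```

Porting to a new vertical would then mostly mean adding domain-specific rules of this shape (for example, minimum clearances or permitted symbol sets) rather than changing the surrounding loop.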

Glossary

  • Ablations: Controlled experiments where components of a system are removed or altered to assess their contribution. "Ablations validate each mechanism"
  • Acceptance thresholds: Predefined minimum scores on evaluation criteria that determine whether to stop refinement. "already satisfies acceptance thresholds on critical dimensions."
  • Agentic baseline: A comparative system that uses agent-based orchestration to perform the task, serving as a benchmark. "outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench"
  • Agentic pipelines: Multi-agent workflows that coordinate planning and generation to produce outputs. "agentic pipelines that pair planning agents with powerful image generators to produce visually polished figures from text"
  • Backbone: The primary underlying model used for generation; stronger backbones are often sought to improve quality. "demand not a stronger backbone but a harness."
  • Best-so-far checkpoint: A saved version of the highest-scoring output so far, used to revert if later iterations degrade. "with a best-so-far checkpoint that reverts to $a^*$ whenever the current round regresses"
  • Calibrated tie band: A predefined score margin within which outcomes are considered ties to reduce noise in comparative evaluation. "under a calibrated tie band."
  • Closed-loop verification: A feedback process where outputs are checked against the original intent and iteratively corrected. "enabling targeted correction of individual failure points and closed-loop verification against the original intent."
  • Code-generation methods: Approaches that produce editable figure code (e.g., diagram code) rather than pixels. "code-generation methods that synthesize editable diagrams in TikZ or similar formats"
  • Conditioning input: Additional input (e.g., sketches, masks, elements) that constrains or guides generation. "whether a system generalizes across figure types or preserves a user's conditioning input."
  • Convergence judge: An agent that decides whether to accept, refine, or revert during iterative generation. "the convergence judge routes each round to accept, refine, or revert to Final output."
  • Coordinate-faithful: Preserving precise spatial positions and relationships when converting or reproducing graphics. "into a coordinate-faithful editable SVG $\mathbf{v}$"
  • Directive critic: An evaluator that provides actionable, targeted corrections rather than only scalar scores. "a verify-then-refine loop in which a directive critic issues targeted corrections rather than scalar scores."
  • Diversity-driven plan exploration: Generating multiple candidate plans upfront to explore different compositions before refinement. "diversity-driven plan exploration generates multiple candidate framings in parallel"
  • Early-exit gate: A mechanism to stop the iterative loop as soon as quality thresholds are met. "An early-exit gate bypasses the loop when the first-round output already satisfies acceptance thresholds on critical dimensions."
  • Executor-agnostic: Designed to work with different underlying generators or executors without architectural changes. "Because the harness is executor-agnostic"
  • Free-text revision: Unstructured natural-language instructions appended across iterations, which can lead to contradictions. "Iterative repair via free-text revision instructions degrades rapidly"
  • Hallucination filter: A safeguard that discards spurious or irrelevant extractions produced during processing. "with a hallucination filter discarding blank, mismatched, or text-only extractions"
  • Harness: An orchestration layer around a generator that plans, verifies, and revises outputs using structured memory. "What is needed in both settings is not a better generator but a harness"
  • Hybrid critic: A combined evaluator using both model-based judgments and programmatic checks. "via a hybrid critic that combines two complementary channels"
  • Intent reasoner: An agent that infers goals and required elements from inputs to seed the initial specification. "An intent reasoner analyzes $(\mathbf{c}, \mathbf{q})$ and infers the figure's communicative role"
  • Inverse parsing: Recovering a structured representation from generated diagrams to evaluate structural fidelity. "inverse-parses generated diagrams into structured graphs"
  • Keep/delete plan: A plan specifying which elements to retain or remove during extraction. "authors a per-figure keep/delete plan $p_t$ specifying which elements to preserve and which to remove."
  • Mask-completion: A task in which the generator fills in the masked regions of a partially specified figure. "mask-completion ($30$)"
  • Non-monotonic: Performance that does not consistently improve across iterations and may regress. "language-model-driven iterative editing is empirically non-monotonic"
  • OCR: Optical Character Recognition; extracting text from images for downstream processing. "converts segmentation and OCR outputs into DrawIO cells."
  • Pluggable: Replaceable without changing the surrounding system, enabling modular swap-in of components. "$\mathcal{E}$ is pluggable, so all task-specific behavior resides in the prompts"
  • Position bias: A bias arising from the placement of items in pairwise comparisons, which can skew judgments. "which removes the position bias of pairwise comparison."
  • Programmatic checkers: Automated rules or scripts that validate structural properties of an output. "programmatic checkers auditing structural properties (text overflow, arrow-endpoint accuracy, element overlap, missing components)"
  • Raster-to-vector conversion: Transforming pixel-based images into vector graphics for editability. "recent raster-to-vector attempts remain limited by unreliable element extraction and fragile composition."
  • Reference-conditioned tasks: Generation tasks that use auxiliary inputs like sketches or elements as constraints. "the three reference-conditioned tasks"
  • Segmentation: Partitioning an image into regions or elements, often used to isolate components. "off-the-shelf segmentation"
  • Specification refiner: An agent that writes structured edits to the evolving specification based on critiques. "the specification refiner $\mathcal{R}$ writes typed edits into $\mathcal{S}$"
  • Structured corrective layer: A mechanism that accumulates typed, structured edits instead of free text to avoid contradictions. "The structured corrective layer replaces free-text accumulation with typed edits on $\mathcal{S}$"
  • Structured memory: A persistent, structured record (specification) shared across agents to track plans and revisions. "share an evolving figure specification as the pipeline's structured memory"
  • SVG skeleton: A scaffold SVG structure with placeholders used to assemble final vector graphics. "an SVG skeleton and iteratively refines the result"
  • TikZ: A LaTeX-based language for programmatically creating diagrams and graphics. "in TikZ or similar formats"
  • Typed edits: Structured, typed operations applied to the specification to guide revisions coherently. "writes typed edits into $\mathcal{S}$"
  • VLM-as-judge: Using vision-language models (VLMs) as evaluators for generated images. "Our evaluation keeps the referenced VLM-as-judge philosophy"
  • Weighted mean: An average where different aspects are combined with specified weights to compute a total score. "A weighted mean turns the per-aspect scores into one total per image"
  • Verify-then-refine loop: An iterative process that evaluates outputs, issues directives, and revises until criteria are met. "verify-then-refine loop"
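
Several glossary entries (verify-then-refine loop, early-exit gate, best-so-far checkpoint, pluggable executor) describe one control flow. The sketch below shows how those pieces could fit together. It is an illustration under assumptions, not the paper's algorithm: the function name, the single scalar score, and the `generate`/`critique`/`refine` callables are all hypothetical stand-ins for the executor, critic, and specification refiner.

```python
from typing import Any, Callable, List, Tuple


def verify_then_refine(
    spec: Any,
    generate: Callable[[Any], Any],                     # pluggable executor
    critique: Callable[[Any], Tuple[float, List[str]]], # returns (score, directives)
    refine: Callable[[Any, List[str]], Any],            # applies directives to a spec
    accept_threshold: float,
    max_rounds: int = 4,
) -> Tuple[Any, float]:
    """Return the best artifact seen and its score."""
    best_score = float("-inf")
    best_artifact = None
    best_spec = spec
    best_directives: List[str] = []
    for _ in range(max_rounds):
        artifact = generate(spec)
        score, directives = critique(artifact)
        if score > best_score:
            # New best-so-far checkpoint.
            best_score, best_artifact = score, artifact
            best_spec, best_directives = spec, directives
        if best_score >= accept_threshold:
            # Early exit: also fires on round one if the first draft is good enough.
            break
        # Always refine from the checkpoint, so a regressed round is discarded.
        spec = refine(best_spec, best_directives)
    return best_artifact, best_score


# Toy stand-ins with scripted scores; the second draft regresses on purpose.
scripted_scores = {"draft-0": 0.5, "draft-1": 0.3, "draft-2": 0.9}
calls: List[Any] = []


def demo_generate(spec: List[str]) -> str:
    calls.append(spec)
    return f"draft-{len(calls) - 1}"


def demo_critique(artifact: str) -> Tuple[float, List[str]]:
    return scripted_scores[artifact], [f"fix issues in {artifact}"]


def demo_refine(spec: List[str], directives: List[str]) -> List[str]:
    return spec + directives


result = verify_then_refine([], demo_generate, demo_critique, demo_refine,
                            accept_threshold=0.8)
```

In the toy run the regressed second draft never replaces the checkpoint, and the loop stops as soon as a draft clears the threshold. Keeping the checkpoint separate from the working specification is what guards against the non-monotonic behavior described in the glossary.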
