AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations
Abstract: High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs covering diverse text-to-illustration tasks drawn from scientific papers, surveys, blogs, and textbooks. We further propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations from long-form scientific text. Before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, yielding illustrations with both structural completeness and aesthetic appeal. Leveraging the high-quality data in FigureBench, we conduct extensive experiments comparing AutoFigure against a range of baseline methods. The results demonstrate that AutoFigure consistently surpasses all baselines, producing publication-ready scientific illustrations. The code, dataset, and Hugging Face Space are released at https://github.com/ResearAI/AutoFigure.
Explain it Like I'm 14
What is this paper about?
This paper is about teaching AI to make clear, good-looking scientific diagrams from long pieces of scientific writing. The authors built a large dataset called FigureBench and a smart system called AutoFigure that reads complex text, plans a clean layout, and then draws a polished illustration that’s ready to go in a real research paper.
What questions did the researchers ask?
- Can an AI read long, detailed scientific text and turn it into a diagram that is both accurate and easy to understand?
- What steps does the AI need to take to keep the diagram correct while also looking professional?
- How can we fairly test and compare different AI methods that try to make scientific illustrations?
- Will human experts (the authors of the papers) accept these AI-made figures for publication?
How did they do it?
The dataset: FigureBench
The team created FigureBench, a collection of 3,300 pairs of scientific texts and their matching figures. It includes many types of documents, such as research papers, surveys, blogs, and textbooks. They carefully built a 300-example test set (200 from papers and 100 from other sources) and checked quality with two human reviewers who agreed at a very high rate (Cohen’s kappa = 0.91, which means the reviewers consistently agreed beyond random chance). The remaining 3,000 examples are for development and future training by the community.
Why this matters: If you want to judge how well an AI can make figures, you need lots of real examples that cover different styles and levels of complexity. FigureBench provides that.
The tool: AutoFigure
AutoFigure creates figures in two main stages, using a “think-then-draw” approach:
- Stage I — Planning the “blueprint”
- The AI reads the long text and extracts the main ideas (like the steps in a method, the components of a system, and how they connect).
- It turns these ideas into a structured “blueprint” of the figure (using formats like SVG/HTML). Think of this like a map that says, “Put a box here, an arrow there, and label this part with that name.”
- It uses an internal loop with two roles: an AI “designer” and an AI “critic.” The designer proposes a layout; the critic points out problems (like overlapping elements or confusing flow) and suggests improvements. They repeat until the blueprint is balanced and clear (a minimal code sketch of this loop follows this list).
- Stage II — Drawing and fixing text
- The AI converts the blueprint into a high-quality image with the chosen style (colors, shapes, and overall look).
- Text in images can be blurry, so AutoFigure uses an “erase-and-correct” trick (also sketched in code after this list):
- It erases the text from the image background.
- It uses OCR (a tool that reads text in images) to find the labels.
- It checks and corrects the labels based on the blueprint.
- It overlays crisp, sharp text back onto the image.
- This ensures both the structure and the words are correct and readable.
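To make Stage I concrete, here is a minimal Python sketch of the designer-critic loop, assuming hypothetical `call_designer` and `call_critic` LLM wrappers; the paper's actual prompts, scoring function, and model choices are not reproduced here.

```python
# Minimal sketch of the Stage I critique-and-refine loop.
# call_designer and call_critic are hypothetical stand-ins for LLM calls.
from dataclasses import dataclass


@dataclass
class Layout:
    svg: str          # the markup-based symbolic "blueprint" (SVG/HTML)
    notes: str = ""   # critic feedback this draft was revised from


def call_designer(text: str, feedback: str) -> str:
    """Hypothetical LLM call: propose or revise an SVG blueprint for `text`."""
    raise NotImplementedError("wire up an LLM client here")


def call_critic(text: str, svg: str) -> tuple[float, str]:
    """Hypothetical LLM call: score a blueprint and describe its problems
    (overlapping elements, confusing flow, missing components)."""
    raise NotImplementedError("wire up an LLM client here")


def critique_and_refine(text: str, max_iters: int = 5) -> Layout:
    """Alternate designer and critic, keeping the best-scoring draft seen."""
    best = Layout(svg=call_designer(text, feedback=""))
    best_score, feedback = call_critic(text, best.svg)
    for _ in range(max_iters - 1):
        draft = Layout(svg=call_designer(text, feedback), notes=feedback)
        score, feedback = call_critic(text, draft.svg)
        if score > best_score:
            best, best_score = draft, score
    return best
```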
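Likewise, a minimal sketch of the Stage II erase-and-correct step, assuming `pytesseract` for OCR and Pillow for drawing; since the paper's exact OCR engine and background-repair method are not detailed in this summary, this version simply whites out each detected text box instead of inpainting it.

```python
# Minimal erase-and-correct sketch: OCR the rendered image, snap each string
# to the nearest blueprint label, erase the original pixels, and overlay
# crisp text. Assumes pytesseract and Pillow are installed.
import difflib

import pytesseract
from PIL import Image, ImageDraw


def erase_and_correct(image: Image.Image, blueprint_labels: list[str]) -> Image.Image:
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for word, x, y, w, h, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        if not word.strip() or float(conf) < 0:  # skip empty/non-word boxes
            continue
        # Verify and correct the OCR string against the blueprint labels.
        match = difflib.get_close_matches(word, blueprint_labels, n=1, cutoff=0.6)
        corrected = match[0] if match else word
        draw.rectangle([x, y, x + w, y + h], fill="white")  # "erase" (no inpainting)
        draw.text((x, y), corrected, fill="black")          # sharp overlay text
    return out
```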
How they judged the results
- AI-as-a-judge: They used a vision-LLM (an AI that understands both images and text) to score figures for:
- Visual design (looks and professional finish),
- Communication (clarity and logical flow),
- Content fidelity (accuracy and completeness).
- Blind comparisons: The AI judge saw the text plus two figures (one original, one AI-made) and had to pick which was better without knowing which was which (a code sketch of this protocol follows this list).
- Human experts: They asked first authors of 21 papers to rate figures for accuracy, clarity, and aesthetics, and to decide which ones they’d actually publish.
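A minimal sketch of the blind pairwise protocol, with a hypothetical `vlm_judge` call standing in for the judge model and prompt; randomizing presentation order is what keeps the judge from knowing which figure is the reference.

```python
# Minimal sketch of blind pairwise comparison and win-rate computation.
import random


def vlm_judge(text: str, figure_a: bytes, figure_b: bytes) -> str:
    """Hypothetical VLM call: given the source text and two anonymized
    figures, answer "A" or "B" for the better one."""
    raise NotImplementedError("wire up a vision-LLM client here")


def blind_win_rate(cases: list[tuple[str, bytes, bytes]], seed: int = 0) -> float:
    """Each case is (text, reference_figure, generated_figure)."""
    rng = random.Random(seed)
    wins = 0
    for text, reference, generated in cases:
        if rng.random() < 0.5:                       # generated shown first
            wins += vlm_judge(text, generated, reference) == "A"
        else:                                        # reference shown first
            wins += vlm_judge(text, reference, generated) == "B"
    return wins / len(cases)
```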
What did they find?
Here are the main results from their tests and comparisons:
- AutoFigure consistently beat other methods across document types (blogs, surveys, textbooks, and papers). It had high overall scores and win-rates in blind comparisons.
- In textbooks, AutoFigure’s figures won 97.5% of comparisons, showing strong clarity and teaching quality.
- In papers, AutoFigure still led, with a 53.0% win-rate against the original references in blind tests (a tough category with very detailed content).
- Human experts rated AutoFigure’s figures far higher than those from other AI systems. Most importantly, 66.7% said they would include AutoFigure’s figures in their final, camera-ready paper.
- The “designer-and-critic” refinement loop clearly helped: more thinking iterations improved scores.
- Stronger reasoning models and structured formats (like SVG/HTML) produced better figures than less structured approaches (like building slides piece by piece).
Why this matters: It shows that careful planning (the blueprint) plus thoughtful rendering (the final image) helps an AI produce figures that are both accurate and attractive—good enough for real science communication.
Why does this research matter?
- Scientific diagrams help people understand complex ideas quickly. Good figures can save researchers and students hours of confusion.
- Making these by hand takes a lot of time and skill. AutoFigure can speed this up, helping scientists share results more clearly and faster.
- As AI systems start doing more science tasks, they need to explain their ideas visually. AutoFigure fills a big gap by turning complex findings into understandable diagrams.
What could happen next?
- Researchers could use AutoFigure in writing tools (like paper editors or slide makers) to auto-generate figures from their drafts.
- Teachers and students might use it to turn lessons or study notes into clear visual summaries.
- The released dataset and code can push the field forward, inspiring better tools and new research.
- The authors also discuss ethics: powerful tools can be misused to make misleading figures. To reduce this risk, they require clear attribution and disclaimers, reminding everyone that AI outputs still need expert checking.
Key terms explained
- Blueprint (layout): A detailed plan of where boxes, arrows, and labels go before drawing the final image.
- SVG/HTML: Text-based formats that describe shapes and positions—like a recipe for a drawing.
- OCR (Optical Character Recognition): A tool that reads text inside images and turns it into editable text.
- Vision-LLM (AI judge): An AI that can look at pictures, read text, and give informed scores or choices.
- Cohen’s kappa: A statistic that shows how much two reviewers agree beyond random chance. A high value (like 0.91) means strong agreement.
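As a worked example of that last term, Cohen's kappa for two reviewers can be computed with scikit-learn; the labels below are illustrative, not actual FigureBench annotations.

```python
# Two reviewers label 8 candidate figure pairs as "keep" or "drop".
from sklearn.metrics import cohen_kappa_score

reviewer_1 = ["keep", "keep", "drop", "keep", "drop", "keep", "keep", "drop"]
reviewer_2 = ["keep", "keep", "drop", "keep", "keep", "keep", "keep", "drop"]

# They agree on 7 of 8 items; kappa discounts the agreement expected by chance.
kappa = cohen_kappa_score(reviewer_1, reviewer_2)
print(f"Cohen's kappa = {kappa:.2f}")  # prints 0.71; 1.0 would be perfect
```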
Knowledge Gaps
Below is a concise list of unresolved issues, constraints, and open research directions that the paper does not fully address but that future work could concretely tackle:
- Dataset coverage and scope: The benchmark excludes data-driven charts/plots and may underrepresent multi-panel figures common in publications; extend FigureBench to include quantitative visualizations, multi-panel layouts with subfigure labels, and hybrid figures (schematics + charts).
- Domain generalization: Most examples appear within CS/AI; assess performance across other disciplines (biology, chemistry, physics, medical imaging, social sciences) where visual conventions and semantics differ.
- Multilingual capability: The pipeline and OCR/text refinement are evaluated on English; evaluate and adapt for multilingual long-form texts, right-to-left scripts, and mixed-language figures.
- Mathematical typesetting: The erase-and-correct stage does not address LaTeX/math equations, special symbols, or inline formula rendering; add native equation parsing and vector LaTeX overlays with typographic fidelity.
- Accessibility and compliance: No analysis of colorblind-safe palettes, font legibility at journal column widths, contrast compliance (e.g., WCAG), or alt-text generation; introduce accessibility checks and auto-corrections.
- Ground-truth structure for evaluation: Current evaluation relies on VLM judgments; provide ground-truth symbolic layouts/graphs for test figures to enable structural metrics (graph edit distance, topology preservation, alignment scores); a minimal sketch of one such metric follows this list.
- VLM-as-a-judge reliability: Quantify and improve the consistency, calibration, and bias of VLM evaluations across different judges; measure correlation with large-scale human expert ratings and inter-rater reliability beyond the small human study.
- Small human evaluation sample: The expert study covers 10 authors and 21 papers; expand to statistically powered, multi-disciplinary evaluations with detailed error analyses and publication outcomes.
- Dependency on proprietary models: The pipeline and evaluations rely on closed-source models (e.g., GPT-5, Gemini-2.5, Claude-4.1); reproduce results using open-source backbones to assess replicability and reduce evaluation circularity.
- Curation bias in FigureBench: GPT-5 was used to select representative figures and a VLM filter built from the curated set was used to expand the dev set; quantify potential selection/model biases and validate with independent human-only curation.
- Robustness to imperfect texts: The dataset emphasizes figures where “each key visual element is explicitly described”; evaluate performance on realistic papers with implicit, incomplete, or noisy descriptions and conflicting statements.
- Entity-to-icon semantics: The mapping from extracted entities/relations to visual metaphors/icons is not formally specified or validated; develop taxonomies, mapping rules, and semantic consistency checks for iconography.
- Convergence guarantees for critique-and-refine: The iterative loop stops after a fixed number of iterations N; study convergence behavior, stopping criteria, and diminishing returns, and propose principled optimization objectives for the scoring function q.
- Risk of “judge overfitting”: The refinement loop optimizes using VLM feedback that resembles the evaluation protocol; test with disjoint judge models and adversarial evaluation to rule out optimization-to-the-metric artifacts.
- Style adherence and diversity: The paper presents mostly a unified default style; systematically evaluate style controllability, adherence to user-specified style guides (e.g., ACM/IEEE), and diversity across multiple aesthetics.
- Vector end-to-end outputs: The final rendering produces raster backgrounds with vector text overlays; investigate fully vector outputs (including shapes/icons) for editability, scalability, and print fidelity (e.g., SVG/TikZ end-to-end).
- Support for complex layouts: Assess performance on deeply nested hierarchies, long pipelines (>20 components), densely annotated diagrams, and figures with layered semantics (e.g., zoomed insets, callouts, legends).
- Integration of data visualizations: Extend the method to automatically generate charts/plots from referenced datasets or tables and ensure semantic consistency between schematics and quantitative visuals.
- Error analysis of text refinement: Provide a breakdown of OCR/verification failures (e.g., bounding box mismatches, misalignments, font substitutions, hyphenation issues) and quantify improvements from the erase-and-correct module.
- Cost, latency, and scalability: The efficiency analysis is deferred to the appendix; report detailed compute profiles (time per figure, memory, model calls), cost sensitivity to iteration count, and throughput in realistic workflows.
- Editing and iteration in practice: AutoFigure-Edit is mentioned but not evaluated; benchmark interactive editing workflows (icon replacement, re-layout, text edits) and measure human-in-the-loop productivity gains.
- Adversarial and safety robustness: Evaluate susceptibility to generating misleading-but-plausible diagrams from adversarial or ambiguous inputs; add factual consistency checks, provenance tracking, and usage constraints.
- Journal/publisher compliance: Test alignment with specific venue guidelines (font sizes, line weights, color use, resolution, captioning standards) and provide automatic validation/fix-ups prior to camera-ready submission.
- Reproducibility of prompts and schemas: Although prompts are in the appendix, assess robustness to prompt phrasing changes, release standardized I/O schemas, and publish ablations on prompt sensitivity.
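As one concrete starting point for the structural-metrics gap above, here is a minimal sketch of graph edit distance between a reference layout and a generated one, using networkx; the component names and graphs are illustrative.

```python
# Compare figure topologies as labeled directed graphs via graph edit distance.
import networkx as nx


def layout_graph(edges: list[tuple[str, str]]) -> nx.DiGraph:
    """Build a directed graph of figure components and their arrows."""
    g = nx.DiGraph()
    for src, dst in edges:
        g.add_node(src, label=src)
        g.add_node(dst, label=dst)
        g.add_edge(src, dst)
    return g


reference = layout_graph([("Encoder", "Latent"), ("Latent", "Decoder")])
generated = layout_graph([("Encoder", "Latent"), ("Decoder", "Latent")])  # reversed arrow

# Nodes only match when their labels agree, so edits count real discrepancies.
distance = nx.graph_edit_distance(
    reference, generated,
    node_match=lambda a, b: a["label"] == b["label"],
)
print(f"graph edit distance = {distance}")  # 2: one arrow deleted, one inserted
```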
Practical Applications
Immediate Applications
The following use cases can be deployed with the paper’s released codebase, dataset, and demo (AutoFigure, AutoFigure-Edit, FigureBench), leveraging the Reasoned Rendering pipeline (semantic parsing + critique-and-refine + aesthetic rendering with erase-and-correct).
- Academic publishing: rapid method-figure drafting
- Sectors: academia, publishing, software (tools)
- What: Generate publication-ready conceptual schematics from method/results sections to shorten authoring cycles.
- Tools/workflows: AutoFigure integrated into LaTeX/Overleaf or Word workflows; “AutoFigure-Edit” for post-generation tweaks; export SVG/HTML for journals.
- Assumptions/dependencies: Works best for conceptual (not data-plot) figures; requires long-form, well-structured text; human verification recommended (ethical note from paper).
- Grant proposals and peer reviews: concise mechanism diagrams
- Sectors: academia, funding agencies
- What: Auto-generate pipeline overviews for proposals; reviewers can request/produce quick visual summaries of complex methods.
- Tools/workflows: Document editor plugins; batch generation of figures from proposal drafts; VLM-as-a-judge scoring for quality checks.
- Assumptions/dependencies: Long-context LLM availability; confidentiality safeguards for sensitive submissions.
- Technical documentation and software architecture diagrams
- Sectors: software, DevOps, cybersecurity
- What: Convert design documents/ADRs/runbooks into system diagrams (microservices, data flows, threat models).
- Tools/workflows: GitHub Actions/CI step to render figures from markdown specs; wiki/CMS plugin; SVG output for versioning.
- Assumptions/dependencies: Access to internal docs; accurate entity/relation extraction from prose; security reviews for on-prem deployment.
- Manufacturing and operations: SOP/process schematics
- Sectors: manufacturing, logistics, robotics
- What: Transform standard operating procedures and maintenance manuals into step-by-step flow diagrams for training and audits.
- Tools/workflows: LMS integration; AutoFigure-Edit for icon/text adjustments; printable SVG/PDF output.
- Assumptions/dependencies: Domain terminology mapping; human-in-the-loop validation for safety-critical steps.
- Healthcare communication: clinical pathway and patient education visuals
- Sectors: healthcare, life sciences
- What: Create clear care-pathway, triage, or procedural diagrams from protocols and patient-facing guides.
- Tools/workflows: EHR/clinical knowledge-base export to long-form text; rendering with erase-and-correct for crisp medical terminology.
- Assumptions/dependencies: Medical review for correctness; HIPAA-compliant deployment; current scope excludes quantitative charts.
- Education content creation: textbook- and MOOC-style diagrams
- Sectors: education, edtech
- What: Generate instructional figures from lecture notes/chapters (taxonomies, pipelines, timelines) in consistent styles.
- Tools/workflows: LMS or CMS plugin; style presets (e.g., “textbook” vs. “blog”) to align with course branding.
- Assumptions/dependencies: Teacher approval; alignment to pedagogy; multilingual labels may need manual QA.
- Science communication and media: blog/tech explainer visuals
- Sectors: media, technical marketing
- What: Produce visually polished, accurate schematics from long-form explainers and whitepapers.
- Tools/workflows: CMS extension; editorial style sheets; VLM-as-a-judge for pre-publication QA.
- Assumptions/dependencies: Editorial standards; brand color/style libraries; careful claims review to avoid miscommunication.
- Compliance and policy briefs: regulatory process flowcharts
- Sectors: public sector, finance, energy
- What: Convert regulations/standards into flow diagrams showing decision points and obligations.
- Tools/workflows: Document-to-diagram batch pipelines; SVG for collaborative review; traceability notes embedded in layers.
- Assumptions/dependencies: Legal review; reliance on text fidelity; complex cross-references may require iterative refinement.
- Patent drafting support: conceptual drawings from specifications
- Sectors: legal/IP, R&D
- What: Draft patent figures illustrating system components and interactions from specification text.
- Tools/workflows: Patent-counsel workflow plugin; export to compliant vector formats for patent office submissions.
- Assumptions/dependencies: Jurisdiction-specific drawing rules; attorney oversight; conceptual scope (not detailed mechanical CAD).
- Figure quality QA and remediation for existing documents
- Sectors: publishing, enterprise documentation
- What: Use the VLM-as-a-judge evaluation protocol to score clarity/accuracy/aesthetics of figures; regenerate or refine low-scoring ones.
- Tools/workflows: FigureBench-based automated scoring; AutoFigure regeneration with critique-and-refine; text legibility fixes via erase-and-correct (a minimal workflow sketch follows this list).
- Assumptions/dependencies: Availability of underlying text; VLM-judge alignment with human preferences; maintaining provenance.
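A minimal sketch of that last QA-and-remediation workflow; `score_figure`, `autofigure_generate`, and the 7.0 threshold are hypothetical placeholders, not interfaces from the released codebase.

```python
# Score existing figures, regenerate the low scorers, keep the judged-best.
from typing import Iterator


def score_figure(text: str, figure: bytes) -> float:
    """Hypothetical VLM-as-a-judge call returning an aggregate 0-10 score
    over clarity, accuracy, and aesthetics."""
    raise NotImplementedError


def autofigure_generate(text: str) -> bytes:
    """Hypothetical call into an AutoFigure-style generation pipeline."""
    raise NotImplementedError


def remediate(documents: list[tuple[str, bytes]], threshold: float = 7.0) -> Iterator[bytes]:
    for text, figure in documents:
        if score_figure(text, figure) >= threshold:
            yield figure  # existing figure passes QA unchanged
            continue
        candidate = autofigure_generate(text)
        yield max((figure, candidate), key=lambda f: score_figure(text, f))
```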
Long-Term Applications
These use cases require further research, scaling, or ecosystem adoption (e.g., tighter editor integrations, domain ontologies, or regulatory acceptance).
- End-to-end “AI Scientist” visualization module
- Sectors: AI research, R&D automation
- What: Seamless visualization of automatically discovered methods and results to complement AI-generated manuscripts.
- Tools/workflows: Direct coupling of AutoFigure with autonomous research agents; continuous figure regeneration as hypotheses evolve.
- Assumptions/dependencies: Reliable grounding from machine-generated texts; provenance tracking; strict human oversight.
- Journal and conference submission pipelines with auto-figure checks
- Sectors: academic publishing
- What: Integrated modules that generate, validate, and standardize figures during camera-ready preparation, with attribution policies built-in.
- Tools/workflows: Publisher-side plugins using VLM-as-a-judge for compliance and clarity; AutoFigure-Edit for editorial revisions.
- Assumptions/dependencies: Publisher adoption; community standards for AI-assisted figures; transparent attribution.
- Deep editor integrations for interactive co-design
- Sectors: creative software, design tools
- What: Round-trip editing with Figma/Illustrator/PowerPoint, generating from text, editing vector layers, and syncing changes back to the text spec.
- Tools/workflows: AutoFigure-Edit SDK and plugins; layer-level semantics preservation.
- Assumptions/dependencies: Vendor APIs; stable vector semantics; UI/UX for human-AI co-creation.
- Domain-specific iconography and ontology packs
- Sectors: healthcare, energy, finance, robotics, cybersecurity
- What: Curated symbol sets and relation schemas to boost accuracy/readability for specialized diagrams (e.g., NIST CSF, ISO 27001, clinical ontologies).
- Tools/workflows: Ontology-aware concept extraction; style+icon packs; validation against domain checklists.
- Assumptions/dependencies: Standards alignment; continuous maintenance; expert curation.
- Multilingual and accessibility-first rendering
- Sectors: global education, public sector, consumer tech
- What: Localized labels and auto alt-text/captions to meet accessibility and localization requirements at scale.
- Tools/workflows: Multilingual OCR/verification; typography that supports diverse scripts; screen-reader metadata export.
- Assumptions/dependencies: High-accuracy OCR and NMT for domain terms; locale-specific review.
- Hybrid conceptual + data-driven figure generation
- Sectors: science/engineering, business analytics
- What: Combine conceptual schematics with auto-generated charts from code/notebooks while preserving semantic alignment.
- Tools/workflows: Code-to-figure bridges (e.g., chart specs -> vector layers); consistency checks via structured layouts.
- Assumptions/dependencies: Reliable chart synthesis; reproducibility pipelines; clear distinction between conceptual vs. empirical visuals.
- Regulatory transparency and provenance tooling
- Sectors: government, finance, healthcare
- What: Automatically generated policy/process diagrams with embedded provenance links mapping each visual element to source clauses.
- Tools/workflows: Traceability layers; audit logs of critique-and-refine steps; version-controlled SVG/HTML.
- Assumptions/dependencies: Policy-text parsing robustness; governance policies for AI-assisted documentation.
- Adaptive education: level- and persona-aware visuals
- Sectors: education, corporate training
- What: Generate multiple figure variants matched to learner level (novice/expert) or role (engineer/PM).
- Tools/workflows: Style and density control conditioned on learner profiles; A/B testing with VLM-as-a-judge and human feedback.
- Assumptions/dependencies: Learner modeling; pedagogy research; bias and fairness monitoring.
- Living documentation for enterprises
- Sectors: enterprise IT, compliance, operations
- What: Continuous regeneration of diagrams as documents evolve (SOPs, architecture, controls), with change diffs and alerts.
- Tools/workflows: Doc-change triggers; test-time scaling of critique iterations for high-stakes updates; review queues.
- Assumptions/dependencies: Access controls; compute budget; integration with document repositories.
- Benchmark-driven procurement and model evaluation
- Sectors: enterprise AI, public sector
- What: Use FigureBench as a standard to evaluate vendors’ figure-generation or doc-automation tools (clarity, accuracy, aesthetics).
- Tools/workflows: Internal FigureBench subsets tailored to domain; periodic bake-offs; acceptance thresholds.
- Assumptions/dependencies: Community consensus on metrics; dataset extensions for specific domains; evolving VLM judges.
Notes on Feasibility and Risk
- Dependencies on LLM/VLM backbones: Performance and long-context reasoning rely on access to strong models (as evaluated in the paper). On-prem or private deployments may require fine-tuning or distillation.
- Scope limitations: The method targets conceptual diagrams; quantitative charts/plots are out of scope unless paired with separate chart-generation pipelines.
- Human oversight: Authors emphasize ethical risks (misleading schematics). Adoption should include human verification, transparent attribution, and provenance.
- Data/security: Deployments in regulated sectors must ensure data privacy and licensing compliance for inputs and outputs.
- Compute and cost: Test-time critique-and-refine iterations improve quality but increase latency and cost; budgets and SLAs should be planned accordingly.
Glossary
- Ablation studies: Controlled experiments that remove or vary components to assess their contribution to system performance. "including automated evaluations (§5.1), human evaluation (§5.2), and controlled ablation studies (§5.3)"
- Agentic framework: A system that uses autonomous, goal-directed agents to plan and execute a complex pipeline. "we propose AUTOFIGURE, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text."
- Aesthetic fluency: The smooth, visually pleasing quality of a design that supports readability and professional polish. "balancing these rigid constraints with the aesthetic fluency and readability required for publication standards"
- Blind pairwise comparison: An evaluation where two options are compared without revealing which is the reference, to reduce bias. "(2) blind pairwise comparison."
- Cohen’s kappa (Cohen’s κ): A statistic measuring inter-annotator agreement beyond chance. "with a high Inter-Rater Reliability (IRR, Cohen's κ = 0.91)"
- Conditioning image: An auxiliary, machine-readable visual input that guides a generative model’s output. "converting the unstructured long-form scientific text into a structured, machine-readable conditioning image"
- Critique-and-Refine: An iterative loop where a “critic” analyzes a draft and a “designer” improves it based on feedback. "Critique-and-Refine. This step is the core of our "thinking" process, implementing a self-refinement loop that simulates a dialogue between an AI "designer" and an AI "critic", aiming to find the globally optimal layout through iterative search."
- Decoupled generative paradigm: A design that separates reasoning/planning from rendering to improve control and quality. "a decoupled generative paradigm for high-fidelity scientific illustration generation."
- Diffusion models: Generative models that synthesize images by iteratively denoising noise, conditioned on input text or other signals. "Recent progress in diffusion models (Song et al., 2021) has greatly improved the performance of T2I generation"
- Directed graph: A graph with edges that have a direction, used here to encode relationships among entities in a layout. "encodes a directed graph G_0 = (V_0, E_0)."
- Erase-and-correct: A post-processing strategy that removes rasterized text and replaces it with corrected, crisp vector text. "We improve text legibility via an erase-and-correct process."
- FID (Fréchet Inception Distance): A metric that measures distributional distance between sets of images, often misaligned with schematic correctness. "traditional T2I metrics (e.g., FID (Jayasumana et al., 2024)) are usually misaligned with the requirements for logical and topological correctness."
- Inter-Rater Reliability (IRR): A measure of consistency among multiple annotators’ judgments. "with a high Inter-Rater Reliability (IRR, Cohen's κ = 0.91)"
- Layout planning: The step of organizing elements spatially and structurally before rendering. "Semantic Parsing and Layout Planning"
- Likert scale: An ordinal rating scale commonly used in surveys, here from 1 to 5. "rated on a 1-5 Likert scale for Accuracy, Clarity, and Aesthetics."
- Long-context reasoning: The ability to process and reason over very large texts or contexts. "highlighting the need for robust long-context reasoning."
- Long-context Scientific Illustration Design: The task of generating figures from entire long documents rather than short captions. "specifically targeting Long-context Scientific Illustration Design"
- Markup-based symbolic layout: A structured representation of a figure in a markup language (e.g., SVG/HTML) encoding geometry and relations. "a markup-based symbolic layout S_0 (SVG/HTML)"
- Multimodal generative model: A model that takes inputs from multiple modalities (e.g., text plus layout) to generate images. "These inputs are fed into a multimodal generative model to render an image I_polished"
- OCR (Optical Character Recognition): Technology that extracts text strings and bounding boxes from images. "an OCR engine extracts preliminary strings and bounding boxes"
- Publication-ready: Meeting the quality standards (accuracy and aesthetics) suitable for inclusion in academic venues. "producing publication-ready scientific illustrations."
- Rasterize: Convert vector or symbolic layouts into pixel-based images. "We additionally rasterize S_0 into a layout reference image I_0"
- Reasoned Rendering: A paradigm that first reasons about structure and style, then renders, to balance fidelity and aesthetics. "an agentic framework based on the Reasoned Rendering paradigm."
- Referenced scoring: Evaluation where the model’s output is judged with access to the reference figure and source text. "Referenced scoring, where a VLM is provided with the full text, the ground-truth figure, and the generated image."
- Self-refinement loop: An iterative optimization where the system critiques and improves its own outputs over multiple rounds. "a novel self-refinement loop, simulating a dialogue between an AI designer and critic, iteratively optimizes this blueprint"
- Semantic parsing: Converting unstructured text into a structured representation (entities, relations, layout). "Semantic Parsing and Layout Planning"
- Structural fidelity: The degree to which a generated figure preserves the intended structure and relationships. "they struggle to preserve structural fidelity (Liu et al., 2025)."
- Symbolic blueprint: A high-level, discrete plan of the figure’s elements and relations before rendering. "distilling unstructured text into a structured, symbolic blueprint."
- Symbolic layout: A machine-readable, structured depiction of figure geometry and topology (e.g., in SVG/HTML). "a machine-readable symbolic layout S_0 (e.g., SVG/HTML)"
- Test-time scaling: Improving performance by increasing the number of inference-time refinement iterations. "we conduct a test-time scaling experiment"
- Text-to-image (T2I): Models that generate images directly from textual descriptions. "mainstream end-to-end text-to-image (T2I) models"
- Topological correctness: Accuracy of connectivity and spatial relationships among components in a diagram. "requirements for logi- cal and topological correctness."
- Vector-text overlays: Rendering text as resolution-independent vectors over an image for crisp legibility. "we render T_corr as vector-text overlays at C_ocr on top of I_erased"
- Vision-LLM (VLM): A model that jointly processes visual and textual inputs for understanding or evaluation. "to fine-tune a vision-LLM."
- VLM-as-a-judge paradigm: Using a VLM to evaluate generated images against text and references. "our evaluation protocol leverages the VLM-as-a-judge paradigm"
- Win-Rate: The percentage of times a method’s output is preferred in pairwise comparisons. "Win-Rate is calculated through blind pairwise comparisons against the reference"