AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations

Published 3 Feb 2026 in cs.AI, cs.CL, cs.CV, and cs.DL | (2602.03828v1)

Abstract: High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations. The code, dataset, and Hugging Face space are released at https://github.com/ResearAI/AutoFigure.

Summary

  • The paper introduces AutoFigure, a novel agentic framework that automates the creation of publication-ready scientific illustrations from long-form texts.
  • It employs a decoupled pipeline with concept extraction, iterative critique-and-refine, and style-guided rendering to ensure semantic alignment and visual excellence.
  • Quantitative and expert evaluations show strong performance, with win rates up to 97.5% on textbook diagrams and 66.7% of expert authors willing to include its figures in camera-ready papers.

AutoFigure: Automated Generation of Publication-Ready Scientific Illustrations

Introduction and Context

"AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations" (2602.03828) addresses a persistent and impactful bottleneck in scientific communication: the manual creation of high-quality, publication-ready illustrations from long-form scientific texts. While existing automated methods focus primarily on extracting and recombining pre-existing figures or rendering captions into images, there is a notable deficit in automated systems capable of distilling the core methodologies and conceptual workflows from extensive documents (often exceeding 10,000 tokens) and synthesizing novel, semantically aligned, and visually compelling illustrations ab initio.

The work positions itself amid two major research threads. First, the development of datasets and benchmarks for scientific figure generation, which have until now relied upon succinct, structurally impoverished inputs such as captions or figure legends. Second, recent advances in agentic systems for multimodal scientific content synthesis (e.g., PosterAgent, PPTAgent) that perform extraction and reformatting but lack the capacity for true conceptual generation from unstructured text. Both approaches fail to simultaneously achieve structural fidelity, aesthetic appeal, and content completeness—three dimensions critical for figures in peer-reviewed publications.

FigureBench: Foundation for Benchmarking Scientific Illustration Generation

A primary contribution of the paper is FigureBench, the first large-scale benchmark explicitly designed for Long-context Scientific Illustration Design. The dataset comprises 3,300 expert-annotated, high-complexity text-figure pairs drawn from research papers, surveys, technical blogs, and textbooks. Notably, the dataset emphasizes conceptual illustrations over data visualizations and ensures exhaustive grounding of all visual elements in the source text. The construction pipeline utilizes expert raters, achieving high inter-rater reliability (Cohen's κ = 0.91), and is further refined via fine-tuned vision-LLMs to filter and expand the development set.
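
The reported inter-rater reliability is a standard statistic and easy to reproduce; the sketch below uses scikit-learn's cohen_kappa_score on invented rater labels purely to illustrate the measure, not the benchmark's actual annotation data.

```python
# Illustrative only: computing Cohen's kappa for two raters with scikit-learn.
# The labels below are invented; FigureBench's annotations are not shown here.
from sklearn.metrics import cohen_kappa_score

# 1 = "accept text-figure pair into benchmark", 0 = "reject" (hypothetical coding)
rater_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
rater_b = [1, 1, 0, 1, 1, 0, 1, 0, 1, 0]

print(f"Cohen's kappa = {cohen_kappa_score(rater_a, rater_b):.2f}")
```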

Statistical analysis highlights the significant challenge imposed by FigureBench: average text length in the "Paper" category exceeds 12,000 tokens, with high text density (avg. 41.2%) and complex structural characteristics (avg. 6.2 colors, 5.3 components, 6.4 shapes per illustration). This positions FigureBench as a robust evaluation testbed for models required to operate at the intersection of logical reasoning, design, and multimodal generation.

AutoFigure: Agentic, Decoupled Reasoned Rendering

The core technical contribution is the AutoFigure framework, which employs an agentic, decoupled pipeline based on a Reasoned Rendering paradigm. The process is decomposed into distinct stages optimized for semantic reasoning, layout optimization, and high-fidelity rendering:

Stage I: Concept Extraction and Symbolic Layout Generation

  • An LLM agent parses long-form scientific text and distills a methodological summary alongside explicit entities and relations, serialized as a structured, machine-readable symbolic layout (e.g., SVG/HTML).
  • This layout encodes the 2D geometry and topology required for downstream schematic generation (see the sketch below).
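
To make the stage concrete, here is a minimal sketch of concept extraction and symbolic layout serialization, assuming a generic LLM callable (`llm`) and a simple JSON contract; the names, prompt, and SVG conventions are illustrative and not taken from the released AutoFigure code.

```python
# Sketch of Stage I: distill entities/relations from long-form text, then
# serialize them as a symbolic SVG layout. `llm` is an assumed callable that
# returns the model's text completion.
import json
from dataclasses import dataclass

@dataclass
class Node:
    id: str
    label: str
    x: int
    y: int

def extract_graph(llm, paper_text: str) -> dict:
    """Ask the LLM for a machine-readable summary of components and relations."""
    prompt = (
        'Summarize the method below, then list its components and directed '
        'relations as JSON: {"nodes": [{"id", "label", "x", "y"}], '
        '"edges": [["src", "dst"]]}\n\n' + paper_text
    )
    return json.loads(llm(prompt))

def to_svg(nodes: list[Node], edges: list[tuple[str, str]]) -> str:
    """Serialize the directed graph as a symbolic layout (boxes + connectors)."""
    pos = {n.id: n for n in nodes}
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="800" height="400">']
    for a, b in edges:  # connectors first, so boxes render on top
        parts.append(f'<line x1="{pos[a].x}" y1="{pos[a].y}" '
                     f'x2="{pos[b].x}" y2="{pos[b].y}" stroke="black"/>')
    for n in nodes:
        parts.append(f'<rect x="{n.x - 60}" y="{n.y - 20}" width="120" '
                     f'height="40" fill="white" stroke="black"/>')
        parts.append(f'<text x="{n.x}" y="{n.y + 5}" '
                     f'text-anchor="middle">{n.label}</text>')
    parts.append('</svg>')
    return '\n'.join(parts)

# Usage sketch:
#   graph = extract_graph(llm, text)
#   nodes = [Node(**n) for n in graph["nodes"]]
#   svg = to_svg(nodes, [tuple(e) for e in graph["edges"]])
```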

Stage II: Critique-and-Refine Self-Refinement Loop

  • An iterative loop simulates a dialogue between an AI designer and an AI critic, refining layout drafts for logical coherence, visual balance, and freedom from visual artifacts.
  • Critique-derived feedback steers each subsequent generation, with quantitative scoring guiding convergence toward the best layout hypothesis (sketched below).
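
A minimal sketch of this loop, with `designer`, `critic`, and `score` standing in for the underlying LLM/VLM calls and the paper's quantitative scoring function; the fixed iteration budget and best-of-N selection are assumptions.

```python
# Sketch of Stage II: an iterative designer/critic search over layout drafts.
def critique_and_refine(designer, critic, score, layout: str,
                        n_iters: int = 4) -> str:
    best_layout, best_score = layout, score(layout)
    for _ in range(n_iters):
        feedback = critic(best_layout)               # e.g., "arrows overlap box B"
        candidate = designer(best_layout, feedback)  # revised SVG/HTML draft
        s = score(candidate)                         # quantitative layout quality
        if s > best_score:                           # keep the best hypothesis
            best_layout, best_score = candidate, s
    return best_layout
```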

Stage III: Style-Guided Rendering and Text Post-processing

  • Layout and style descriptors are passed to a multimodal generative model, which synthesizes the final image, conditioned to maintain structural constraints.
  • An "erase-and-correct" module remedies the typical failure of T2I models to render crisp text: the rasterized text is erased from the image, an OCR engine extracts the candidate strings and their locations, the strings are verified and corrected against the symbolic layout, and vector-quality text overlays are composited back onto the image (sketched below).
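
The following sketch shows one plausible shape for this module, with the eraser and OCR engine abstracted as callables and fuzzy string matching standing in for the paper's verification step; Pillow's raster text drawing is a stand-in for true vector overlays.

```python
# Sketch of erase-and-correct: locate text via OCR, erase it, verify each
# string against the symbolic layout's labels, and overlay crisp text.
import difflib
from PIL import Image, ImageDraw, ImageFont

def erase_and_correct(image: Image.Image, erase, ocr,
                      layout_labels: list[str]) -> Image.Image:
    boxes = ocr(image)                                # [(text, (x, y, w, h)), ...]
    erased = erase(image, [box for _, box in boxes])  # remove rasterized text
    draw = ImageDraw.Draw(erased)
    font = ImageFont.load_default()
    for text, (x, y, w, h) in boxes:
        # Verify/correct the OCR string against the blueprint's known labels.
        match = difflib.get_close_matches(text, layout_labels, n=1, cutoff=0.6)
        corrected = match[0] if match else text
        draw.text((x, y), corrected, fill="black", font=font)  # crisp overlay
    return erased
```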

This agentic, compositional pipeline is architected for modular extension, allowing direct substitution of reasoning, rendering, and verification backbones.
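
As a hedged illustration of that modularity, the interfaces below use Python Protocols so any reasoning, rendering, or verification backbone can be swapped in; these names are illustrative and not the project's actual API.

```python
# Sketch of a swappable-backbone pipeline contract.
from typing import Protocol

class Reasoner(Protocol):
    def plan(self, text: str) -> str: ...                     # text -> symbolic layout

class Renderer(Protocol):
    def render(self, layout: str, style: str) -> bytes: ...   # layout -> image

class Verifier(Protocol):
    def check(self, image: bytes, layout: str) -> float: ...  # quality score

def run_pipeline(reasoner: Reasoner, renderer: Renderer, verifier: Verifier,
                 text: str, style: str, threshold: float = 0.5) -> bytes:
    layout = reasoner.plan(text)
    image = renderer.render(layout, style)
    if verifier.check(image, layout) < threshold:  # placeholder acceptance gate
        raise ValueError("rendered figure failed verification")
    return image
```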

Evaluation: Quantitative, Qualitative, and Human Expert Assessment

Automated and Human Evaluations

AutoFigure is evaluated via a comprehensive protocol combining "VLM-as-a-Judge" referenced scoring and blind pairwise comparison, supplemented by rigorous human expert review. The VLM-based evaluation decomposes figure quality into visual design (aesthetics, expressiveness, polish), communication effectiveness (clarity, logical flow), and content fidelity (accuracy, completeness, appropriateness); both absolute scores and win rates in head-to-head comparisons are reported for each dimension and document class.
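
A minimal sketch of how such dimension-decomposed judging might be aggregated; the sub-criteria grouping follows the paper's description, while the judge callable, score scale, and equal weighting are assumptions.

```python
# Sketch of VLM-as-a-judge aggregation over the three dimension groups.
DIMENSIONS = {
    "visual_design": ["aesthetic", "expressiveness", "polish"],
    "communication": ["clarity", "logical_flow"],
    "content_fidelity": ["accuracy", "completeness", "appropriateness"],
}

def judge_figure(vlm_judge, text: str, image: bytes) -> dict:
    """Score one figure per criterion, then average within each dimension."""
    scores = {}
    for dim, criteria in DIMENSIONS.items():
        per_criterion = [vlm_judge(text, image, criterion=c) for c in criteria]
        scores[dim] = sum(per_criterion) / len(per_criterion)
    scores["overall"] = sum(scores.values()) / len(DIMENSIONS)
    return scores

def win_rate(wins: int, comparisons: int) -> float:
    """Blind pairwise win rate, reported per document category."""
    return 100.0 * wins / comparisons
```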

Key Results:

  • Consistent outperformance of all baselines (T2I models, SVG/HTML code generation, agentic planners) in both overall scores and win rates across blog, survey, textbook, and—most notably—paper categories.
  • AutoFigure achieves an exceptional 97.5% win rate in the textbook category, and 66.7% of domain experts are willing to use its figures in camera-ready papers, indicating acceptance under real-world publication standards.
  • The ablation studies demonstrate that both the iterative critique-and-refine loop and the "erase-and-correct" rendering add substantial value, with multi-iteration search boosting overall performance from 6.28 to 7.14, and post-processing yielding up to +0.10 in crucial visual quality metrics.

Qualitative analysis identifies that T2I baselines typically fail on logical structure, while code-generation methods produce visually dry, fragmented output. AutoFigure uniquely integrates hierarchical semantic decomposition, element grouping, and role-specific styling, successfully synthesizing novel visual grammars in research diagrams with diverse conventions.

Open-Source Backbone Evaluation

AutoFigure's modularity is reinforced by competitive performance with open-source large-scale VLMs: with Qwen3-VL-235B as the backbone, it achieves an overall score of 7.08, rivaling GPT-5 and surpassing several proprietary alternatives. This validates the framework's reproducibility and accessibility for the broader research community.

Human-LLM Correlation

The evaluation paradigm is further validated through statistical analysis of human-LLM agreement, achieving high Pearson (r = 0.659) and Spearman (ρ = 0.593) correlations on quality and ranking tasks, indicating alignment between human and VLM judgments across all tested dimensions.
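
The agreement statistics are standard and straightforward to reproduce with SciPy; the paired scores below are invented placeholders, not the study's data.

```python
# Illustrative human-vs-VLM agreement check with SciPy.
from scipy.stats import pearsonr, spearmanr

human_scores = [7.5, 6.0, 8.0, 5.5, 9.0, 6.5]   # placeholder expert ratings
vlm_scores   = [7.0, 6.5, 7.5, 5.0, 8.5, 7.0]   # placeholder judge ratings

r, _ = pearsonr(human_scores, vlm_scores)       # linear agreement on quality
rho, _ = spearmanr(human_scores, vlm_scores)    # rank agreement on ordering
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```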

Limitations and Theoretical Implications

Despite demonstrable strengths, AutoFigure still exhibits limitations:

  • Text rendering at fine granularity remains a weakness under dense layouts or small font sizes, with rare character-level errors persisting despite the "erase-and-correct" module.
  • Boundary adherence in semantic-to-visual mapping: The system can drift beyond strict literal content when grappling with abstract, underspecified source text, sometimes sacrificing precise ontological distinctions for visual clarity.
  • Hierarchical complexity bottleneck: Research paper diagrams, with multi-level entity organization and minimal visual precedent, continue to expose the limits of current layout-planning capacity.

These observations highlight open research challenges at the intersection of knowledge-grounded generation, formal logical reasoning, and scalable vector rendering—particularly the need for domain-verifiers and knowledge-augmented constraint enforcement.

Broader Impact and Future Development

AutoFigure represents a significant advancement toward fully automated scientific communication agents, directly addressing the present-day bottleneck impeding autonomous AI-driven research systems: the inability to visually articulate findings at human publication standards. The implications span both practical workflows—democratizing access to publication-grade figures for researchers without design expertise—and theoretical AI capabilities—moving toward agents capable of both discovering and communicating complex phenomena in a self-contained manner.

Possible future directions include:

  • Extension to other domains (life sciences, economics) requiring incorporation of domain-specific visual conventions.
  • Interactive and dynamic schematic generation (e.g., animated, exploratory, or temporally contextualized figures).
  • Incorporation of verification-oriented modules (retrieval-augmented, rule-based domain checkers) to close the fidelity gap in complex, layered diagrams.
  • Enhanced multimodal interpretability tools for AI-generated scientific outputs within AI scientist pipelines.

Conclusion

By coupling FigureBench, a demanding benchmark that enforces high standards on both reasoning and design, with the AutoFigure Reasoned Rendering agentic pipeline, this work demonstrates tangible progress in automating a critical aspect of scientific knowledge dissemination. The framework resolves core trade-offs in existing methods, enabling both semantic alignment and aesthetic achievement, and is validated through rigorous multi-modal, multi-perspective evaluation. While open problems remain in handling the full complexity of scientific visual language and ultra-fine text rendering, these contributions lay a strong, extensible foundation for the next generation of AI-driven scientific research and communication systems (2602.03828).

Explain it Like I'm 14

What is this paper about?

This paper is about teaching AI to make clear, good-looking scientific diagrams from long pieces of scientific writing. The authors built a large dataset called FigureBench and a smart system called AutoFigure that reads complex text, plans a clean layout, and then draws a polished illustration that’s ready to go in a real research paper.

What questions did the researchers ask?

  • Can an AI read long, detailed scientific text and turn it into a diagram that is both accurate and easy to understand?
  • What steps does the AI need to take to keep the diagram correct while also looking professional?
  • How can we fairly test and compare different AI methods that try to make scientific illustrations?
  • Will human experts (the authors of the papers) accept these AI-made figures for publication?

How did they do it?

The dataset: FigureBench

The team created FigureBench, a collection of 3,300 pairs of scientific texts and their matching figures. It includes many types of documents, such as research papers, surveys, blogs, and textbooks. They carefully built a 300-example test set (200 from papers and 100 from other sources) and checked quality with two human reviewers who agreed at a very high rate (Cohen’s kappa = 0.91, which means the reviewers consistently agreed beyond random chance). The remaining 3,000 examples are for development and future training by the community.

Why this matters: If you want to judge how well an AI can make figures, you need lots of real examples that cover different styles and levels of complexity. FigureBench provides that.

The tool: AutoFigure

AutoFigure creates figures in two main stages, using a “think-then-draw” approach:

  1. Stage I — Planning the “blueprint”
    • The AI reads the long text and extracts the main ideas (like the steps in a method, the components of a system, and how they connect).
    • It turns these ideas into a structured “blueprint” of the figure (using formats like SVG/HTML). Think of this like a map that says, “Put a box here, an arrow there, and label this part with that name.”
    • It uses an internal loop with two roles: an AI “designer” and an AI “critic.” The designer proposes a layout; the critic points out problems (like overlapping elements or confusing flow) and suggests improvements. They repeat until the blueprint is balanced and clear.
  2. Stage II — Drawing and fixing text
    • The AI converts the blueprint into a high-quality image with the chosen style (colors, shapes, and overall look).
    • Text in images can be blurry, so AutoFigure uses an “erase-and-correct” trick:
      • It erases the text from the image background.
      • It uses OCR (a tool that reads text in images) to find the labels.
      • It checks and corrects the labels based on the blueprint.
      • It overlays crisp, sharp text back onto the image.
    • This ensures both the structure and the words are correct and readable.

How they judged the results

  • AI-as-a-judge: They used a vision-LLM (an AI that understands both images and text) to score figures for:
    • Visual design (looks and professional finish),
    • Communication (clarity and logical flow),
    • Content fidelity (accuracy and completeness).
  • Blind comparisons: The AI judge saw the text plus two figures (one original, one AI-made) and had to pick which was better without knowing which was which.
  • Human experts: They asked first authors of 21 papers to rate figures for accuracy, clarity, and aesthetics, and to decide which ones they’d actually publish.

What did they find?

Here are the main results from their tests and comparisons:

  • AutoFigure consistently beat other methods across document types (blogs, surveys, textbooks, and papers). It had high overall scores and win-rates in blind comparisons.
  • In textbooks, AutoFigure’s figures won 97.5% of comparisons—showing strong clarity and teaching quality.
  • In papers, AutoFigure still led, with a 53.0% win-rate against the original references in blind tests (a tough category with very detailed content).
  • Human experts judged AutoFigure’s figures far better than other AI systems. Most importantly, 66.7% said they would include AutoFigure’s figures in their final, camera-ready paper.
  • The “designer-and-critic” refinement loop clearly helped: more thinking iterations improved scores.
  • Stronger reasoning models and structured formats (like SVG/HTML) produced better figures than less structured approaches (like building slides piece by piece).

Why this matters: It shows that careful planning (the blueprint) plus thoughtful rendering (the final image) helps an AI produce figures that are both accurate and attractive—good enough for real science communication.

Why does this research matter?

  • Scientific diagrams help people understand complex ideas quickly. Good figures can save researchers and students hours of confusion.
  • Making these by hand takes a lot of time and skill. AutoFigure can speed this up, helping scientists share results more clearly and faster.
  • As AI systems start doing more science tasks, they need to explain their ideas visually. AutoFigure fills a big gap by turning complex findings into understandable diagrams.

What could happen next?

  • Researchers could use AutoFigure in writing tools (like paper editors or slide makers) to auto-generate figures from their drafts.
  • Teachers and students might use it to turn lessons or study notes into clear visual summaries.
  • The released dataset and code can push the field forward, inspiring better tools and new research.
  • The authors also discuss ethics: powerful tools can be misused to make misleading figures. To reduce this risk, they require clear attribution and disclaimers, reminding everyone that AI outputs still need expert checking.

Key terms explained

  • Blueprint (layout): A detailed plan of where boxes, arrows, and labels go before drawing the final image.
  • SVG/HTML: Text-based formats that describe shapes and positions—like a recipe for a drawing.
  • OCR (Optical Character Recognition): A tool that reads text inside images and turns it into editable text.
  • Vision-LLM (AI judge): An AI that can look at pictures, read text, and give informed scores or choices.
  • Cohen’s kappa: A statistic that shows how much two reviewers agree beyond random chance. A high value (like 0.91) means strong agreement.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of unresolved issues, constraints, and open research directions that the paper does not fully address but that future work could concretely tackle:

  • Dataset coverage and scope: The benchmark excludes data-driven charts/plots and may underrepresent multi-panel figures common in publications; extend FigureBench to include quantitative visualizations, multi-panel layouts with subfigure labels, and hybrid figures (schematics + charts).
  • Domain generalization: Most examples appear within CS/AI; assess performance across other disciplines (biology, chemistry, physics, medical imaging, social sciences) where visual conventions and semantics differ.
  • Multilingual capability: The pipeline and OCR/text refinement are evaluated on English; evaluate and adapt for multilingual long-form texts, right-to-left scripts, and mixed-language figures.
  • Mathematical typesetting: The erase-and-correct stage does not address LaTeX/math equations, special symbols, or inline formula rendering; add native equation parsing and vector LaTeX overlays with typographic fidelity.
  • Accessibility and compliance: No analysis of colorblind-safe palettes, font legibility at journal column widths, contrast compliance (e.g., WCAG), or alt-text generation; introduce accessibility checks and auto-corrections.
  • Ground-truth structure for evaluation: Current evaluation relies on VLM judgments; provide ground-truth symbolic layouts/graphs for test figures to enable structural metrics (graph edit distance, topology preservation, alignment scores).
  • VLM-as-a-judge reliability: Quantify and improve the consistency, calibration, and bias of VLM evaluations across different judges; measure correlation with large-scale human expert ratings and inter-rater reliability beyond the small human study.
  • Small human evaluation sample: The expert study covers 10 authors and 21 papers; expand to statistically powered, multi-disciplinary evaluations with detailed error analyses and publication outcomes.
  • Dependency on proprietary models: The pipeline and evaluations rely on closed-source models (e.g., GPT-5, Gemini-2.5, Claude-4.1); reproduce results using open-source backbones to assess replicability and reduce evaluation circularity.
  • Curation bias in FigureBench: GPT-5 was used to select representative figures and a VLM filter built from the curated set was used to expand the dev set; quantify potential selection/model biases and validate with independent human-only curation.
  • Robustness to imperfect texts: The dataset emphasizes figures where “each key visual element is explicitly described”; evaluate performance on realistic papers with implicit, incomplete, or noisy descriptions and conflicting statements.
  • Entity-to-icon semantics: The mapping from extracted entities/relations to visual metaphors/icons is not formally specified or validated; develop taxonomies, mapping rules, and semantic consistency checks for iconography.
  • Convergence guarantees for critique-and-refine: The iterative loop stops at fixed N iterations; study convergence behavior, stopping criteria, and diminishing returns, and propose principled optimization objectives for the scoring function q.
  • Risk of “judge overfitting”: The refinement loop optimizes using VLM feedback that resembles the evaluation protocol; test with disjoint judge models and adversarial evaluation to rule out optimization-to-the-metric artifacts.
  • Style adherence and diversity: The paper presents mostly a unified default style; systematically evaluate style controllability, adherence to user-specified style guides (e.g., ACM/IEEE), and diversity across multiple aesthetics.
  • Vector end-to-end outputs: The final rendering produces raster backgrounds with vector text overlays; investigate fully vector outputs (including shapes/icons) for editability, scalability, and print fidelity (e.g., SVG/TikZ end-to-end).
  • Support for complex layouts: Assess performance on deeply nested hierarchies, long pipelines (>20 components), densely annotated diagrams, and figures with layered semantics (e.g., zoomed insets, callouts, legends).
  • Integration of data visualizations: Extend the method to automatically generate charts/plots from referenced datasets or tables and ensure semantic consistency between schematics and quantitative visuals.
  • Error analysis of text refinement: Provide a breakdown of OCR/verification failures (e.g., bounding box mismatches, misalignments, font substitutions, hyphenation issues) and quantify improvements from the erase-and-correct module.
  • Cost, latency, and scalability: The efficiency analysis is deferred to the appendix; report detailed compute profiles (time per figure, memory, model calls), cost sensitivity to iteration count, and throughput in realistic workflows.
  • Editing and iteration in practice: AutoFigure-Edit is mentioned but not evaluated; benchmark interactive editing workflows (icon replacement, re-layout, text edits) and measure human-in-the-loop productivity gains.
  • Adversarial and safety robustness: Evaluate susceptibility to generating misleading-but-plausible diagrams from adversarial or ambiguous inputs; add factual consistency checks, provenance tracking, and usage constraints.
  • Journal/publisher compliance: Test alignment with specific venue guidelines (font sizes, line weights, color use, resolution, captioning standards) and provide automatic validation/fix-ups prior to camera-ready submission.
  • Reproducibility of prompts and schemas: Although prompts are in the appendix, assess robustness to prompt phrasing changes, release standardized I/O schemas, and publish ablations on prompt sensitivity.

Practical Applications

Immediate Applications

The following use cases can be deployed with the paper’s released codebase, dataset, and demo (AutoFigure, AutoFigure-Edit, FigureBench), leveraging the Reasoned Rendering pipeline (semantic parsing + critique-and-refine + aesthetic rendering with erase-and-correct).

  • Academic publishing: rapid method-figure drafting
    • Sectors: academia, publishing, software (tools)
    • What: Generate publication-ready conceptual schematics from method/results sections to shorten authoring cycles.
    • Tools/workflows: AutoFigure integrated into LaTeX/Overleaf or Word workflows; “AutoFigure-Edit” for post-tweaks; export SVG/HTML for journals.
    • Assumptions/dependencies: Works best for conceptual (not data-plot) figures; requires long-form, well-structured text; human verification recommended (ethical note from paper).
  • Grant proposals and peer reviews: concise mechanism diagrams
    • Sectors: academia, funding agencies
    • What: Auto-generate pipeline overviews for proposals; reviewers can request/produce quick visual summaries of complex methods.
    • Tools/workflows: Document editor plugins; batch generation of figures from proposal drafts; VLM-as-a-judge scoring for quality checks.
    • Assumptions/dependencies: Long-context LLM availability; confidentiality safeguards for sensitive submissions.
  • Technical documentation and software architecture diagrams
    • Sectors: software, DevOps, cybersecurity
    • What: Convert design documents/ADRs/runbooks into system diagrams (microservices, data flows, threat models).
    • Tools/workflows: GitHub Actions/CI step to render figures from markdown specs; wiki/CMS plugin; SVG output for versioning.
    • Assumptions/dependencies: Access to internal docs; accurate entity/relation extraction from prose; security reviews for on-prem deployment.
  • Manufacturing and operations: SOP/process schematics
    • Sectors: manufacturing, logistics, robotics
    • What: Transform standard operating procedures and maintenance manuals into step-by-step flow diagrams for training and audits.
    • Tools/workflows: LMS integration; AutoFigure-Edit for icon/text adjustments; printable SVG/PDF output.
    • Assumptions/dependencies: Domain terminology mapping; human-in-the-loop validation for safety-critical steps.
  • Healthcare communication: clinical pathway and patient education visuals
    • Sectors: healthcare, life sciences
    • What: Create clear care-pathway, triage, or procedural diagrams from protocols and patient-facing guides.
    • Tools/workflows: EHR/clinical knowledge-base export to long-form text; rendering with erase-and-correct for crisp medical terminology.
    • Assumptions/dependencies: Medical review for correctness; HIPAA-compliant deployment; current scope excludes quantitative charts.
  • Education content creation: textbook- and MOOC-style diagrams
    • Sectors: education, edtech
    • What: Generate instructional figures from lecture notes/chapters (taxonomies, pipelines, timelines) in consistent styles.
    • Tools/workflows: LMS or CMS plugin; style presets (e.g., “textbook” vs. “blog”) to align with course branding.
    • Assumptions/dependencies: Teacher approval; alignment to pedagogy; multilingual labels may need manual QA.
  • Science communication and media: blog/tech explainer visuals
    • Sectors: media, technical marketing
    • What: Produce visually polished, accurate schematics from long-form explainers and whitepapers.
    • Tools/workflows: CMS extension; editorial style sheets; VLM-as-a-judge for pre-publication QA.
    • Assumptions/dependencies: Editorial standards; brand color/style libraries; careful claims review to avoid miscommunication.
  • Compliance and policy briefs: regulatory process flowcharts
    • Sectors: public sector, finance, energy
    • What: Convert regulations/standards into flow diagrams showing decision points and obligations.
    • Tools/workflows: Document-to-diagram batch pipelines; SVG for collaborative review; traceability notes embedded in layers.
    • Assumptions/dependencies: Legal review; reliance on text fidelity; complex cross-references may require iterative refinement.
  • Patent drafting support: conceptual drawings from specifications
    • Sectors: legal/IP, R&D
    • What: Draft patent figures illustrating system components and interactions from specification text.
    • Tools/workflows: Patent-counsel workflow plugin; export to compliant vector formats for patent office submissions.
    • Assumptions/dependencies: Jurisdiction-specific drawing rules; attorney oversight; conceptual scope (not detailed mechanical CAD).
  • Figure quality QA and remediation for existing documents
    • Sectors: publishing, enterprise documentation
    • What: Use the VLM-as-a-judge evaluation protocol to score clarity/accuracy/aesthetics of figures; regenerate or refine low-scoring ones.
    • Tools/workflows: FigureBench-based automated scoring; AutoFigure regeneration with critique-and-refine; text legibility fixes via erase-and-correct.
    • Assumptions/dependencies: Availability of underlying text; VLM-judge alignment with human preferences; maintaining provenance.

Long-Term Applications

These use cases require further research, scaling, or ecosystem adoption (e.g., tighter editor integrations, domain ontologies, or regulatory acceptance).

  • End-to-end “AI Scientist” visualization module
    • Sectors: AI research, R&D automation
    • What: Seamless visualization of automatically discovered methods and results to complement AI-generated manuscripts.
    • Tools/workflows: Direct coupling of AutoFigure with autonomous research agents; continuous figure regeneration as hypotheses evolve.
    • Assumptions/dependencies: Reliable grounding from machine-generated texts; provenance tracking; strict human oversight.
  • Journal and conference submission pipelines with auto-figure checks
    • Sectors: academic publishing
    • What: Integrated modules that generate, validate, and standardize figures during camera-ready preparation, with attribution policies built-in.
    • Tools/workflows: Publisher-side plugins using VLM-as-a-judge for compliance and clarity; AutoFigure-Edit for editorial revisions.
    • Assumptions/dependencies: Publisher adoption; community standards for AI-assisted figures; transparent attribution.
  • Deep editor integrations for interactive co-design
    • Sectors: creative software, design tools
    • What: Round-trip editing with Figma/Illustrator/PowerPoint—generate from text, then edit vector layers and sync back to text spec.
    • Tools/workflows: AutoFigure-Edit SDK and plugins; layer-level semantics preservation.
    • Assumptions/dependencies: Vendor APIs; stable vector semantics; UI/UX for human-AI co-creation.
  • Domain-specific iconography and ontology packs
    • Sectors: healthcare, energy, finance, robotics, cybersecurity
    • What: Curated symbol sets and relation schemas to boost accuracy/readability for specialized diagrams (e.g., NIST CSF, ISO 27001, clinical ontologies).
    • Tools/workflows: Ontology-aware concept extraction; style+icon packs; validation against domain checklists.
    • Assumptions/dependencies: Standards alignment; continuous maintenance; expert curation.
  • Multilingual and accessibility-first rendering
    • Sectors: global education, public sector, consumer tech
    • What: Localized labels and auto alt-text/captions to meet accessibility and localization requirements at scale.
    • Tools/workflows: Multilingual OCR/verification; typography that supports diverse scripts; screen-reader metadata export.
    • Assumptions/dependencies: High-accuracy OCR and NMT for domain terms; locale-specific review.
  • Hybrid conceptual + data-driven figure generation
    • Sectors: science/engineering, business analytics
    • What: Combine conceptual schematics with auto-generated charts from code/notebooks while preserving semantic alignment.
    • Tools/workflows: Code-to-figure bridges (e.g., chart specs -> vector layers); consistency checks via structured layouts.
    • Assumptions/dependencies: Reliable chart synthesis; reproducibility pipelines; clear distinction between conceptual vs. empirical visuals.
  • Regulatory transparency and provenance tooling
    • Sectors: government, finance, healthcare
    • What: Automatically generated policy/process diagrams with embedded provenance links mapping each visual element to source clauses.
    • Tools/workflows: Traceability layers; audit logs of critique-and-refine steps; version-controlled SVG/HTML.
    • Assumptions/dependencies: Policy-text parsing robustness; governance policies for AI-assisted documentation.
  • Adaptive education: level- and persona-aware visuals
    • Sectors: education, corporate training
    • What: Generate multiple figure variants matched to learner level (novice/expert) or role (engineer/PM).
    • Tools/workflows: Style and density control conditioned on learner profiles; A/B testing with VLM-as-a-judge and human feedback.
    • Assumptions/dependencies: Learner modeling; pedagogy research; bias and fairness monitoring.
  • Living documentation for enterprises
    • Sectors: enterprise IT, compliance, operations
    • What: Continuous regeneration of diagrams as documents evolve (SOPs, architecture, controls), with change diffs and alerts.
    • Tools/workflows: Doc-change triggers; test-time scaling of critique iterations for high-stakes updates; review queues.
    • Assumptions/dependencies: Access controls; compute budget; integration with document repositories.
  • Benchmark-driven procurement and model evaluation
    • Sectors: enterprise AI, public sector
    • What: Use FigureBench as a standard to evaluate vendors’ figure-generation or doc-automation tools (clarity, accuracy, aesthetics).
    • Tools/workflows: Internal FigureBench subsets tailored to domain; periodic bake-offs; acceptance thresholds.
    • Assumptions/dependencies: Community consensus on metrics; dataset extensions for specific domains; evolving VLM judges.

Notes on Feasibility and Risk

  • Dependencies on LLM/VLM backbones: Performance and long-context reasoning rely on access to strong models (as evaluated in the paper). On-prem or private deployments may require fine-tuning or distillation.
  • Scope limitations: The method targets conceptual diagrams; quantitative charts/plots are out of scope unless paired with separate chart-generation pipelines.
  • Human oversight: Authors emphasize ethical risks (misleading schematics). Adoption should include human verification, transparent attribution, and provenance.
  • Data/security: Deployments in regulated sectors must ensure data privacy and licensing compliance for inputs and outputs.
  • Compute and cost: Test-time critique-and-refine iterations improve quality but increase latency and cost; budgets and SLAs should be planned accordingly.

Glossary

  • Ablation studies: Controlled experiments that remove or vary components to assess their contribution to system performance. "including automated evaluations (§5.1), human evaluation (§5.2), and controlled ablation studies (§5.3)"
  • Agentic framework: A system that uses autonomous, goal-directed agents to plan and execute a complex pipeline. "we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text."
  • Aesthetic fluency: The smooth, visually pleasing quality of a design that supports readability and professional polish. "balancing these rigid constraints with the aesthetic fluency and readability required for publication standards"
  • Blind pairwise comparison: An evaluation where two options are compared without revealing which is the reference, to reduce bias. "(2) blind pairwise comparison."
  • Cohen’s kappa (Cohen’s κ): A statistic measuring inter-annotator agreement beyond chance. "with a high Inter-Rater Reliability (IRR, Cohen's κ = 0.91)"
  • Conditioning image: An auxiliary, machine-readable visual input that guides a generative model’s output. "converting the unstructured long-form scientific text into a structured, machine-readable conditioning image"
  • Critique-and-Refine: An iterative loop where a “critic” analyzes a draft and a “designer” improves it based on feedback. "Critique-and-Refine. This step is the core of our "thinking" process, implementing a self-refinement loop that simulates a dialogue between an AI "designer" and an AI "critic", aiming to find the globally optimal layout through iterative search."
  • Decoupled generative paradigm: A design that separates reasoning/planning from rendering to improve control and quality. "a decoupled generative paradigm for high-fidelity scientific illustration generation."
  • Diffusion models: Generative models that synthesize images by iteratively denoising noise, conditioned on input text or other signals. "Recent progress in diffusion models (Song et al., 2021) have greatly improved the performance of T2I generation"
  • Directed graph: A graph with edges that have a direction, used here to encode relationships among entities in a layout. "encodes a directed graph Go = (Vo, Eo)."
  • Erase-and-correct: A post-processing strategy that removes rasterized text and replaces it with corrected, crisp vector text. "We improve text legibility via an erase-and-correct process."
  • FID (Fréchet Inception Distance): A metric that measures distributional distance between sets of images, often misaligned with schematic correctness. "traditional T2I metrics (e.g., FID (Jayasumana et al., 2024)) are usually misaligned with the requirements for logical and topological correctness."
  • Inter-Rater Reliability (IRR): A measure of consistency among multiple annotators’ judgments. "with a high Inter-Rater Reliability (IRR, Cohen's κ = 0.91)"
  • Layout planning: The step of organizing elements spatially and structurally before rendering. "Semantic Parsing and Layout Planning"
  • Likert scale: An ordinal rating scale commonly used in surveys, here from 1 to 5. "rated on a 1-5 Likert scale for Accuracy, Clarity, and Aesthetics."
  • Long-context reasoning: The ability to process and reason over very large texts or contexts. "highlighting the need for robust long-context reasoning."
  • Long-context Scientific Illustration Design: The task of generating figures from entire long documents rather than short captions. "specifically targeting Long-context Scientific Illustration Design"
  • Markup-based symbolic layout: A structured representation of a figure in a markup language (e.g., SVG/HTML) encoding geometry and relations. "a markup-based symbolic layout S0 (SVG/HTML)"
  • Multimodal generative model: A model that takes inputs from multiple modalities (e.g., text plus layout) to generate images. "These inputs are fed into a multimodal generative model to render an image Ipolished"
  • OCR (Optical Character Recognition): Technology that extracts text strings and bounding boxes from images. "an OCR engine extracts preliminary strings and bounding boxes"
  • Publication-ready: Meeting the quality standards (accuracy and aesthetics) suitable for inclusion in academic venues. "producing publication-ready scientific illustrations."
  • Rasterize: Convert vector or symbolic layouts into pixel-based images. "We additionally rasterize S0 into a layout reference image I0"
  • Reasoned Rendering: A paradigm that first reasons about structure and style, then renders, to balance fidelity and aesthetics. "an agentic framework based on the Reasoned Rendering paradigm."
  • Referenced scoring: Evaluation where the model’s output is judged with access to the reference figure and source text. "Referenced scoring, where a VLM is provided with the full text, the ground-truth figure, and the generated image."
  • Self-refinement loop: An iterative optimization where the system critiques and improves its own outputs over multiple rounds. "a novel self-refinement loop, simulating a dialogue between an AI designer and critic, iteratively optimizes this blueprint"
  • Semantic parsing: Converting unstructured text into a structured representation (entities, relations, layout). "Semantic Parsing and Layout Planning"
  • Structural fidelity: The degree to which a generated figure preserves the intended structure and relationships. "they struggle to preserve structural fidelity (Liu et al., 2025)."
  • Symbolic blueprint: A high-level, discrete plan of the figure’s elements and relations before rendering. "distilling unstructured text into a structured, symbolic blueprint."
  • Symbolic layout: A machine-readable, structured depiction of figure geometry and topology (e.g., in SVG/HTML). "a machine-readable symbolic layout S0 (e.g., SVG/HTML)"
  • Test-time scaling: Improving performance by increasing the number of inference-time refinement iterations. "we conduct a test-time scaling experiment"
  • Text-to-image (T2I): Models that generate images directly from textual descriptions. "mainstream end-to-end text-to-image (T2I) models"
  • Topological correctness: Accuracy of connectivity and spatial relationships among components in a diagram. "requirements for logical and topological correctness."
  • Vector-text overlays: Rendering text as resolution-independent vectors over an image for crisp legibility. "we render Tcorr as vector-text overlays at Cocr on top of Ierased"
  • Vision-LLM (VLM): A model that jointly processes visual and textual inputs for understanding or evaluation. "to fine-tune a vision-LLM."
  • VLM-as-a-judge paradigm: Using a VLM to evaluate generated images against text and references. "our evaluation protocol leverages the VLM-as-a-judge paradigm"
  • Win-Rate: The percentage of times a method’s output is preferred in pairwise comparisons. "Win-Rate is calculated through blind pairwise comparisons against the reference"
