
FigureBench: AI-Driven Scientific Illustrations

Updated 6 February 2026
  • FigureBench is a large-scale, curated benchmark for generating accurate, publication-ready scientific illustrations from long-form scientific texts.
  • It evaluates AI systems on deep semantic comprehension, structural planning, and aesthetic rendering across multiple scientific domains.
  • The benchmark combines automated metrics such as FID and CLIPScore with VLM-as-a-judge scoring and expert review to rigorously assess diagram quality and usability.

FigureBench is the first large-scale, systematically curated benchmark specifically designed to evaluate and advance the automatic generation of publication-ready scientific illustrations from long-form scientific texts. The benchmark addresses critical gaps in existing datasets by requiring deep semantic comprehension, structural distillation, and the synthesis of both logically accurate and aesthetically high-quality diagrams. FigureBench enables rigorous quantitative and qualitative assessment of AI-driven illustration systems operating across heterogeneous scientific domains and diagram types, supporting the development of next-generation automatic figure generation tools (Zhu et al., 3 Feb 2026).

1. Motivation and Scope

Scientific illustrations—including schematic diagrams, block diagrams, taxonomy trees, algorithmic workflows, and conceptual roadmaps—are essential for precise communication in research articles, survey papers, technical blogs, and textbooks. Manual preparation is labor-intensive, often requiring days of effort from domain experts due to the need for:

  • Thorough logical parsing of long-form scientific documents (often exceeding 10,000 tokens)
  • Abstraction and mapping of core concepts and relational structures into visual entities and connections
  • Layout optimization that achieves both structural fidelity (accurate connectivity, topological integrity) and professional aesthetic qualities

Existing datasets (e.g., Paper2Fig100k, ACL-Fig, SciCap+) predominantly address figure captioning or style transfer from short captions, and do not directly test the full pipeline of long-context text understanding, advanced structural planning, and high-fidelity rendering. Moreover, generic text-to-image models inadequately preserve logical consistency, while code-generation approaches such as TikZ yield structurally rigid diagrams of limited aesthetic quality. FigureBench explicitly targets these limitations by establishing a benchmark that requires generating custom diagrams adapted to long and complex inputs, with evaluation criteria for both structural and aesthetic adequacy (Zhu et al., 3 Feb 2026).

2. Dataset Composition and Curation

The FigureBench dataset comprises 3,300 high-quality scientific text–figure pairs sampled from four principal domains:

  • Research articles (papers)
  • Survey papers (arXiv survey subset)
  • Technical blogs (e.g., ICLR Blog Track)
  • Textbooks (e.g., OpenStax, under CC BY-4.0 licenses)

The dataset is divided into a development set (3,000 pairs for training and fine-tuning) and a strictly held-out test set (300 pairs for evaluation). The test set includes 200 conceptual figures from peer-reviewed research papers, selected and verified through a pipeline involving GPT-5 suggestion followed by human annotation (Cohen’s κ=0.91), and 100 hand-curated instances from surveys, blogs, and textbooks to maximize structural and stylistic diversity.
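
For reference, Cohen's κ measures inter-annotator agreement corrected for chance. The sketch below shows how such an agreement score is computed with scikit-learn; the labels are illustrative toy data, not FigureBench annotations:

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative binary keep/discard decisions from two annotators.
# These toy labels yield a kappa of about 0.74; FigureBench reports
# kappa = 0.91 on its actual curation annotations.
annotator_a = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
annotator_b = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```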

Diagram categories include:

  • Schematic and block diagrams (including architectures, data-processing flows)
  • Algorithmic and procedural flows
  • Taxonomy and roadmap trees
  • Software-engineering process models
  • Conceptual and methodological frameworks

Quantitative analysis with InternVL-3.5 demonstrates high intrinsic difficulty: the average textual density inside figures exceeds 40%, with each diagram employing approximately six distinct colors and containing five to seven interconnected graphical components (Zhu et al., 3 Feb 2026).

3. Formal Problem Definition and Generative Framework

Let $T = \{t_1, \ldots, t_n\}$ denote the collection of input long-form texts and $F = \{f_1, \ldots, f_n\}$ the corresponding reference figures. The objective is to learn a generator $G : T \rightarrow I_{\mathrm{final}}$ such that the generated image $I_{\mathrm{final}}$ for each $t_i$ faithfully reproduces the structural and stylistic properties of $f_i$.

The leading baseline, AutoFigure, implements a two-stage pipeline:

  1. Semantic Parsing and Layout Planning: From $t_i$, extract a symbolic layout $S^0$ (e.g., SVG/HTML defining nodes, edges, text labels, and positions) alongside a style descriptor $A^0$ (a minimal example of these intermediates follows this list).
  2. Aesthetic Rendering and Text Refinement: Given $(S^0, A^0)$, produce a rasterized image $I_{\mathrm{final}}$ through an erase-and-correct step for crisp, readable text.
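
For concreteness, a symbolic layout and style descriptor might look like the following; the schema is an illustrative assumption, since the paper specifies only that $S^0$ is an SVG/HTML-like structure of nodes, edges, labels, and positions:

```python
# Hypothetical symbolic layout S0 and style descriptor A0.
# Field names are illustrative assumptions, not the paper's schema.
layout_s0 = {
    "nodes": [
        {"id": "enc", "label": "Encoder", "x": 40, "y": 80, "w": 120, "h": 48},
        {"id": "dec", "label": "Decoder", "x": 220, "y": 80, "w": 120, "h": 48},
    ],
    "edges": [
        {"src": "enc", "dst": "dec", "label": "latent z"},
    ],
}
style_a0 = {
    "palette": ["#1f77b4", "#ff7f0e"],
    "font": "Helvetica",
    "stroke_width": 1.5,
}
```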

Layout planning utilizes an iterative "critic-and-refine" mechanism:

$$F^{(i)} = \mathrm{Critic}(S^{\mathrm{best}}, A^{\mathrm{best}})$$

$$(S^{\mathrm{cand}}, A^{\mathrm{cand}}) = \mathrm{Gen}(T_{\mathrm{method}}, F^{(i)})$$

$$\text{if } \mathrm{score}(S^{\mathrm{cand}}, A^{\mathrm{cand}}) > \mathrm{score}(S^{\mathrm{best}}, A^{\mathrm{best}}) \text{ then } (S^{\mathrm{best}}, A^{\mathrm{best}}) \leftarrow (S^{\mathrm{cand}}, A^{\mathrm{cand}})$$

This process iterates until convergence or a fixed computational budget is exhausted, yielding the final symbolic layout and styling parameters (Zhu et al., 3 Feb 2026).
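
Expressed in code, the loop is roughly as follows. This is a minimal sketch assuming placeholder `generate`, `critic`, and `score` callables standing in for the underlying model components, not the released AutoFigure API:

```python
from typing import Callable, Tuple

Layout = dict  # symbolic layout S (e.g., parsed SVG/HTML structure)
Style = dict   # style descriptor A

def critic_and_refine(
    text: str,
    generate: Callable[[str, str], Tuple[Layout, Style]],  # Gen(T_method, feedback)
    critic: Callable[[Layout, Style], str],                # Critic(S, A) -> feedback
    score: Callable[[Layout, Style], float],               # scalar quality score
    budget: int = 10,
) -> Tuple[Layout, Style]:
    """Iteratively refine (S, A); keep a candidate only if it scores higher."""
    # Initial proposal with no feedback (an assumption about initialization).
    s_best, a_best = generate(text, "")
    for _ in range(budget):
        feedback = critic(s_best, a_best)          # F^(i) = Critic(S_best, A_best)
        s_cand, a_cand = generate(text, feedback)  # candidate from feedback
        if score(s_cand, a_cand) > score(s_best, a_best):
            s_best, a_best = s_cand, a_cand        # accept the improvement
    return s_best, a_best
```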

4. Evaluation Protocols and Metrics

Data Splits and Usage

  • Development set (3,000 pairs): For training, hyperparameter optimization, and in-domain model validation.
  • Test set (300 pairs): For final model assessment; explicit prohibition against test set training enables robust, unbiased evaluation. Cross-domain robustness can be empirically analyzed by restricting training to single-domain samples and assessing generalization on out-of-domain types.

Automated Evaluation Metrics

Classic metrics, while supported, exhibit limited alignment with scientific diagram quality:

  • Fréchet Inception Distance (FID): Measures distributional similarity between real/generated images.
  • CLIPScore: Computes cosine similarity between CLIP-encoded text and generated image embeddings.
  • Layout Intersection over Union (IoU): Quantifies overlap between ground-truth and predicted layout masks (a minimal computation sketch follows this list).
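
Of the three, layout IoU is straightforward to compute directly from binary masks. A minimal sketch, assuming both layouts have already been rasterized to boolean arrays of identical shape (the rasterization step is not shown):

```python
import numpy as np

def layout_iou(mask_true: np.ndarray, mask_pred: np.ndarray) -> float:
    """IoU between two boolean layout masks of identical shape."""
    intersection = np.logical_and(mask_true, mask_pred).sum()
    union = np.logical_or(mask_true, mask_pred).sum()
    # Define IoU as 1.0 when both masks are empty (perfect trivial agreement).
    return float(intersection / union) if union > 0 else 1.0
```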

Vision-Language Model (VLM)-as-a-Judge Paradigm

To better assess scientific and communicative fidelity, FigureBench employs VLMs (e.g., Prometheus-Vision) via two main strategies:

  1. Referenced Scoring: Each VLM sees $(t_i, f_i, \tilde{i}_i)$, where $\tilde{i}_i$ denotes the generated illustration, and assigns sub-scores (0–10) on eight criteria grouped into visual design (aesthetic quality, expressiveness, professional finish), communication effectiveness (clarity, logical flow), and content fidelity (accuracy, completeness, appropriateness). The overall score is the arithmetic mean of the eight sub-scores (see the sketch after this list).
  2. Blind Pairwise Comparison: The VLM is presented with $(t_i, \tilde{i}_i, f_i)$ in randomized order and selects the superior illustration according to seven predefined criteria.
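
As an illustration of the referenced-scoring aggregation, the sketch below averages the eight sub-scores into an overall score. The criterion names follow the grouping above; the `judge` callable is a placeholder for whatever VLM is used, and its interface is an assumption rather than the released evaluation script:

```python
from statistics import mean

CRITERIA = [
    # visual design
    "aesthetic_quality", "expressiveness", "professional_finish",
    # communication effectiveness
    "clarity", "logical_flow",
    # content fidelity
    "accuracy", "completeness", "appropriateness",
]

def referenced_score(judge, text: str, reference_fig, generated_fig) -> float:
    """Collect a 0-10 sub-score per criterion from the VLM judge; return the mean.

    `judge` is a hypothetical callable (criterion, text, ref, gen) -> float;
    FigureBench's released scripts may expose a different interface.
    """
    subscores = [judge(c, text, reference_fig, generated_fig) for c in CRITERIA]
    return mean(subscores)  # overall score = arithmetic mean of eight criteria
```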

A parallel expert evaluation—10 first-author domain specialists—uses Likert scales for accuracy, clarity, and aesthetics; holistic ranking; and a binary "would-use-in-paper" decision. This multi-modal assessment approach increases validity and practical relevance (Zhu et al., 3 Feb 2026).

5. Baseline Methods and Empirical Findings

Four method families have been benchmarked on FigureBench across the four primary scientific domains:

| Domain   | GPT-Image | Gemini-HTML | Gemini-SVG | Diagram Agent | AutoFigure (Score / Win-Rate %) |
|----------|-----------|-------------|------------|---------------|---------------------------------|
| Blog     | 4.39      | 5.61        | 4.39       | 1.92          | 7.60 (75.0%)                    |
| Survey   | 4.63      | 4.77        | 4.25       | 2.22          | 6.99 (78.1%)                    |
| Textbook | 5.67      | 6.53        | 6.12       | 2.25          | 8.00 (97.5%)                    |
| Paper    | 3.47      | 6.35        | 5.49       | 2.12          | 7.03 (53.0%)                    |
  • AutoFigure achieves the highest scores and win rates across all domains, demonstrating the efficacy of its decoupled approach, in which structural reasoning is separated from aesthetic rendering.
  • Code-generation baselines (Gemini-HTML/SVG) exhibit reasonable structural accuracy but limited aesthetic appeal.
  • End-to-end text-to-image approaches (GPT-Image) underperform in content fidelity.
  • Diagram Agent lacks adequate layout reasoning and performs poorly.

In the expert study, AutoFigure obtained an 83.3% win rate in direct comparison with other methods, and 66.7% of experts were willing to use its output directly in camera-ready manuscripts (Zhu et al., 3 Feb 2026).

6. Data and Code Availability; Practical Usage

All benchmark resources—including code, pretrained models, annotated data splits, and evaluation scripts—are released at https://github.com/ResearAI/AutoFigure, with a public HuggingFace Space and dataset card for accessible demonstration and download.

A typical workflow encompasses:

  • Repository cloning and dataset acquisition via the HuggingFace API (see the sketch after this list).
  • Training or fine-tuning models on the development set.
  • Model evaluation on the test set using VLM-based judge scripts.
  • Direct comparison with provided baselines, including an end-to-end AutoFigure pipeline.
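
As an illustration of the first step, the following is a minimal sketch, assuming the dataset is published under an ID matching the GitHub organization and uses standard split names; consult the HuggingFace dataset card for the actual identifiers:

```python
from datasets import load_dataset

# Dataset ID and split names are assumptions; check the dataset card.
figurebench = load_dataset("ResearAI/FigureBench")

dev = figurebench["train"]   # 3,000 development pairs (assumed split name)
test = figurebench["test"]   # 300 held-out evaluation pairs (assumed split name)

example = dev[0]
print(example.keys())        # inspect the text and figure fields per pair
```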

The open licensing (open-source and Creative Commons) and comprehensive source attributions facilitate downstream research, integration, and reproducibility.

7. Significance and Outlook

FigureBench provides a foundational resource for advancing long-context-to-illustration systems: a challenging, diverse corpus with precise curation standards and systematic, multi-modal benchmarking. The framework’s explicit focus on both structural fidelity and aesthetic quality establishes a new paradigm for evaluating scientific illustration generation, distinguishing itself from short-caption or style-transfer benchmarks. A plausible implication is that progress on FigureBench will catalyze new AI-assisted workflows, significantly reducing the bottleneck of manual figure drafting in scientific publishing (Zhu et al., 3 Feb 2026).
