ProImage-Bench: Scientific Image Evaluation

Updated 20 December 2025
  • ProImage-Bench is a rubric-based benchmark that rigorously assesses image generation using fine-grained binary checks.
  • It covers biology schematics, engineering drawings, and general scientific diagrams with 654 tasks derived from over 10,000 text–image pairs.
  • The benchmark enables iterative model improvements through automated scoring, detailed diagnostics, and rubric-informed refinements.

ProImage-Bench is a rubric-based benchmark designed for the rigorous evaluation of professional image generation, with an emphasis on information-dense, scientifically precise illustrations synthesized from technical descriptions. Unlike conventional datasets that reward visual plausibility, ProImage-Bench focuses on specification-faithful renderings across core scientific and technical domains. Its hierarchical rubric machinery quantifies multi-criteria correctness and enables both diagnostic assessment and iterative improvement of generation models (Ni et al., 13 Dec 2025).

1. Target Domains and Dataset Composition

ProImage-Bench encompasses three principal domains:

  • Biology Schematics: These include cell or organelle drawings, metabolic pathways, and anatomical cross-sections. Tasks emphasize precise molecular and morphological depiction, such as bilayer membrane arrangement or sequential stages of cell division.
  • Engineering/Patent Drawings: Covered images span mechanical assemblies, circuit diagrams, architectural cutaways, and patent figures. Evaluation demands strict spatial relationships, accurate labeling, and adherence to domain conventions such as orthographic views and dimensioning.
  • General Scientific Diagrams: This category includes charts, process-flow diagrams, ecosystem posters, and generic scientific illustrations. While less rigid than the others, these images require correctness in data content and layout structure.

A curated dataset of 654 “tasks” was constructed by extracting >10,000 text–image pairs from textbooks, technical reports, patent filings, and scientific websites. After expert filtering for quality and context, the distribution of tasks is as follows:

| Domain | Number of Tasks |
|---|---|
| Biology | 318 |
| Engineering | 232 |
| General | 104 |

2. Rubric Hierarchy and Construction Methodology

Each benchmark task is paired with a rubric that hierarchically decomposes correctness into abstract criteria and fine-grained binary “unit tests.” For any image-generation task $t$, its criterion set is $\mathcal{C} = \{c_1, \dots, c_{|\mathcal{C}|}\}$. Each criterion $c_i$ (e.g., "the phospholipid bilayer is symmetric") further breaks down into binary points:

$$c_i \rightarrow \{p_{i,1}, p_{i,2}, \dots, p_{i,|c_i|}\}$$

where the $p_{i,j}$ are yes/no questions probing fine-grained aspects (e.g., "Is each leaflet identical?").
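
A minimal sketch of how this hierarchy might be represented in code is shown below; the class and field names (`BinaryPoint`, `Criterion`, `Task`) are illustrative assumptions, not part of the benchmark's release:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BinaryPoint:
    """A single yes/no check p_{i,j}, e.g. 'Is each leaflet identical?'."""
    question: str
    passed: bool = False  # filled in later by the automated LMM judge

@dataclass
class Criterion:
    """An abstract criterion c_i, e.g. 'the phospholipid bilayer is symmetric'."""
    description: str
    points: List[BinaryPoint] = field(default_factory=list)

    @property
    def failures(self) -> int:
        """e_i: the number of failed binary points under this criterion."""
        return sum(not p.passed for p in self.points)

@dataclass
class Task:
    """One benchmark task t: a generation instruction plus its rubric criteria."""
    instruction: str
    criteria: List[Criterion] = field(default_factory=list)
```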

Across the entire corpus:

  • $|\mathcal{C}| = 6{,}076$ criteria
  • $\sum_i |c_i| = 44{,}131$ binary checks
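
Averaged over the 654 tasks, this corresponds to roughly 9.3 criteria per task and about 7.3 binary checks per criterion, i.e. on the order of 67 checks per image.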

Rubric generation leverages large multimodal models (LMMs, e.g., GPT-4o), combining OCR-extracted labels, figure context (caption, alt text, surrounding paragraphs), and domain-specific prior knowledge. The multistage pipeline proceeds as follows (a schematic sketch in code appears after the list):

  1. Extract labels.
  2. Use LMMs to distill a detailed image-generation instruction $t$.
  3. Enumerate explicit and implicit criteria via further LMM prompting.
  4. Expand criteria into binary checklists.
  5. Append domain-general rubrics (e.g., label legibility).
  6. Deduplicate checks and refine prompts to remove ambiguity via human/LMM review.
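
The following is a schematic sketch of that pipeline, under the assumption that the LMM is wrapped as a function returning parsed lists of strings; `build_rubric`, `lmm`, and `ocr` are illustrative names, not the authors' code:

```python
from typing import Any, Callable, Dict, List

def build_rubric(figure_image: Any, caption: str, context: str,
                 lmm: Callable[[str], List[str]],
                 ocr: Callable[[Any], List[str]]) -> Dict[str, Any]:
    """Schematic rubric construction. `lmm` stands in for a prompted multimodal
    model (e.g. GPT-4o) assumed to return parsed lists of strings; `ocr` stands
    in for an OCR engine returning the text labels found in the figure."""
    # 1. Extract labels present in the source figure.
    labels = ocr(figure_image)

    # 2. Distill a detailed image-generation instruction t from labels and context.
    instruction = lmm(
        f"Write a precise generation instruction for this figure.\n"
        f"Labels: {labels}\nCaption: {caption}\nContext: {context}"
    )[0]

    # 3. Enumerate explicit and implicit correctness criteria.
    criteria = lmm(f"List the criteria a correct rendering must satisfy:\n{instruction}")

    # 4. Expand each criterion into binary yes/no checks.
    checklists = [lmm(f"Turn this criterion into binary checks: {c}") for c in criteria]

    # 5. Append domain-general rubrics (e.g. label legibility).
    checklists.append(["Are all labels legible?"])

    # 6. Deduplication and human/LMM review for unambiguity happen downstream.
    return {"instruction": instruction, "criteria": criteria, "checklists": checklists}
```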

3. Automated Judging and Scoring Protocols

The evaluation system relies on fully automated LMM-based judging, with two principal metrics:

Rubric Accuracy:

This global score captures the overall pass rate across binary checks.

$$\text{RubricAccuracy} = 1 - \frac{\sum_i e_i}{\sum_i |c_i|}$$

where $e_i$ is the number of failed points for criterion $c_i$.

Criterion Score:

This per-criterion score penalizes multiple failed checks exponentially, providing an interpretable measure of how completely each criterion is satisfied:

$$\text{criterionScore}_i = 0.5^{e_i}$$

Aggregating over all criteria:

$$\text{CriterionScore} = \frac{1}{|\mathcal{C}|} \sum_{i=1}^{|\mathcal{C}|} 0.5^{e_i}$$

This approach yields scores in $(0,1]$, sharply attenuating the contribution of criteria with compounding failures.
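
Both metrics reduce to a few lines given the per-criterion failure counts $e_i$ and criterion sizes $|c_i|$; the sketch below is a direct transcription of the formulas above, with illustrative function names:

```python
from typing import Sequence

def rubric_accuracy(failures: Sequence[int], sizes: Sequence[int]) -> float:
    """RubricAccuracy = 1 - (sum_i e_i) / (sum_i |c_i|)."""
    return 1.0 - sum(failures) / sum(sizes)

def criterion_score(failures: Sequence[int]) -> float:
    """CriterionScore = (1 / |C|) * sum_i 0.5 ** e_i."""
    return sum(0.5 ** e for e in failures) / len(failures)

# Toy example: three criteria with 0, 1, and 3 failed checks out of 4, 5, and 6 points.
e, c = [0, 1, 3], [4, 5, 6]
print(rubric_accuracy(e, c))  # 1 - 4/15 ≈ 0.733
print(criterion_score(e))     # (1 + 0.5 + 0.125) / 3 ≈ 0.542
```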

4. Model Benchmarking and Performance Analysis

Representative text-to-image models were assessed:

| Model | Rubric Accuracy | Criterion Score |
|---|---|---|
| GPT-4o | 0.660 | 0.382 |
| Nano Banana | 0.664 | 0.381 |
| Flux | 0.551 | 0.270 |
| Imagen-3 | 0.577 | 0.287 |
| Seedream | 0.642 | 0.365 |
| Wan2.5 | 0.692 | 0.420 |
| Nano Banana Pro | 0.791 | 0.553 |

Domain-specific performance (Nano Banana Pro):

| Domain | Rubric Accuracy | Criterion Score |
|---|---|---|
| Biology | 0.849 | 0.625 |
| Engineering | 0.708 | 0.434 |
| General | 0.816 | 0.601 |

Results reveal persistent gaps: even the best model reaches only 0.791 rubric accuracy and 0.553 criterion score overall. Nano Banana Pro leads the field but still exhibits frequent omissions (missing components and labels), relational errors (e.g., incorrect biological progression), and style deficiencies (e.g., poor label legibility). Engineering images consistently yield the lowest scores, a result attributed to the dimensional and structural rigor they demand.

Qualitative analysis underscores typical failure cases, such as misplaced anatomical features (e.g., alveolar ducts), unlabeled engineering components (lever, pin), and reversed flow arrows in general diagrams.

5. Rubric-Informed Iterative Refinement Paradigm

The benchmark demonstrates that rubric signals provide actionable supervision in an iterative editing workflow (a code sketch follows the list):

  1. Initial image $x_0$ is generated from instruction $t$.
  2. The LMM evaluator identifies failed binary points.
  3. Failures are consolidated into natural-language edit instructions via LMM.
  4. An editing model (GPT-4o in “edit” mode) produces a refined image $x_{k+1}$.
  5. The process iterates, retaining the best-scoring image.
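
A minimal sketch of this loop is given below; `generate`, `judge`, `summarize_failures`, and `edit` are placeholders for the text-to-image model, the LMM evaluator, the failure-consolidation step, and the editing model, not APIs from the paper:

```python
from typing import Any, Callable, List, Tuple

def refine(instruction: str, rubric: Any,
           generate: Callable[[str], Any],
           judge: Callable[[Any, Any], Tuple[float, List[str]]],
           summarize_failures: Callable[[List[str]], str],
           edit: Callable[[Any, str], Any],
           num_iters: int = 10) -> Tuple[Any, float]:
    """Rubric-informed iterative refinement (schematic)."""
    image = generate(instruction)                       # 1. initial image x_0
    best_image, best_score = image, float("-inf")
    for _ in range(num_iters + 1):
        score, failed = judge(image, rubric)            # 2. evaluator flags failed binary points
        if score > best_score:                          # 5. retain the best-scoring image
            best_image, best_score = image, score
        if not failed:
            break                                       # nothing left to fix
        edit_instruction = summarize_failures(failed)   # 3. consolidate failures into edit text
        image = edit(image, edit_instruction)           # 4. editing model produces x_{k+1}
    return best_image, best_score
```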

Empirical results starting from a GPT-4o baseline ($k=0$):

| Iteration | Rubric Accuracy | Criterion Score |
|---|---|---|
| 0 | 0.653 | 0.388 |
| 1 | 0.783 | 0.518 |
| 2 | 0.820 | 0.579 |
| 3 | 0.847 | 0.629 |
| 4 | 0.863 | 0.658 |
| 5 | 0.874 | 0.675 |
| 10 | 0.903 | 0.735 |

After 10 iterations, rubric accuracy rises from 0.653 to 0.903 and the criterion score from 0.388 to 0.735. This suggests ProImage-Bench’s fine-grained rubrics are effective not only as diagnostic metrics but also as strong supervision signals for specification-faithful scientific image refinement.

6. Implications and Positioning within Scientific Illustration Evaluation

ProImage-Bench sets a precedent for evaluating professional image generation at scale. Its rubric-based pipeline systematically decomposes correctness into actionable units, establishing a scalable protocol for both benchmarking and iterative model improvement. By targeting biologically and technically demanding domains, it exposes shortfalls in current open-domain generative models and provides a robust mechanism for driving advances in scientific image fidelity.

This suggests that future developments in text-to-image and autoregressive visual models may increasingly rely on similar high-granularity rubric feedback for both evaluation and model update, especially in scientific, engineering, and educational publishing contexts.
