ProImage-Bench: Scientific Image Evaluation
- ProImage-Bench is a rubric-based benchmark that rigorously assesses image generation using fine-grained binary checks.
- It covers biology schematics, engineering drawings, and general scientific diagrams with 654 tasks derived from over 10,000 text–image pairs.
- The benchmark enables iterative model improvements through automated scoring, detailed diagnostics, and rubric-informed refinements.
ProImage-Bench is a rubric-based benchmark designed for the rigorous evaluation of professional image generation, with an emphasis on information-dense, scientifically precise illustrations synthesized from technical descriptions. Unlike conventional datasets that reward visual plausibility, ProImage-Bench focuses on specification-faithful renderings across core scientific and technical domains. Its hierarchical rubric machinery quantifies multi-criteria correctness and enables both diagnostic assessment and iterative improvement of generation models (Ni et al., 13 Dec 2025).
1. Target Domains and Dataset Composition
ProImage-Bench encompasses three principal domains:
- Biology Schematics: These include cell or organelle drawings, metabolic pathways, and anatomical cross-sections. Tasks emphasize precise molecular and morphological depiction, such as bilayer membrane arrangement or sequential stages of cell division.
- Engineering/Patent Drawings: Covered images span mechanical assemblies, circuit diagrams, architectural cutaways, and patent figures. Evaluation demands strict spatial relationships, accurate labeling, and adherence to domain conventions such as orthographic views and dimensioning.
- General Scientific Diagrams: This category includes charts, process-flow diagrams, ecosystem posters, and generic scientific illustrations. While less rigid than the others, these images require correctness in data content and layout structure.
A curated dataset of 654 “tasks” was constructed by extracting >10,000 text–image pairs from textbooks, technical reports, patent filings, and scientific websites. After expert filtering for quality and context, the distribution of tasks is as follows:
| Domain | Number of Tasks |
|---|---|
| Biology | 318 |
| Engineering | 232 |
| General | 104 |
2. Rubric Hierarchy and Construction Methodology
Each benchmark task is paired with a rubric that hierarchically decomposes correctness into abstract criteria and fine-grained binary “unit tests.” For any image-generation task $T_i$, its criterion set is $C_i = \{c_{i,1}, \dots, c_{i,m_i}\}$. Each criterion $c_{i,j}$ (e.g., "the phospholipid bilayer is symmetric") further breaks down into binary points
$$c_{i,j} = \{p_{i,j,1}, \dots, p_{i,j,n_{i,j}}\},$$
where the $p_{i,j,k}$ are yes/no questions probing fine aspects (e.g., "Is each leaflet identical?").
Across the entire corpus, the 654 task rubrics comprise:
- $\sum_i m_i$ criteria in total
- $\sum_{i,j} n_{i,j}$ binary checks in total
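This hierarchy maps naturally onto nested records. The following Python sketch is illustrative only; the class and field names (`Task`, `Criterion`, `BinaryCheck`) are assumptions made here for exposition, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class BinaryCheck:
    """A single yes/no unit test, e.g. 'Is each leaflet identical?'."""
    question: str
    passed: Optional[bool] = None      # filled in later by the LMM judge

@dataclass
class Criterion:
    """An abstract criterion, e.g. 'the phospholipid bilayer is symmetric'."""
    description: str
    checks: List[BinaryCheck] = field(default_factory=list)

    @property
    def num_failed(self) -> int:
        # f_{i,j}: number of failed binary points under this criterion
        return sum(1 for c in self.checks if c.passed is False)

@dataclass
class Task:
    """One benchmark task T_i: an instruction plus its hierarchical rubric C_i."""
    domain: str                        # "biology", "engineering", or "general"
    instruction: str                   # detailed image-generation instruction
    criteria: List[Criterion] = field(default_factory=list)
```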
Rubric generation leverages large multimodal models (LMMs, e.g., GPT-4o), combining OCR-extracted labels, figure context (caption, alt text, surrounding paragraphs), and domain-specific prior knowledge. The multistage pipeline proceeds as follows (a minimal code sketch appears after the list):
- Extract labels.
- Use LMMs to distill a detailed image-generation instruction $I_i$ from the extracted labels and figure context.
- Enumerate explicit and implicit criteria via further LMM prompting.
- Expand criteria into binary checklists.
- Append domain-general rubrics (e.g., label legibility).
- Deduplicate and refine prompts for unambiguity via human/LMM review.
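As referenced above, the following sketch mocks the pipeline's control flow. It assumes a hypothetical `call_lmm(prompt, image)` helper wrapping an LMM such as GPT-4o and returning parsed text; prompts, output parsing, and the human/LMM review step are simplified, so treat this as a sketch rather than the authors' implementation.

```python
def build_rubric(image, caption, context, ocr_labels, call_lmm):
    """Sketch of the multistage rubric-construction pipeline.

    `call_lmm(prompt, image)` is a hypothetical helper assumed to return a
    string (for the instruction) or a list of strings (for criteria/checks).
    """
    # Distill a detailed image-generation instruction from labels and context.
    instruction = call_lmm(
        "Write a precise generation instruction for this figure.\n"
        f"Caption: {caption}\nContext: {context}\nLabels: {ocr_labels}",
        image,
    )
    # Enumerate explicit and implicit correctness criteria.
    criteria = call_lmm(
        f"List every criterion a correct rendering must satisfy:\n{instruction}",
        image,
    )
    # Expand each criterion into binary yes/no checks.
    rubric = {
        c: call_lmm(f"Break this criterion into yes/no unit tests: {c}", image)
        for c in criteria
    }
    # Append domain-general rubrics; deduplication and prompt refinement via
    # human/LMM review would follow here.
    rubric["All labels are legible"] = [
        "Is every label readable at the image's native resolution?"
    ]
    return instruction, rubric
```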
3. Automated Judging and Scoring Protocols
The evaluation system relies on fully automated LMM-based judging, with two principal metrics:
Rubric Accuracy:
This global score captures the overall pass rate across binary checks:
$$\text{RubricAcc} = \frac{\sum_{i,j}\left(n_{i,j} - f_{i,j}\right)}{\sum_{i,j} n_{i,j}},$$
where $f_{i,j}$ is the number of failed points for criterion $c_{i,j}$ and $n_{i,j}$ is its total number of binary checks.
Criterion Score:
Penalizes criteria with multiple failed checks exponentially, providing an interpretable per-criterion fidelity measure. Each criterion receives a score that decays exponentially in its number of failed checks (e.g., $s_{i,j} = 2^{-f_{i,j}}$), so a criterion with no failures contributes 1 and every additional failure halves its contribution. Aggregating over all criteria:
$$\text{CriterionScore} = \frac{1}{\sum_i m_i}\sum_{i,j} s_{i,j}.$$
This approach yields scores in $[0,1]$, sharply attenuating the contribution of criteria with compounding failures.
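Given per-criterion failure counts from the judge, both metrics reduce to a few lines of arithmetic. The sketch below assumes the $2^{-f}$ penalty form shown above and summarizes each criterion as a `(num_checks, num_failed)` pair; both are simplifications for exposition.

```python
from typing import Sequence, Tuple

def rubric_accuracy(criteria: Sequence[Tuple[int, int]]) -> float:
    """Overall pass rate across all binary checks."""
    total = sum(n for n, _ in criteria)
    passed = sum(n - f for n, f in criteria)
    return passed / total if total else 0.0

def criterion_score(criteria: Sequence[Tuple[int, int]]) -> float:
    """Mean per-criterion score decaying exponentially in failed checks
    (assumed form s = 2 ** -f; a criterion with no failures scores 1.0)."""
    if not criteria:
        return 0.0
    return sum(2.0 ** -f for _, f in criteria) / len(criteria)

# Three criteria with 0, 1 and 4 failed checks out of 3, 2 and 4 total:
print(rubric_accuracy([(3, 0), (2, 1), (4, 4)]))   # (3 + 1 + 0) / 9 ≈ 0.444
print(criterion_score([(3, 0), (2, 1), (4, 4)]))   # (1 + 0.5 + 0.0625) / 3 ≈ 0.521
```

Note how a single criterion with several failed checks (the third one) drags the criterion score well below the raw pass rate, which is exactly the compounding-failure attenuation described above.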
4. Model Benchmarking and Performance Analysis
Representative text-to-image models were assessed:
| Model | Rubric Accuracy | Criterion Score |
|---|---|---|
| GPT-4o | 0.660 | 0.382 |
| Nano Banana | 0.664 | 0.381 |
| Flux | 0.551 | 0.270 |
| Imagen-3 | 0.577 | 0.287 |
| Seedream | 0.642 | 0.365 |
| Wan2.5 | 0.692 | 0.420 |
| Nano Banana Pro | 0.791 | 0.553 |
Domain-specific performance (Nano Banana Pro):
| Domain | Rubric Accuracy | Criterion Score |
|---|---|---|
| Biology | 0.849 | 0.625 |
| Engineering | 0.708 | 0.434 |
| General | 0.816 | 0.601 |
Results reveal persistent gaps: the best model reaches only 0.791 rubric accuracy and 0.553 criterion score overall. Nano Banana Pro leads the field but still exhibits frequent omissions (missing components or labels), relational errors (e.g., incorrect ordering of biological stages), and stylistic deficiencies (e.g., illegible labels). Engineering images consistently yield the lowest scores, a gap attributed to their demands for dimensional and structural rigor.
Qualitative analysis underscores typical failure cases, such as misplaced anatomical features (e.g., alveolar ducts), unlabeled engineering components (lever, pin), and reversed flow arrows in general diagrams.
5. Rubric-Informed Iterative Refinement Paradigm
The benchmark demonstrates that rubric signals provide actionable supervision in an iterative editing workflow (a minimal loop sketch follows the list):
- An initial image $x_0$ is generated from the task instruction.
- The LMM evaluator identifies failed binary points.
- Failures are consolidated into natural-language edit instructions via LMM.
- An editing model (GPT-4o in “edit” mode) produces a refined image $x_{t+1}$.
- The process iterates, retaining the best-scoring image.
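A minimal sketch of this loop appears below, as referenced above. The helpers `generate`, `judge`, `summarize_failures`, and `edit` are hypothetical stand-ins for the generation model, the LMM evaluator, the failure-consolidation prompt, and the editing model, respectively.

```python
def refine(task, generate, judge, summarize_failures, edit, n_iters=10):
    """Rubric-informed iterative refinement (sketch).

    `judge(image, task)` is assumed to return an object with a `.failed` list
    of failed binary points and a scalar `.score`; all helpers are hypothetical.
    """
    image = generate(task.instruction)            # initial image x_0
    verdict = judge(image, task)                  # LMM evaluator runs the rubric
    best_image, best_score = image, verdict.score
    for _ in range(n_iters):
        if not verdict.failed:                    # every binary check passed
            break
        fix = summarize_failures(verdict.failed)  # consolidate failures into edits
        image = edit(image, fix)                  # editing model produces x_{t+1}
        verdict = judge(image, task)
        if verdict.score > best_score:            # retain the best-scoring image
            best_image, best_score = image, verdict.score
    return best_image, best_score
```

Keeping the best-scoring image rather than the latest one guards against edit steps that regress on checks an earlier iteration had already satisfied.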
Empirical results starting from a GPT-4o baseline:
| Iteration | Rubric Accuracy | Criterion Score |
|---|---|---|
| 0 | 0.653 | 0.388 |
| 1 | 0.783 | 0.518 |
| 2 | 0.820 | 0.579 |
| 3 | 0.847 | 0.629 |
| 4 | 0.863 | 0.658 |
| 5 | 0.874 | 0.675 |
| 10 | 0.903 | 0.735 |
After 10 iterations, rubric accuracy rises to 0.903 and the criterion score to 0.735, up from 0.653 and 0.388 at the baseline. This suggests ProImage-Bench’s fine-grained rubrics are effective not only as diagnostic metrics but also as strong supervision signals for specification-faithful scientific image refinement.
6. Implications and Positioning within Scientific Illustration Evaluation
ProImage-Bench sets a precedent for evaluating professional image generation at scale. Its rubric-based pipeline systematically decomposes correctness into actionable units, establishing a scalable protocol for both benchmarking and iterative model improvement. By targeting biologically and technically demanding domains, it exposes shortfalls in current open-domain generative models and provides a robust mechanism for driving advances in scientific image fidelity.
This suggests that future developments in text-to-image and autoregressive visual models may increasingly rely on similar high-granularity rubric feedback for both evaluation and model update, especially in scientific, engineering, and educational publishing contexts.