SciGenBench: Scientific Image Synthesis Benchmark
- SciGenBench is a purpose-built benchmark that evaluates scientific image synthesis by ensuring generated images contain precise, logically constrained information.
- The framework employs atomic, visually grounded quizzes and multi-dimensional metrics to assess information utility and logical validity across diverse scientific disciplines.
- It compares generation paradigms—pixel-based methods, programmatic synthesis, and the ImgCoder workflow—to highlight trade-offs between visual expressiveness and structural precision.
SciGenBench is a purpose-built benchmark and evaluation framework for scientific image synthesis in multimodal reasoning. It is designed to test whether generated images not only appear plausible but also encode the precise, logically constrained information required for scientific problem solving. The framework is motivated by a persistent visual–logic divergence in existing text-to-image systems: outputs may be visually appealing yet scientifically incorrect, so small errors in structure, geometry, topology, or domain-specific constraints can invalidate the intended meaning. Within the broader study of scientific image synthesis, SciGenBench shifts evaluation away from aesthetics and coarse semantic alignment toward information utility and logical validity, and it is used to compare direct pixel-based generation, programmatic synthesis, and the logic-driven ImgCoder workflow (Lin et al., 17 Jan 2026).
1. Motivation, definition, and scope
SciGenBench addresses a failure mode specific to scientific visualization. In scientific domains, diagrams, plots, tables, molecular structures, field schematics, and experimental setups are not interchangeable visual illustrations; they are formal carriers of quantitative, relational, and structural information. Misaligned axes, incorrect circuit connectivity, invalid chemical bonds, or geometrically inconsistent annotations can render an image unusable for reasoning even when it remains perceptually convincing.
The benchmark therefore asks two coupled questions. First, does a generated image faithfully encode the exact information that a downstream solver needs? Second, does the image conform to the relevant domain axioms? This reorients evaluation from generic image fidelity toward scientifically grounded correctness.
Its scope is broad but structured. SciGenBench uses a hierarchical taxonomy spanning five subjects—Mathematics, Physics, Chemistry, Biology, and Universal—and twenty-five image types. The resulting coverage includes mathematically constrained figures such as plane and solid geometry, physics-specific schematics such as circuits and optical rays, chemistry-specific structures such as skeletal formulas and orbital depictions, biology-specific diagrams such as genetics and molecular processes, and cross-domain forms such as plots, tables, grids, graph or flow diagrams, and experimental setups. This breadth is intended to stress domains where logic and structure dominate visual plausibility.
2. Benchmark composition and task construction
SciGenBench comprises approximately 1.4K problems derived primarily from verified scientific text, including MegaScience and WebInstruct-verified sources, filtered for visualizability and correctness. It also includes a real-image reference subset, SciGenBench-SeePhys, to support comparison against authentic scientific visuals (Lin et al., 17 Jan 2026).
The benchmark is organized by a two-level “Subject → Image Type” taxonomy. Mathematics includes plane geometric, solid geometric, analytic geometry, and set/probability. Physics includes mechanical, field diagram, waveform, optical ray, astronomical, circuit, and thermodynamic. Chemistry includes molecular structure, electron config, reaction scheme, crystal structure, spectra, and orbital/quantum. Biology includes cell diagram, ecological, genetics, and molecular process. Universal includes plot/chart, graph/flow, table/grid, and experimental setup.
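As an illustration only, the two-level taxonomy listed above can be written down as a small lookup table. The sketch below transcribes the subject and image-type names from the text; the variable and function names (`TAXONOMY`, `subject_of`, `total_types`) are hypothetical and are not taken from the benchmark's code.

```python
# Two-level "Subject -> Image Type" taxonomy, transcribed from the text above.
# The dictionary layout and helper names are illustrative assumptions.
TAXONOMY = {
    "Mathematics": ["plane geometric", "solid geometric", "analytic geometry",
                    "set/probability"],
    "Physics": ["mechanical", "field diagram", "waveform", "optical ray",
                "astronomical", "circuit", "thermodynamic"],
    "Chemistry": ["molecular structure", "electron config", "reaction scheme",
                  "crystal structure", "spectra", "orbital/quantum"],
    "Biology": ["cell diagram", "ecological", "genetics", "molecular process"],
    "Universal": ["plot/chart", "graph/flow", "table/grid", "experimental setup"],
}


def subject_of(image_type: str) -> str:
    """Return the subject that owns a given image type."""
    for subject, image_types in TAXONOMY.items():
        if image_type in image_types:
            return subject
    raise KeyError(f"unknown image type: {image_type}")


# Sanity check against the stated coverage: five subjects, twenty-five types.
total_types = sum(len(image_types) for image_types in TAXONOMY.values())
print(len(TAXONOMY), total_types)  # prints: 5 25
```

Keeping the taxonomy in one structure makes it straightforward to check the stated coverage and to group results by subject later.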
A central design choice is the transformation of each text instruction into atomic, visually grounded quizzes. This mechanism is intended to ensure that the image is indispensable for solving the associated task. Construction proceeds through fact extraction, blind filtration, density-based selection, and expert validation. In fact extraction, the instruction is decomposed into structured atomic facts and values such as object counts, topology, labeled parameters, directions, and spatial relations. In blind filtration, a blind solver without image access attempts each quiz in multiple trials; questions that can be answered without the image are discarded. This removes pseudo-multimodal items and enforces true visual dependence. Instances with richer visual checklists are then prioritized and validated by expert annotators.
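The blind-filtration step described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: the `Quiz` type, the `blind_filter` function, the exact-match comparison, the number of trials, and the rule that a single blind success is enough to discard a quiz are all assumptions, and the toy solver stands in for a real text-only model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Quiz:
    question: str
    answer: str


def blind_filter(quizzes: List[Quiz],
                 blind_solver: Callable[[str], str],
                 trials: int = 3) -> List[Quiz]:
    """Keep only quizzes that the image-free solver never answers correctly.

    Assumption: one correct blind answer in any trial is enough to discard
    the quiz; the source only says that multiple trials are used.
    """
    kept = []
    for quiz in quizzes:
        solved_blind = any(
            blind_solver(quiz.question).strip().lower() == quiz.answer.strip().lower()
            for _ in range(trials)
        )
        if not solved_blind:
            kept.append(quiz)
    return kept


def toy_blind_solver(question: str) -> str:
    """Hypothetical stand-in for a text-only model with no image access."""
    return "4" if "2 + 2" in question else "unknown"


quizzes = [
    Quiz("How many resistors are in the circuit?", "3"),  # needs the image
    Quiz("What is 2 + 2?", "4"),                          # answerable blind
]
kept = blind_filter(quizzes, toy_blind_solver)
print([q.question for q in kept])
```

In this toy run only the resistor-count quiz survives, because the arithmetic question can be answered without looking at any image.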
Canonical prompt types illustrate the benchmark’s granularity. A plot/chart prompt that asks for a given function to be plotted on a stated interval is expected to show correct axes, a labeled domain, the correct curve shape, correct intercepts or extrema, and tick marks and units. A circuit prompt requires exact series topology, correct component symbols, labeled values, consistent current direction arrows, no extra components, and non-floating connections. Geometry prompts require valid point and segment layout, correctly labeled lengths and angles, and adherence to constraints such as perpendicularity or tangency. Molecular structure prompts require correct atom counts, bond orders, valency satisfaction, and exact ring or branch topology. Optical ray prompts require plausible optics in addition to correct symbolization and focus placement.
3. Evaluation methodology and metrics
SciGenBench evaluates scientific image synthesis along two complementary axes: information utility and logical validity. The first measures whether an image contains the facts needed to answer its associated atomic quizzes. The second measures whether the image is structurally and scientifically well-formed.
Information utility is operationalized through inverse validation. A strong VQA solver must answer all quiz questions associated with an image correctly. The inverse validation rate is defined as
$$R_{\text{inv}} = \frac{1}{|\mathcal{D}|} \sum_{I \in \mathcal{D}} \prod_{q \in \mathcal{Q}_I} c(I, q),$$
where $\mathcal{Q}_I$ is the set of quizzes for image $I$, $c(I, q) \in \{0, 1\}$ is a VQA correctness indicator, and $\mathcal{D}$ is the evaluation set. This criterion is intentionally strict: partial preservation of scientific content does not count as success.
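The all-quizzes-correct criterion can be made concrete with a short sketch. It assumes that per-quiz correctness flags have already been produced by some VQA solver; the function name, the input layout, and the choice to count an image with no quizzes as a failure are illustrative assumptions, not part of the benchmark.

```python
from typing import Dict, List


def inverse_validation_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of images for which every associated quiz was answered correctly.

    `results` maps an image id to the per-quiz correctness flags.
    Assumption: an image with an empty quiz list counts as a failure.
    """
    if not results:
        raise ValueError("evaluation set is empty")
    passed = sum(1 for flags in results.values() if flags and all(flags))
    return passed / len(results)


# Hypothetical correctness flags for four generated images.
results = {
    "img_a": [True, True, True],
    "img_b": [True, False, True],  # one wrong answer fails the whole image
    "img_c": [True],
    "img_d": [False, False],
}
rate = inverse_validation_rate(results)
print(rate)  # prints: 0.5
```

In this example five of the nine individual quizzes are answered correctly, yet only two of the four images pass, which is what makes the measure strict.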
Logical validity is measured through a multi-dimensional LMM-as-Judge rubric using Gemini-3-Flash, with scores from 0 to 2 on five dimensions: Correctness/Fidelity, Layout/Precision, Readability/Occlusion, Scientific Plausibility, and Expressiveness/Richness. Correctness/Fidelity captures strict prompt adherence, including compositional errors and omissions. Layout/Precision captures geometric and topological accuracy, including coordinate alignment. Readability/Occlusion captures label legibility and the absence of garbling or occlusion. Scientific Plausibility captures conformity to domain axioms such as valency or Newtonian mechanics. Expressiveness/Richness captures completeness and contextual clarity.
The full evaluation pipeline combines automated judging, inverse validation, auxiliary perceptual metrics, and human verification. LMM-as-Judge provides dimensional scores and critiques. Inverse validation uses a strong VQA engine, also Gemini-3-Flash. Traditional metrics—PSNR, SSIM, CLIP, and FID—are computed only on the real-image SeePhys subset and are treated as auxiliary rather than definitive measures of scientific correctness. The reported alignment analysis indicates that perceptual metrics correlate weakly with structure-sensitive dimensions, whereas LMM-as-Judge and inverse validation align better with logical fidelity and predict downstream utility gains.
The benchmark also frames scientific image generation as conditional generation under latent axioms. This suggests that the ideal output is not merely an image that matches textual semantics, but one that maximizes the probability of the correct reasoning outcome under domain constraints.
4. Generation paradigms and ImgCoder
SciGenBench is not only an evaluation dataset; it is also a comparative framework for multiple generation paradigms. The study evaluates direct pixel-based text-to-image systems, programmatic synthesis, and ImgCoder, a logic-driven workflow that separates reasoning from rendering (Lin et al., 17 Jan 2026).
The direct pixel-based category includes open-source models such as HunyuanImage-3.0 and Qwen-Image, and closed models such as GPT-Image-1, GPT-Image-1.5, Seedream-4.0, Flux2-flex, Nanobanana, and Nanobanana-Pro. Prompting uses a constraint-injection strategy described as textbook style, with no answer leakage and explicit visualization of entities.
Programmatic synthesis uses executable specifications, for example in Python with Matplotlib, to render figures deterministically. In this setting, structural precision is enforced by code execution rather than being inferred probabilistically from a generative prior. This is particularly important when exact coordinates, axes, circuit topology, or geometric constraints determine correctness.
ImgCoder formalizes programmatic synthesis as an “Understand → Plan → Code” workflow. In the Understand stage, the system identifies entities, relations, coordinates, and domain constraints. In the Plan stage, it specifies content, layout, coordinate systems or topology, labels, exact anchor points, and drawing constraints checked against axioms, while avoiding solution leakage. In the Code stage, it emits a complete, runnable Matplotlib script that renders the diagram deterministically. Intermediate specifications can include analytic plotting code, DSL-like instructions for circuits, and geometry drawings with equal-aspect constraints and precise annotations. Correctness is then checked through compilation or execution success, followed by inverse quiz validation and judge scoring.
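The compilation-or-execution check at the end of this workflow can be sketched as a small gate. The version below is an assumption-laden illustration: `execution_gate` and its report fields are hypothetical names, the scripts are dependency-free stand-ins rather than real Matplotlib programs, and a production pipeline would run generated code in an isolated subprocess instead of calling `exec` in-process.

```python
def execution_gate(script: str) -> dict:
    """Report whether a generated script compiles and then runs without raising.

    Illustrative only: running untrusted code in-process is unsafe, and a real
    pipeline would sandbox it.
    """
    report = {"compiled": False, "executed": False, "error": None}
    try:
        code = compile(script, "<generated>", "exec")
        report["compiled"] = True
        exec(code, {"__name__": "__main__"})
        report["executed"] = True
    except Exception as exc:  # SyntaxError is also a subclass of Exception
        report["error"] = f"{type(exc).__name__}: {exc}"
    return report


# Hypothetical stand-ins for generated scripts (no plotting library required).
good_script = (
    "xs = [x / 10 for x in range(-20, 21)]\n"
    "ys = [x * x for x in xs]\n"
    "assert min(ys) == 0.0\n"
)
broken_syntax = "ys = [x * x for x in\n"
broken_runtime = "ys = undefined_name\n"

for name, script in [("good", good_script),
                     ("syntax", broken_syntax),
                     ("runtime", broken_runtime)]:
    print(name, execution_gate(script))
```

Only scripts that pass this gate would go on to inverse quiz validation and judge scoring; separating the compile and execute flags also makes it possible to log which kind of failure occurred.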
A central conclusion from this comparison is a fundamental expressiveness–precision trade-off. Code-based synthesis guarantees structural precision, such as exact plotting of a specified function, correct intercepts or extrema, and precise axes. Pixel-based models can produce visually rich images, especially in visually textured settings, but often fail fine quantitative constraints in plots or dense diagrams. This suggests that scientific image synthesis cannot be reduced to generic aesthetic generation without loss of reasoning-critical content.
5. Empirical findings and downstream utility
The empirical study reports systematic differences across paradigms, models, and domains. Among pixel-based generators, Nanobanana-Pro is the strongest, with judge scores of 1.59 for Correctness/Fidelity, 1.87 for Layout/Precision, 1.98 for Readability/Occlusion, 1.72 for Scientific Plausibility, and 1.93 for Expressiveness/Richness. Among programmatic methods, Gemini-3-Pro-ImgCoder is reported with judge scores between 1.82 and 1.93 across dimensions: 1.82 for Correctness/Fidelity, 1.93 for Layout/Precision, 1.91 for Readability/Occlusion, 1.93 for Scientific Plausibility, and 1.90 for Expressiveness/Richness. Gemini-3-Flash-ImgCoder also performs strongly. By contrast, Qwen-Image and HunyuanImage-3.0 remain below 40% on the inverse validation rate and below 0.8 on correctness and layout, which the study identifies as evidence of a persistent open-source gap in the absence of explicit constraint mechanisms (Lin et al., 17 Jan 2026).
The error taxonomy clarifies why these gaps persist. Reported failure modes include compositional errors such as incorrect counts or attribute misbinding, rendering errors such as blurred lines or illegible text, structural errors such as broken parallelism or non-closed curves, dense data errors such as axis drift or misaligned ticks, and domain knowledge errors such as invalid valencies or implausible optical paths. Structural and dense-data failures are described as the most persistent bottlenecks for probabilistic pixel generation, whereas deterministic code execution reduces these failure modes.
Performance also varies by subject. ImgCoder variants consistently outperform pixel-based models in Mathematics, Physics, and Universal diagram types, where strict geometric or topological alignment is critical; Gemini-3-Pro-ImgCoder is reported as strongest, with inverse validation rates of up to approximately 69.86% in Mathematics, approximately 75.39% in Physics, and approximately 72.85% in Universal. Biology and some visually rich Chemistry subfields show relative advantages for pixel-based models, with Nanobanana-Pro particularly strong in Biology. Chemistry is mixed: molecular structures favor programmatic precision, whereas crystals and reaction schemes benefit from pixel expressiveness.
A further finding is a distributional gap between synthetic and real scientific figures. Even strong generators remain stylistically distinct from authentic visuals. FID reductions do not necessarily track structural correctness. The reported t-SNE analysis on CLIP embeddings shows separable synthetic and real clusters, and spectral analysis shows higher high-frequency energy, described as “digital sharpness,” in synthetic outputs.
SciGenBench also evaluates downstream utility through multimodal fine-tuning. Qwen3-VL-8B-Instruct is fine-tuned on synthetic, rigorously validated images from different generators, using multimodal adaptation that masks textual cues so that reasoning must rely on visual evidence. Training uses VeRL with 200 steps, batch 128, and 8 rollouts per prompt, and Compass-Verifier-8B supplies reward and evaluation judgments. On Geometry3K-test and MathVision-mini, the baseline average is 54.5. Fine-tuning on synthetic images yields 57.1 with Qwen-Image, 57.4 with Qwen-ImgCoder, 58.1 with Nanobanana-Pro, and 58.0 with Gemini-ImgCoder. Filtered subsets perform best: Nanobanana-Pro (Filt) reaches 58.2, which is reported as +3.7 absolute over baseline, and Qwen-Image (Filt) reaches 57.8. The study further reports that quality matters more than quantity, and that accuracy scales roughly log-linearly with data size, including an example gain of +2.2 points from approximately 50 to approximately 1.4K samples, without saturation.
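The arithmetic behind these figures can be checked directly. The sketch below recomputes the reported +3.7 gain and, as a rough two-point estimate only, the slope implied by the +2.2 example if accuracy were exactly linear in the base-10 logarithm of the sample count. Treating "approximately 1.4K" as 1400 and the variable names are assumptions made for illustration.

```python
import math

# Reported averages on the two evaluation sets.
baseline = 54.5
best_filtered = 58.2  # Nanobanana-Pro (Filt)
gain = round(best_filtered - baseline, 1)
print(gain)  # prints: 3.7

# Two-point estimate under an assumed model: accuracy = a + b * log10(n).
# "Approximately 1.4K" is taken as 1400 samples for this calculation.
n_small, n_large, delta = 50, 1400, 2.2
slope_per_decade = delta / math.log10(n_large / n_small)
print(round(slope_per_decade, 2))  # prints: 1.52
```

Under these assumptions, each tenfold increase in training samples would correspond to roughly 1.5 points of accuracy; the source reports only the two endpoints, so this slope is an illustration rather than a fitted result.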
6. Position in the literature, usage, limitations, and extensions
SciGenBench is positioned against conventional text-to-image evaluation regimes that emphasize perceptual similarity or coarse semantic alignment. The benchmark’s distinctive contribution is the insistence on visually grounded atomic quizzes, blind filtration to ensure visual necessity, multi-dimensional LMM-as-Judge scoring for structural correctness and scientific plausibility, and explicit treatment of programmatic validators and deterministic rendering as first-class elements of the evaluation problem (Lin et al., 17 Jan 2026).
For practical use, the recommended protocol is explicit. A model should generate images from curated scientific instructions under a textbook-style prompt policy that avoids solution leakage and makes given values explicit. Evaluation should then compute the inverse validation rate using the provided quiz sets and a strong VQA engine, and run LMM-as-Judge across the five rubric dimensions. PSNR, SSIM, and FID on SciGenBench-SeePhys are optional and auxiliary only. For downstream testing, a target LMM can be fine-tuned using multimodal adaptation that hides numeric parameters in text while keeping them in the image, followed by evaluation on GEO3K and MathVision. The study identifies several recurrent pitfalls: over-reliance on FID or CLIP can conceal logical errors, unconstrained prompting can introduce solution leakage or spurious elements, and dense data regimes such as tables, grids, and topologically constrained diagrams are especially brittle for pixel models.
Reproducibility requires consistent judge prompts and VQA engines, blind filtration of quizzes to preserve visual dependence, logging of code compilation and execution success for programmatic methods, retry or error-recovery mechanisms, and per-domain reporting because performance varies substantially by subject and image type. Availability is via the project page at https://SciGenbench.github.io, with code, data, and documentation linked there. The paper is from OpenDataLab and collaborating institutions.
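Per-domain reporting, as recommended above, amounts to grouping pass/fail outcomes by subject before averaging. The following is a minimal sketch with hypothetical records; the function name and the `(subject, passed)` record layout are assumptions rather than the project's actual data format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def per_subject_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (subject, passed_all_quizzes) records and return a pass rate per subject."""
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for subject, passed in records:
        totals[subject] += 1
        passes[subject] += int(passed)
    return {subject: passes[subject] / totals[subject] for subject in totals}


# Hypothetical outcomes for five generated images.
records = [
    ("Physics", True),
    ("Physics", False),
    ("Biology", True),
    ("Physics", True),
    ("Biology", True),
]
rates = per_subject_rates(records)
print(rates)
```

Reporting the rates separately keeps a strong subject from masking a weak one, which a single pooled average would do.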
The framework also has explicit limitations. Reliance on closed-source LMMs for judging and VQA can introduce evaluator bias, although human verification mitigates rather than eliminates this problem. The dataset size, approximately 1.4K, is described as modest. The real–synthetic distributional gap remains unresolved, and standard metrics may remain misleading under those conditions. Ethical considerations include avoiding solution leakage, ensuring adherence to scientific conventions, and reducing hallucinations through verified text sources and expert review.
Future work is described along four lines: hybrid generation strategies that combine pixel expressiveness with code-level precision, domain-specific validators such as valency checkers or circuit topology parsers, larger-scale multimodal scientific data engines to probe stronger scaling laws, and methods to bridge the real–synthetic gap, including style transfer or spectral regularization, without sacrificing logical fidelity.
In a complementary direction, Metal-Sci extends the idea of purpose-built scientific benchmarking to scientific compute kernels rather than image synthesis. Its relevance to SciGenBench lies in its roofline-anchored scoring, structured multi-signal feedback, and especially its held-out end-of-run gate as a mechanical oversight primitive for catching silent failures not visible from in-distribution scores (Gallego, 10 May 2026). A plausible implication is that analogous held-out oversight mechanisms could further strengthen SciGenBench-style evaluations whenever models are assessed as agentic optimizers rather than only as one-shot generators.
SciGenBench therefore occupies a specific methodological niche: it operationalizes scientific image synthesis as a correctness-critical multimodal problem in which information utility, structural validity, and downstream reasoning gains must be measured jointly rather than inferred from perceptual quality alone.