IGenBench: Evaluating T2I Reliability
- IGenBench is a public benchmark that evaluates text-to-infographic systems by testing the factual and semantic correctness of complex visual artifacts.
- It employs 600 manually curated cases across 30 infographic archetypes using a taxonomy-driven model with ten atomic yes/no verification questions.
- Automated metrics like question-level and infographic-level accuracy reveal critical failures in data completeness, ordering, and encoding in current T2I models.
IGenBench is a public benchmark for the rigorous evaluation of the reliability of text-to-infographic (T2I) generation systems. Reliability is defined as the factual and semantic correctness of generated infographics, which are complex visual artifacts integrating structured data visualizations, textual elements, and decorative illustrations. Unlike standard text-to-image benchmarks, IGenBench systematically diagnoses whether T2I models can precisely render both the required data encodings and supporting textual/visual components as demanded by high-stakes applications in analytics and scientific communication (Tang et al., 8 Jan 2026).
1. Motivation and Scope
Reliability in text-to-infographic synthesis is distinctively challenging due to the composite nature of infographics, which tightly couple visual encodings (e.g., bar/line charts, pie slices, treemaps) with textual labels, annotations, and sometimes purely decorative graphics. Generic T2I evaluation protocols primarily assess prompt-image alignment and aesthetic quality but lack granularity in verifying that all data values, categories, axis scales, and semantic pairings (such as legends or callouts) are correctly expressed. Even small defects—such as omitted data series, swapped legend entries, or inaccurate quantitative encoding—can gravely mislead end-users. IGenBench was designed to expose these latent failures by prioritizing atomic, interpretable verification (Tang et al., 8 Jan 2026).
Its primary contributions are:
- The first benchmark to emphasize factual and semantic reliability specifically for text-to-infographic generation.
- Coverage of 600 manually curated test cases across 30 infographic archetypes, capturing breadth of real-world infographic styles and layouts.
- A taxonomy-driven evaluation model with 10 atomic yes/no question types targeting both design fidelity and data-fidelity dimensions.
- An automated verification pipeline using a state-of-the-art multimodal LLM (MLLM), yielding quantitative, interpretable metrics at both atomic and holistic levels.
2. Taxonomy of Atomic Verification Questions
IGenBench formalizes reliability assessment via ten atomic yes/no question types, collectively spanning the full spectrum of infographic correctness. These are:
| Question Type | Verification Target | Example Application |
|---|---|---|
| Title / Subtitle | Main heading and subheadings conform to prompt | Ensures precise infographic titling |
| Chart / Diagram Type | Rendered visualization type matches the specification | "Pie chart" vs. "treemap" error detection |
| Decorative / Non-data | Decorative icons/images correct or omitted as required | Detects extraneous clipped art |
| Annotations / Callouts | Numeric/explanatory labels in correct locations | "+45%" label adjacency |
| Axes / Scales | Axis lines, tick marks, ranges, and labels accurately rendered | y-axis max labeled as "50" |
| Legend / Category Mapping | Legend keys present and correctly mapped to categories | Color-to-series correctness |
| Data Marks | Visual encodings instantiated for every data item | Bar/slice/dot count precision |
| Data Completeness | No data points missing or extraneous | "Exactly 9 brands" match |
| Data Ordering | Sequence or sorting matches prompt logic | Ascending/descending order assessment |
| Data Encoding | Visual properties (size/area/color) map to the data values | Proper proportionality of chart elements |
Questions are instantiated per test case by parsing prompt constraints into "prompt-derived" questions and augmenting these with expert-informed requirements targeting data completeness, ordering, and quantitative encodings. This yields 7–11 atomic questions per infographic, each mapping directly onto a specific visual or textual element (Tang et al., 8 Jan 2026).
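The instantiation logic above can be sketched in code. This is an illustrative reconstruction, not the benchmark's actual implementation: the type names, the `AtomicQuestion` class, and the `instantiate_questions` helper are all hypothetical, but the split between prompt-derived questions and the three unconditionally added expert-informed data-fidelity dimensions follows the taxonomy described in the text.

```python
from dataclasses import dataclass

# The ten atomic verification dimensions of the IGenBench taxonomy
# (names here are illustrative shorthand for the table above).
QUESTION_TYPES = [
    "title_subtitle", "chart_type", "decorative", "annotations",
    "axes_scales", "legend_mapping", "data_marks",
    "data_completeness", "data_ordering", "data_encoding",
]

# Dimensions added from expert knowledge even when the prompt
# does not state them explicitly.
EXPERT_INFORMED = {"data_completeness", "data_ordering", "data_encoding"}

@dataclass(frozen=True)
class AtomicQuestion:
    qtype: str    # one of QUESTION_TYPES
    text: str     # yes/no question, e.g. "Is the y-axis max labeled '50'?"
    source: str   # "prompt-derived" or "expert-informed"

def instantiate_questions(prompt_constraints: dict[str, str]) -> list[AtomicQuestion]:
    """Build the per-case question set: one prompt-derived question per
    constraint parsed from the prompt, plus the three expert-informed
    data-fidelity dimensions (added even if the prompt omits them)."""
    questions = [
        AtomicQuestion(qtype, text, "prompt-derived")
        for qtype, text in prompt_constraints.items()
    ]
    for qtype in sorted(EXPERT_INFORMED - prompt_constraints.keys()):
        questions.append(AtomicQuestion(
            qtype,
            f"Does the infographic satisfy the {qtype} requirement?",
            "expert-informed"))
    return questions
```

A prompt with two explicit constraints would thus yield five atomic questions: two prompt-derived plus the three expert-informed data-fidelity checks.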
3. Dataset Construction and Test Case Generation
IGenBench construction began with a pool of approximately 42,000 real-world infographics from sources including Statista, Visual Capitalist, and ChartGalaxy. A hierarchical taxonomy defined 30 distinct chart/infographic types, distributed over six conceptual categories: Composition, Categorical Comparison, Trend/Evolution, Deviation/Gap, Correlation/Flow, and a multi-panel "bonus" category.
The dataset generation pipeline consisted of:
- Embedding all samples with an MLLM and performing k-means clustering within each infographic type.
- Selecting representative medoids from each cluster, then conducting manual review to remove redundant, low-quality, or illegible cases.
- Resulting in a benchmark set of 600 diverse, high-quality infographics, with all examples manually vetted.
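The cluster-then-medoid selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the `select_medoids` helper and the random stand-in embeddings are hypothetical, and the benchmark's actual cluster counts and embedding model are not specified here. A medoid is taken to be the real sample nearest its cluster centroid, so every selected representative is an actual infographic rather than a synthetic average.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_medoids(embeddings: np.ndarray, n_clusters: int, seed: int = 0) -> list[int]:
    """Cluster embeddings with k-means and return, per cluster, the index
    of the medoid: the member sample closest to the cluster centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    medoids = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        medoids.append(int(members[np.argmin(dists)]))
    return medoids

# Toy stand-in for real MLLM embeddings of one infographic archetype.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
reps = select_medoids(emb, n_clusters=5)
```

The returned indices would then go through the manual review pass described above before inclusion in the benchmark.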
Prompt generation for evaluation was engineered to systematically elicit both explicit feature implementation and potential model failures. For every reference infographic, an MLLM extracted:
- A structural design description enumerating layout, chart type, axes, legends, and decorative items, omitting superficial cosmetic details.
- The underlying data table in structured format.
Human annotators then verified and corrected each MLLM-generated prompt, concatenating the design specification and data into a rich, self-contained prompt (ranging from tens to thousands of tokens). These prompts enforce both layout and data-fidelity constraints, exposing failure modes such as missing visuals, wrong scales, spurious icons, or misassigned categories (Tang et al., 8 Jan 2026).
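The final concatenation step can be sketched as below. This is an assumption-laden illustration: the `build_prompt` helper, the markdown table rendering, and the example data are all hypothetical, and in the actual pipeline the design description and data table are MLLM-extracted and human-verified rather than hand-written.

```python
def build_prompt(design_spec: str, data_rows: list[dict]) -> str:
    """Concatenate a structural design description with the underlying
    data table (rendered as markdown) into one self-contained T2I prompt."""
    headers = list(data_rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "---|" * len(headers)]
    for row in data_rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    table = "\n".join(lines)
    return f"{design_spec.strip()}\n\nData:\n{table}"

# Hypothetical example case.
prompt = build_prompt(
    "Horizontal bar chart of 2024 EV sales by brand, sorted descending; "
    "legend omitted; values labeled in millions.",
    [{"brand": "A", "sales_m": 1.8}, {"brand": "B", "sales_m": 1.2}],
)
```

Because both the layout specification and the exact data values appear in the prompt, any omission, reordering, or mis-scaling in the generated image is directly checkable against it.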
4. Automated Evaluation Framework and Metrics
Verification in IGenBench is driven by automatically decomposing each prompt into atomic yes/no questions. The process:
- Parses explicit constraints from the prompt into "prompt-derived" questions.
- Supplements these with "expert-informed" dimensions (concretely, Data Completeness, Data Ordering, and Data Encoding), adding chart-specific questions even when the prompt does not state them explicitly.
- Takes the union of both sets, yielding a total of 5,259 atomic questions across the 600 infographics.
Evaluation employs the Gemini-2.5-Pro MLLM, which is tasked to verify, for each generated infographic and each atomic question $q_i$, whether the targeted fact or element is satisfied. The verdict is the indicator $\mathbb{1}[a_i = g_i]$, equal to $1$ if the element is unambiguously correct and $0$ otherwise (any ambiguity defaults to incorrect).
Two principal metrics quantify system reliability:
- Question-level Accuracy (Q-ACC):

$$\text{Q-ACC} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[a_i = g_i]$$

where $N$ is the total number of questions, $a_i$ is the model's yes/no answer, and $g_i$ is the ground truth.
- Infographic-level Accuracy (I-ACC):

$$\text{I-ACC} = \frac{1}{M} \sum_{j=1}^{M} \prod_{i \in Q_j} \mathbb{1}[a_i = g_i]$$

where $M$ is the number of infographics and $Q_j$ is the set of questions for infographic $j$. I-ACC thus measures strict end-to-end correctness, requiring all constraints per infographic to be satisfied.
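The two metrics are straightforward to compute from per-question verdicts. The sketch below is illustrative (the `q_acc`/`i_acc` names and the judgment format are hypothetical): each judgment pairs an infographic identifier with a binary verdict, 1 only when the verifier finds the element unambiguously correct.

```python
from collections import defaultdict

def q_acc(judgments: list[tuple[str, int]]) -> float:
    """Question-level accuracy: fraction of atomic questions judged
    correct, pooled over all infographics."""
    return sum(v for _, v in judgments) / len(judgments)

def i_acc(judgments: list[tuple[str, int]]) -> float:
    """Infographic-level accuracy: fraction of infographics whose atomic
    questions are ALL judged correct (strict end-to-end correctness)."""
    per_info = defaultdict(list)
    for info_id, v in judgments:
        per_info[info_id].append(v)
    return sum(all(vs) for vs in per_info.values()) / len(per_info)

# Two infographics: one fully correct, one with a single failed question.
judgments = [("ig1", 1), ("ig1", 1), ("ig2", 1), ("ig2", 0), ("ig2", 1)]
```

Here Q-ACC is 0.8 (four of five questions pass) while I-ACC is 0.5 (only one of two infographics passes every question), illustrating how a single atomic failure zeroes out an infographic's holistic score.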
5. Model Evaluation Results and Diagnostic Insights
Ten state-of-the-art T2I models were systematically evaluated on IGenBench. Results reveal a pronounced three-tier performance hierarchy:
| Tier | Model(s) | Q-ACC | I-ACC |
|---|---|---|---|
| Top | Nanobanana-Pro | 0.90 | 0.49 |
| Second | Seedream-4.5, GPT-Image-1.5 | 0.61 / 0.55 | 0.06 / 0.12 |
| Third | Remaining eight models | <0.48 | ≈ 0 |
The average across all models is Q-ACC = 0.39, I-ACC = 0.07. Notably, high question-level accuracy does not guarantee holistic correctness: even top-tier models demonstrate low or moderate I-ACC due to missed atomic requirements in key dimensions (Tang et al., 8 Jan 2026).
Analysis of per-dimension bottlenecks uncovers universal deficits in model performance concerning data-fidelity:
- Data Completeness: Q-ACC 0.21 (lowest)
- Data Encoding: 0.26
- Data Ordering: 0.27
Layout and text-oriented components are somewhat more robust (Title: 0.54, Legend: 0.46, Decorative: 0.66). Even the leading model (Nanobanana-Pro) reaches only 0.84 on Data Completeness and 0.86 on Data Encoding, below its overall Q-ACC of 0.90, indicating persistent challenges.
Key takeaways:
- Holistic correctness remains unachieved; most models frequently fail at least one critical dimension, resulting in very low I-ACC even when Q-ACC is strong.
- Data-fidelity is the principal limiting factor—current diffusion and transformer-based T2I methods excel at layout and aesthetics but lack fine-tuned quantitative reasoning for mapping structured data into precise visual encodings.
- The atomic, question-driven evaluation enables diagnosis of recurrent failure modes (e.g., omitted series, inexact chart scales, swapped legends), directly informing directions for remediation by future models.
6. Implications and Future Trajectories
IGenBench establishes the primacy of rigorous, interpretable, dimensionally-structured evaluation for infographic generation. Human oversight and post-editing remain indispensable for mission-critical domains; nevertheless, systematic benchmarks of this type can drive progress toward models that require less manual correction.
A plausible implication is that future T2I systems will require:
- Enhanced chart semantics understanding (e.g., axis/legend mapping modules).
- Explicit numeric reasoning pipelines to ensure correct scaling and data encoding.
- Advanced layout-driven attention mechanisms for completeness and order.
IGenBench provides quantitative baselines, diagnostic methodology, and a public testbed for advancing the reliability of generative infographic models toward demanding real-world deployment (Tang et al., 8 Jan 2026).