IGenBench: Evaluating T2I Reliability
- IGenBench is a public benchmark that evaluates text-to-infographic systems by testing the factual and semantic correctness of complex visual artifacts.
- It employs 600 manually curated cases across 30 infographic archetypes using a taxonomy-driven model with ten atomic yes/no verification questions.
- Automated metrics like question-level and infographic-level accuracy reveal critical failures in data completeness, ordering, and encoding in current T2I models.
IGenBench is a public benchmark for the rigorous evaluation of the reliability of text-to-infographic (T2I) generation systems. Reliability is defined as the factual and semantic correctness of generated infographics, which are complex visual artifacts integrating structured data visualizations, textual elements, and decorative illustrations. Unlike standard text-to-image benchmarks, IGenBench systematically diagnoses whether T2I models can precisely render both the required data encodings and supporting textual/visual components as demanded by high-stakes applications in analytics and scientific communication (Tang et al., 8 Jan 2026).
1. Motivation and Scope
Reliability in text-to-infographic synthesis is distinctively challenging due to the composite nature of infographics, which tightly couple visual encodings (e.g., bar/line charts, pie slices, treemaps) with textual labels, annotations, and sometimes purely decorative graphics. Generic T2I evaluation protocols primarily assess prompt-image alignment and aesthetic quality but lack granularity in verifying that all data values, categories, axis scales, and semantic pairings (such as legends or callouts) are correctly expressed. Even small defects—such as omitted data series, swapped legend entries, or inaccurate quantitative encoding—can gravely mislead end-users. IGenBench was designed to expose these latent failures by prioritizing atomic, interpretable verification (Tang et al., 8 Jan 2026).
Its primary contributions are:
- The first benchmark to emphasize factual and semantic reliability specifically for text-to-infographic generation.
- Coverage of 600 manually curated test cases across 30 infographic archetypes, capturing breadth of real-world infographic styles and layouts.
- A taxonomy-driven evaluation model with 10 atomic yes/no question types targeting both design fidelity and data-fidelity dimensions.
- An automated verification pipeline using a state-of-the-art multimodal LLM (MLLM), yielding quantitative, interpretable metrics at both atomic and holistic levels.
2. Taxonomy of Atomic Verification Questions
IGenBench formalizes reliability assessment via ten atomic yes/no question types, collectively spanning the full spectrum of infographic correctness. These are:
| Question Type | Verification Target | Example Application |
|---|---|---|
| Title / Subtitle | Main heading and subheadings conform to prompt | Ensures precise infographic titling |
| Chart / Diagram Type | Rendered visualization type matches the specification | "Pie chart" vs. "treemap" error detection |
| Decorative / Non-data | Decorative icons/images correct or omitted as required | Detects extraneous clipped art |
| Annotations / Callouts | Numeric/explanatory labels in correct locations | "+45%" label adjacency |
| Axes / Scales | Axis lines, tick marks, ranges, and labels accurately rendered | y-axis max labeled as "50" |
| Legend / Category Mapping | Legend keys present and correctly mapped to categories | Color-to-series correctness |
| Data Marks | Visual encodings instantiated for every data item | Bar/slice/dot count precision |
| Data Completeness | No data points missing or extraneous | "Exactly 9 brands" match |
| Data Ordering | Sequence or sorting matches prompt logic | Ascending/descending order assessment |
| Data Encoding | Visual properties (size/area/color) map to the data values | Proper proportionality of chart elements |
Questions are instantiated per test case by parsing prompt constraints into "prompt-derived" questions and augmenting these with expert-informed requirements targeting data completeness, ordering, and quantitative encodings. This yields 7–11 atomic questions per infographic, each mapping directly onto a specific visual or textual element (Tang et al., 8 Jan 2026).
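The instantiation logic above can be sketched in code. This is an illustrative reconstruction, not the benchmark's actual implementation: the type names, the `AtomicQuestion` class, and the `instantiate_questions` helper are all hypothetical, but the split between prompt-derived questions and the three unconditionally added expert-informed data-fidelity dimensions follows the taxonomy described in the text.

```python
from dataclasses import dataclass

# The ten atomic verification dimensions of the IGenBench taxonomy
# (names here are illustrative shorthand for the table above).
QUESTION_TYPES = [
    "title_subtitle", "chart_type", "decorative", "annotations",
    "axes_scales", "legend_mapping", "data_marks",
    "data_completeness", "data_ordering", "data_encoding",
]

# Dimensions added from expert knowledge even when the prompt
# does not state them explicitly.
EXPERT_INFORMED = {"data_completeness", "data_ordering", "data_encoding"}

@dataclass(frozen=True)
class AtomicQuestion:
    qtype: str    # one of QUESTION_TYPES
    text: str     # yes/no question, e.g. "Is the y-axis max labeled '50'?"
    source: str   # "prompt-derived" or "expert-informed"

def instantiate_questions(prompt_constraints: dict[str, str]) -> list[AtomicQuestion]:
    """Build the per-case question set: one prompt-derived question per
    constraint parsed from the prompt, plus the three expert-informed
    data-fidelity dimensions (added even if the prompt omits them)."""
    questions = [
        AtomicQuestion(qtype, text, "prompt-derived")
        for qtype, text in prompt_constraints.items()
    ]
    for qtype in sorted(EXPERT_INFORMED - prompt_constraints.keys()):
        questions.append(AtomicQuestion(
            qtype,
            f"Does the infographic satisfy the {qtype} requirement?",
            "expert-informed"))
    return questions
```

A prompt with two explicit constraints would thus yield five atomic questions: two prompt-derived plus the three expert-informed data-fidelity checks.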
3. Dataset Construction and Test Case Generation
IGenBench construction began with a pool of approximately 42,000 real-world infographics from sources including Statista, Visual Capitalist, and ChartGalaxy. A hierarchical taxonomy defined 30 distinct chart/infographic types, distributed over six conceptual categories: Composition, Categorical Comparison, Trend/Evolution, Deviation/Gap, Correlation/Flow, and a multi-panel "bonus" category.
The dataset generation pipeline consisted of:
- Embedding all samples with an MLLM and performing k-means clustering within each infographic type.
- Selecting representative medoids from each cluster, then conducting manual review to remove redundant, low-quality, or illegible cases.
- Resulting in a benchmark set of 600 diverse, high-quality infographics, with all examples manually vetted.
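The cluster-then-medoid selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the `select_medoids` helper and the random stand-in embeddings are hypothetical, and the benchmark's actual cluster counts and embedding model are not specified here. A medoid is taken to be the real sample nearest its cluster centroid, so every selected representative is an actual infographic rather than a synthetic average.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_medoids(embeddings: np.ndarray, n_clusters: int, seed: int = 0) -> list[int]:
    """Cluster embeddings with k-means and return, per cluster, the index
    of the medoid: the member sample closest to the cluster centroid."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)
    medoids = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        medoids.append(int(members[np.argmin(dists)]))
    return medoids

# Toy stand-in for real MLLM embeddings of one infographic archetype.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 64))
reps = select_medoids(emb, n_clusters=5)
```

The returned indices would then go through the manual review pass described above before inclusion in the benchmark.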
Prompt generation for evaluation was engineered to systematically elicit both explicit feature implementation and potential model failures. For every reference infographic, an MLLM extracted:
- A structural design description enumerating layout, chart type, axes, legends, and decorative items, omitting superficial cosmetic details.
- The underlying data table in structured format.
Human annotators then verified and corrected each MLLM-generated prompt, concatenating the design specification and data into a rich, self-contained prompt (ranging from tens to thousands of tokens). These prompts enforce both layout and data-fidelity constraints, exposing failure modes such as missing visuals, wrong scales, spurious icons, or misassigned categories (Tang et al., 8 Jan 2026).
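The final concatenation step can be sketched as below. This is an assumption-laden illustration: the `build_prompt` helper, the markdown table rendering, and the example data are all hypothetical, and in the actual pipeline the design description and data table are MLLM-extracted and human-verified rather than hand-written.

```python
def build_prompt(design_spec: str, data_rows: list[dict]) -> str:
    """Concatenate a structural design description with the underlying
    data table (rendered as markdown) into one self-contained T2I prompt."""
    headers = list(data_rows[0].keys())
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "---|" * len(headers)]
    for row in data_rows:
        lines.append("| " + " | ".join(str(row[h]) for h in headers) + " |")
    table = "\n".join(lines)
    return f"{design_spec.strip()}\n\nData:\n{table}"

# Hypothetical example case.
prompt = build_prompt(
    "Horizontal bar chart of 2024 EV sales by brand, sorted descending; "
    "legend omitted; values labeled in millions.",
    [{"brand": "A", "sales_m": 1.8}, {"brand": "B", "sales_m": 1.2}],
)
```

Because both the layout specification and the exact data values appear in the prompt, any omission, reordering, or mis-scaling in the generated image is directly checkable against it.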
4. Automated Evaluation Framework and Metrics
Verification in IGenBench is driven by automatically decomposing each prompt into atomic yes/no questions. The process:
- Parses explicit constraints from the prompt into "prompt-derived" questions.
- Supplements these with "expert-informed" dimensions (concretely, Data Completeness, Data Ordering, and Data Encoding), adding chart-specific questions even when the prompt does not state them explicitly.
- Takes the union of both sets, yielding a total of 5,259 atomic questions across the 600 infographics.
Evaluation employs the Gemini-2.5-Pro MLLM, which is tasked to verify, for each generated infographic and each atomic question $q_i$, whether the targeted fact or element is satisfied. The verdict is the indicator $\mathbb{1}[a_i = g_i]$, equal to $1$ if the element is unambiguously correct and $0$ otherwise (any ambiguity defaults to incorrect).
Two principal metrics quantify system reliability:
- Question-level Accuracy (Q-ACC):

$$\text{Q-ACC} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[a_i = g_i]$$

where $N$ is the total number of questions, $a_i$ is the model's yes/no answer, and $g_i$ is the ground truth.
- Infographic-level Accuracy (I-ACC):

$$\text{I-ACC} = \frac{1}{M} \sum_{j=1}^{M} \prod_{i \in Q_j} \mathbb{1}[a_i = g_i]$$

where $M$ is the number of infographics and $Q_j$ is the set of questions for infographic $j$. I-ACC thus measures strict end-to-end correctness, requiring all constraints per infographic to be satisfied.
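The two metrics are straightforward to compute from per-question verdicts. The sketch below is illustrative (the `q_acc`/`i_acc` names and the judgment format are hypothetical): each judgment pairs an infographic identifier with a binary verdict, 1 only when the verifier finds the element unambiguously correct.

```python
from collections import defaultdict

def q_acc(judgments: list[tuple[str, int]]) -> float:
    """Question-level accuracy: fraction of atomic questions judged
    correct, pooled over all infographics."""
    return sum(v for _, v in judgments) / len(judgments)

def i_acc(judgments: list[tuple[str, int]]) -> float:
    """Infographic-level accuracy: fraction of infographics whose atomic
    questions are ALL judged correct (strict end-to-end correctness)."""
    per_info = defaultdict(list)
    for info_id, v in judgments:
        per_info[info_id].append(v)
    return sum(all(vs) for vs in per_info.values()) / len(per_info)

# Two infographics: one fully correct, one with a single failed question.
judgments = [("ig1", 1), ("ig1", 1), ("ig2", 1), ("ig2", 0), ("ig2", 1)]
```

Here Q-ACC is 0.8 (four of five questions pass) while I-ACC is 0.5 (only one of two infographics passes every question), illustrating how a single atomic failure zeroes out an infographic's holistic score.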
5. Model Evaluation Results and Diagnostic Insights
Ten state-of-the-art T2I models were systematically evaluated on IGenBench. Results reveal a pronounced three-tier performance hierarchy:
| Tier | Model(s) | Q-ACC | I-ACC |
|---|---|---|---|
| Top | Nanobanana-Pro | 0.90 | 0.49 |
| Second | Seedream-4.5, GPT-Image-1.5 | 0.61 / 0.55 | 0.06 / 0.12 |
| Third | Remaining eight models | <0.48 | ≈ 0 |
The average across all models is Q-ACC = 0.39, I-ACC = 0.07. Notably, high question-level accuracy does not guarantee holistic correctness: even top-tier models demonstrate low or moderate I-ACC due to missed atomic requirements in key dimensions (Tang et al., 8 Jan 2026).
Analysis of per-dimension bottlenecks uncovers universal deficits in model performance concerning data-fidelity:
- Data Completeness: Q-ACC 0.21 (lowest)
- Data Encoding: 0.26
- Data Ordering: 0.27
Layout and text-oriented components are somewhat more robust (Title: 0.54, Legend: 0.46, Decorative: 0.66). Even the leading model (Nanobanana-Pro) reaches only 0.84 on Data Completeness and 0.86 on Data Encoding, below its overall Q-ACC of 0.90, indicating persistent challenges.
Key takeaways:
- Holistic correctness remains unachieved; most models frequently fail at least one critical dimension, resulting in very low I-ACC even when Q-ACC is strong.
- Data-fidelity is the principal limiting factor—current diffusion and transformer-based T2I methods excel at layout and aesthetics but lack fine-tuned quantitative reasoning for mapping structured data into precise visual encodings.
- The atomic, question-driven evaluation enables diagnosis of recurrent failure modes (e.g., omitted series, inexact chart scales, swapped legends), directly informing directions for remediation by future models.
6. Implications and Future Trajectories
IGenBench establishes the primacy of rigorous, interpretable, dimensionally-structured evaluation for infographic generation. Human oversight and post-editing remain indispensable for mission-critical domains; nevertheless, systematic benchmarks of this type can drive progress toward models that require less manual correction.
A plausible implication is that future T2I systems will require:
- Enhanced chart semantics understanding (e.g., axis/legend mapping modules).
- Explicit numeric reasoning pipelines to ensure correct scaling and data encoding.
- Advanced layout-driven attention mechanisms for completeness and order.
IGenBench provides quantitative baselines, diagnostic methodology, and a public testbed for advancing the reliability of generative infographic models toward demanding real-world deployment (Tang et al., 8 Jan 2026).