IGenBench: T2I Infographic Benchmark

Updated 9 February 2026

IGenBench is a benchmarking framework that rigorously evaluates text-to-infographic generation models by addressing key semantic and data fidelity challenges.
It utilizes a curated dataset of 600 real-world infographics and a three-stage automated evaluation process focusing on completeness, ordering, and encoding of data.
Experimental analysis reveals a three-tier performance stratification among T2I models, highlighting persistent reliability bottlenecks in data representation and chart semantics.

IGenBench is a benchmarking framework designed to rigorously evaluate the reliability of text-to-infographic (T2I) generation models. It is the first resource to systematically address the end-to-end semantic and data fidelity challenges unique to infographics, providing a curated, automatically evaluated, and highly interpretable protocol focused on real-world infographic generation tasks (Tang et al., 8 Jan 2026).

1. Motivation and Context

Infographics are structured visual artifacts that integrate data visualizations (e.g., charts, maps), textual components (e.g., titles, legends), and illustrative elements (e.g., decorative icons) to convey complex information. State-of-the-art T2I models—such as Stable Diffusion variants, Nanobanana-Pro, and GPT-Image—have demonstrated strong capabilities in producing visually appealing and text-rich scenes but lack guarantees for faithful data encoding and correct chart semantics. This deficiency manifests in frequent, yet subtle, failure modes, such as visually plausible but quantitatively incorrect bar heights, missing or extraneous data elements, disordered categories, and semantically garbled textual annotations.

Prior evaluation resources have concentrated either on natural-image prompt adherence and photorealism (e.g., DrawBench, T2I-CompBench, EvalMuse-40K) or on isolated chart code and reasoning tasks (e.g., VisJudge, MatPlotBench); none supports a holistic assessment of how reliably T2I models generate infographics. IGenBench addresses this gap with a benchmark of realistic, diverse infographic scenarios and an automated, interpretable evaluation protocol that exposes both semantic and data-centric errors (Tang et al., 8 Jan 2026).

2. Dataset Construction and Task Taxonomy

IGenBench comprises 600 curated test cases drawn from a pool of approximately 42,000 real infographics sourced from Statista, Visual Capitalist, and the ChartGalaxy real-world corpus. The test cases span 30 infographic types organized into six high-level categories: Composition, Categorical Comparison, Trend/Evolution, Deviation/Gap, Correlation/Flow, and complex multipanel (“Bonus”) layouts.

Semantic diversity is ensured by per-type k-means clustering over embedding vectors, followed by medoid sampling (Algorithm 1 in the source) and subsequent manual filtration of templated or low-quality cases. For each benchmark sample, a human-in-the-loop protocol extracts (a) a structural design specification (“Create an infographic that…”, detailing chart type, layout, encoding, text, and icons), and (b) the associated ground-truth data table. Both are merged into a single, self-contained T2I prompt that concludes with “The given data is: {data}.”
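The clustering-then-medoid step described above can be sketched as follows. This is a minimal illustration and not the source's Algorithm 1: the function names (`kmeans`, `medoid_indices`, `sample_representatives`), the Euclidean distance, and the plain Lloyd's iteration are all assumptions made for the example, and the embedding model is left unspecified.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's k-means on rows of X; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points (an assumed choice).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

def medoid_indices(X, labels):
    """For each non-empty cluster, return the index of the member whose
    total distance to the other members is smallest (the medoid)."""
    picks = []
    for j in np.unique(labels):
        members = np.flatnonzero(labels == j)
        pts = X[members]
        pairwise = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        picks.append(int(members[pairwise.sum(axis=1).argmin()]))
    return picks

def sample_representatives(X, k):
    """Cluster the embeddings of one infographic type and keep one medoid per cluster."""
    _, labels = kmeans(X, k)
    return medoid_indices(X, labels)
```

Picking the medoid rather than the centroid guarantees that every selected representative is an actual infographic from the pool, which is what a manual filtering pass afterwards would need.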

To evaluate generated outputs, IGenBench formalizes infographic fidelity as a set of binary ("atomic") questions derived from a taxonomy comprising ten categories: Title/Subtitle, Chart/Diagram Type, Decorative/Non-data Elements, Annotations/Callouts, Axes/Scales, Legend/Category Mapping, Data Marks, Data Completeness, Data Ordering, and Data Encoding (Tang et al., 8 Jan 2026).
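One way to represent this taxonomy in code is an enumeration of the ten categories plus a small record type for each atomic question. The names `Category`, `AtomicQuestion`, and `group_by_category` are hypothetical and chosen only for illustration; the source does not prescribe a data model.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    """The ten question categories listed in the taxonomy."""
    TITLE_SUBTITLE = "Title/Subtitle"
    CHART_TYPE = "Chart/Diagram Type"
    DECORATIVE = "Decorative/Non-data Elements"
    ANNOTATIONS = "Annotations/Callouts"
    AXES_SCALES = "Axes/Scales"
    LEGEND = "Legend/Category Mapping"
    DATA_MARKS = "Data Marks"
    DATA_COMPLETENESS = "Data Completeness"
    DATA_ORDERING = "Data Ordering"
    DATA_ENCODING = "Data Encoding"

@dataclass(frozen=True)
class AtomicQuestion:
    """A single yes/no question about a generated infographic."""
    text: str
    category: Category

def group_by_category(questions):
    """Bucket questions by category so scores can be reported per dimension."""
    buckets = defaultdict(list)
    for q in questions:
        buckets[q.category].append(q)
    return dict(buckets)
```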

3. Automated Evaluation Framework

The evaluation framework is a three-stage process, explicitly designed for scalability and strict correctness:

Stage 1: Prompt Decomposition. For each infographic generation task, explicit constraints (e.g., presence of titles, chart types) are automatically mapped to yes/no questions ($Q_p$).
Stage 2: Expert-Informed Augmentation. To capture critical data-related reliability criteria, three “seed” dimensions (Data Completeness, Data Ordering, Data Encoding) are instantiated as chart-specific questions ($Q_e$), e.g., “Are there exactly $n$ bars, one per category?” or “Are bar lengths proportional to the given values?”
Stage 3: Multimodal LLM Verification. For each generated infographic $I$ and question $q_i \in Q = Q_p \cup Q_e$, an off-the-shelf multimodal LLM is used as an evaluator. The selected model (Gemini-2.5-Pro) demonstrates high alignment with human judgments (Pearson $r=0.90$ on a held-out subset). The evaluation is strictly binary:

$\mathbb{I}(I, q_i) = \begin{cases} 1, & \text{if } q_i \text{ is clearly satisfied by } I \\ 0, & \text{otherwise} \end{cases}$

Ambiguous or partially met criteria are scored as 0, enforcing rigorous end-to-end fidelity.
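The strict binary rule can be illustrated with a small scoring wrapper. This is a sketch under stated assumptions: `ask` is a hypothetical callable standing in for whatever multimodal LLM call is used, and the convention that only a plain "yes" counts as satisfied is an assumed parsing rule, not one documented in the source.

```python
from typing import Callable, List

def strict_score(verdict: str) -> int:
    """Return 1 only for an unambiguous 'yes'; anything else scores 0."""
    return 1 if verdict.strip().lower().rstrip(".") == "yes" else 0

def evaluate(image: object, questions: List[str],
             ask: Callable[[object, str], str]) -> List[int]:
    """Score one generated infographic against its question set Q = Q_p ∪ Q_e.

    `ask(image, question)` is a placeholder for the verifier call and is
    expected to return a short free-text verdict.
    """
    return [strict_score(ask(image, q)) for q in questions]

# Hypothetical stub verifier, used here only to show the scoring behaviour.
def stub_ask(image, question):
    canned = {
        "Is there a title?": "Yes",
        "Are there exactly 5 bars, one per category?": "Partially",
        "Are bar lengths proportional to the given values?": "No",
    }
    return canned.get(question, "Unclear")
```

With the stub above, the hedged "Partially" verdict scores 0 just like an outright "No", which mirrors the rule that ambiguous or partially met criteria do not pass.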

4. Evaluation Metrics

IGenBench introduces two central metrics to diagnose model reliability:

Question-level Accuracy (Q-ACC):

The fraction of all verification questions correctly satisfied across the benchmark.

$\text{Q-ACC} = \frac{1}{|\mathcal{Q}|} \sum_{q_i \in \mathcal{Q}} \mathbb{I}(I, q_i)$

Here, $\mathcal{Q}$ denotes the union of all decomposed and augmented questions across all infographics, and $I$ in each term is the generated infographic that $q_i$ was posed about.

Infographic-level Accuracy (I-ACC):

The fraction of infographics for which all associated questions are correct (i.e., strict end-to-end correctness).

$\text{I-ACC} = \frac{1}{|\mathcal{I}|} \sum_{I \in \mathcal{I}} \mathbb{I} \Bigl( \sum_{q_i \in \mathcal{Q}(I)} \mathbb{I}(I, q_i) = |\mathcal{Q}(I)| \Bigr)$

Q-ACC functions as a fine-grained per-dimension reliability probe, while I-ACC captures holistic infographic success (Tang et al., 8 Jan 2026).
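Both metrics follow directly from the per-question binary scores. The sketch below assumes the scores are stored as a mapping from an infographic identifier to its list of 0/1 results; that layout and the function names are illustrative choices, not part of the benchmark's published tooling.

```python
from typing import Dict, List

def q_acc(results: Dict[str, List[int]]) -> float:
    """Fraction of all questions, pooled over every infographic, that pass."""
    total = sum(len(scores) for scores in results.values())
    passed = sum(sum(scores) for scores in results.values())
    return passed / total

def i_acc(results: Dict[str, List[int]]) -> float:
    """Fraction of infographics for which every associated question passes."""
    perfect = sum(1 for scores in results.values() if all(scores))
    return perfect / len(results)

# Toy scores (made up for illustration): 9 of 10 questions pass,
# but only 2 of 3 infographics are fully correct.
toy = {
    "infographic_a": [1, 1, 1],
    "infographic_b": [1, 0, 1],
    "infographic_c": [1, 1, 1, 1],
}
```

In the toy data a single failed question drops one infographic out of I-ACC entirely while barely moving Q-ACC, which is the asymmetry the two metrics are meant to expose.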

5. Experimental Analysis of T2I Models

A comprehensive evaluation of ten contemporary T2I models—including open-source systems (Qwen-Image, HiDream-I1, FLUX.1-dev, Z-Image-Turbo) and closed-source commercial solutions (Seedream 4.5, Nanobanana, Nanobanana-Pro, GPT-Image-1.5, Image-01, P-Image)—was conducted. The results reveal a pronounced three-tier performance stratification:

Tier	Representative Model	Q-ACC	I-ACC
1	Nanobanana-Pro	0.90	0.49
2	Seedream 4.5	0.61	0.06
2	GPT-Image-1.5	0.55	0.12
3	All remaining models	≤0.48	≈0.00

The aggregate average across all models is Q-ACC = 0.39 and I-ACC = 0.07. Notably, the highest-performing model (Nanobanana-Pro) achieves near-human Q-ACC but only partial I-ACC (0.49), indicating that small errors (e.g., a wrong callout, mis-scaled bar, or missing legend) frequently lead to total infographic failure under strict evaluation rules. Data-related dimensions are universal bottlenecks:

Data Completeness: 0.21
Data Encoding: 0.26
Data Ordering: 0.27

Even best-in-class models (e.g., Nanobanana-Pro) have only moderate reliability on these axes (0.84 Data Completeness; 0.86 Data Encoding) (Tang et al., 8 Jan 2026).
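A back-of-the-envelope calculation helps explain how a Q-ACC of 0.90 can coexist with an I-ACC of 0.49. The sketch below assumes that every question passes independently with the same probability $p$, so an infographic with $n$ questions is fully correct with probability $p^n$. Both the independence assumption and the uniform pass rate are simplifications introduced here for illustration; the source does not report the number of questions per infographic.

```python
import math

def expected_i_acc(p: float, n: int) -> float:
    """Probability that all n questions pass, assuming independent passes at rate p."""
    return p ** n

def implied_question_count(q_acc: float, i_acc: float) -> float:
    """Solve p**n = i_acc for n, under the same independence assumption."""
    return math.log(i_acc) / math.log(q_acc)

# With p = 0.90, roughly 6.8 independent questions per infographic are
# enough to bring the all-pass probability down to about 0.49.
n_equiv = implied_question_count(0.90, 0.49)
```

Under these assumptions the all-pass probability decays geometrically in the number of questions, so even modest per-question error rates compound quickly under strict end-to-end scoring.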

6. Key Insights and Observations

Several high-level findings emerge from systematic analysis:

Model performance stratification reveals persistent capability gaps: infographics require compositional, data-accurate, and semantically aligned generation—outstripping the reliability of leading T2I architectures.
High Q-ACC does not guarantee robust I-ACC: End-to-end reliability is vulnerable to single-point failures, which are frequent in real-world, composite infographic tasks.
Data bottlenecks are pervasive: All evaluated models struggle with Data Completeness, Data Encoding, and Data Ordering, even when visual and textual elements appear superficially correct.
Strong score alignment (Pearson $r=0.90$ ) with human annotators supports the use of automated MLLM-based reliability evaluation.
Only moderate rank correlation with natural-image benchmarks (Spearman $\rho=0.78$ vs. LMArena) indicates that infographic generation is a markedly different problem space, resistant to straightforward transfer of natural-image T2I capabilities.
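The two correlation statistics cited above measure different things: Pearson $r$ compares the scores themselves, whereas Spearman $\rho$ compares only the rankings. A minimal NumPy sketch follows; the helper names are illustrative, the rank function assumes no tied values, and the sample arrays in the usage note are toy numbers, not data from the benchmark.

```python
import numpy as np

def pearson(x, y) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

def ranks(x) -> np.ndarray:
    """Ordinal ranks starting at 1 (assumes no ties, a simplification)."""
    order = np.argsort(x)
    r = np.empty(len(order), dtype=float)
    r[order] = np.arange(1, len(order) + 1)
    return r

def spearman(x, y) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

For example, `spearman([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])` gives 0.8: two adjacent pairs of models swap places between the two leaderboards, which is enough to pull the rank correlation noticeably below 1.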

Additional diagnostic findings include:

Data leakage checks show that performance is generally robust to benchmark contamination, though exceptions (e.g., GPT-Image-1.5) suggest caution for living/continually updated benchmarks.
Disagreement analysis highlights Data Encoding as the most ambiguous verification type (≈12% LLM–human disagreement), identifying a frontier for improving MLLM interpretability and domain calibration.
IGenBench currently does not quantify stylistic or aesthetic factors; evaluation remains narrowly focused on semantic and data reliability.

7. Implications and Future Directions

IGenBench demonstrates that current T2I systems—despite strong advances in visual realism and text rendering—remain unreliable for automated, high-reliability infographic creation. For progress towards trustworthy T2I infographics, promising research directions include:

Explicit data-value preservation modules and loss functions during model training.
Chart-aware architectures capable of maintaining chart semantics under diverse layouts.
Reasoning-in-the-loop pipelines that jointly optimize chart planning, compositional layout, visual encoding, and precise text rendering.
Coupling of fidelity benchmarking with new metrics for aesthetic or creative merit.
Ongoing evolution of IGenBench as a “living benchmark” to detect contamination and adapt to emerging T2I capabilities.

Collectively, IGenBench constitutes an interpretable, extensible platform for rigorous T2I infographic benchmarking, diagnosing the reliability bottlenecks in current-generation models and illuminating the path towards fully autonomous systems with robust semantic and data fidelity (Tang et al., 8 Jan 2026).


References

Tang et al. IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation (8 Jan 2026)
