LPG-Bench: Long-Prompt Benchmark for T2I Models
- LPG-Bench is a long-prompt benchmark that assesses text-to-image models using 200 prompts of at least 250 words each, paired with detailed human annotations.
- It employs a multi-phase methodology to generate diverse, coherent, and compositionally rich prompts from 500 candidate themes refined for narrative consistency.
- The framework introduces TIT-Score, a novel metric measuring semantic alignment between long textual inputs and generated images, significantly outperforming prior metrics.
LPG-Bench is a long-prompt benchmark and evaluation framework designed to rigorously assess how well state-of-the-art text-to-image (T2I) models align with rich, detailed, and complex input instructions. Developed to address the limitations of existing short-prompt-centric benchmarks, LPG-Bench comprises a curated dataset of 200 prompts, each at least 250 words long, together with comprehensive human annotations and a novel evaluation metric, TIT-Score. The methods and results underpinning LPG-Bench reveal critical insights into the fidelity of image generation relative to extended textual inputs and offer meaningful advances in measurement methodology for the multimodal research community (Wang et al., 3 Oct 2025).
1. Construction of Long-Prompt Benchmark Dataset
LPG-Bench's dataset generation process employs a multi-phase methodology focused on maximizing prompt diversity and richness of detail:
- An initial set of 500 thematic candidates is generated via an LLM, such as Gemini 2.5 Pro.
- Manual screening selects 200 core themes to ensure coverage of distinct subjects.
- Each theme is expanded to satisfy a strict minimum word count (≥ 250 words), leveraging LLM-driven instructions for visual, stylistic, and compositional detail.
- Final prompts are manually refined to maintain narrative coherence and logical consistency.
This pipeline ensures that the prompts approach the input-length limits of commercial T2I systems and comprehensively stress-test the models' ability to resolve multi-faceted, compositional instructions.
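A minimal sketch of how such a multi-phase pipeline could be orchestrated is given below, assuming a hypothetical `generate(instruction)` helper that wraps an LLM API (e.g., Gemini 2.5 Pro); the helper, the instruction wording, and the screening stub are illustrative assumptions rather than the authors' released pipeline.

```python
# Illustrative sketch of the multi-phase prompt-construction pipeline.
# `generate` is a hypothetical wrapper around an LLM API; the exact
# instructions and selection logic used by LPG-Bench are not reproduced here.

MIN_WORDS = 250  # strict minimum word count per final prompt

def generate(instruction: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

def build_prompt_set(n_candidates: int = 500, n_core: int = 200) -> list[str]:
    # Phase 1: propose candidate themes with an LLM.
    themes = [generate(f"Propose a distinct visual theme ({i})") for i in range(n_candidates)]

    # Phase 2: manual screening down to n_core themes (stubbed as a slice here).
    core_themes = themes[:n_core]

    long_prompts = []
    for theme in core_themes:
        # Phase 3: expand each theme with visual, stylistic, and compositional
        # detail until it satisfies the minimum word count.
        prompt = generate(f"Expand into a detailed T2I prompt (>= {MIN_WORDS} words): {theme}")
        while len(prompt.split()) < MIN_WORDS:
            prompt = generate(f"Add more visual and compositional detail: {prompt}")
        # Phase 4: manual refinement for narrative coherence happens outside this script.
        long_prompts.append(prompt)
    return long_prompts
```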
2. Image Generation Protocol Across Multiple Model Architectures
Using the LPG-Bench prompt suite, the benchmark generates 2,600 images with 13 distinct state-of-the-art T2I models (200 prompts × 13 models). Represented architectures include diffusion, autoregressive, flow-based, and unified multimodal systems, spanning both closed-source commercial and open-source frameworks. This cross-section enables broad generalization of results and thorough coverage of contemporary T2I design paradigms.
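A compact sketch of this generation loop, assuming a uniform `generate_image(model_name, prompt)` adapter over the heterogeneous back-ends; the adapter and file layout are illustrative assumptions.

```python
from pathlib import Path

# Hypothetical uniform adapter over the 13 heterogeneous T2I back-ends
# (diffusion, autoregressive, flow-based, and unified multimodal systems).
def generate_image(model_name: str, prompt: str) -> bytes:
    raise NotImplementedError  # dispatch to the relevant API or checkpoint

def run_benchmark(models: list[str], prompts: list[str], out_dir: str = "lpg_bench_images") -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for model in models:
        for i, prompt in enumerate(prompts):
            image_bytes = generate_image(model, prompt)  # one image per (model, prompt) pair
            (out / f"{model}_{i:03d}.png").write_bytes(image_bytes)
    # 13 models x 200 prompts = 2,600 images in total
```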
3. Human Annotation and Consensus Aggregation
To establish an authoritative performance reference, LPG-Bench incorporates a large-scale human ranking protocol:
- 15 annotators apply detailed guidelines for pairwise model comparison on prompt adherence, with the option to mark ties.
- Clear consensus (e.g., ≥10 votes) determines the superior generated image; ambiguous cases invoke expert panel review.
- From the 200 prompts and the 2,600 associated images, 12,832 non-tied comparisons are compiled as the “gold standard”.
The systematic aggregation of expert judgments provides robust ground truth for evaluation metric benchmarking and reliably captures the nuanced aspects of long-prompt compliance.
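A minimal sketch of the consensus rule described above; the ≥10-vote cutoff follows the example in the text, while the per-comparison data layout is an assumption for illustration.

```python
from dataclasses import dataclass

CONSENSUS_THRESHOLD = 10  # clear-majority cutoff among the 15 annotators

@dataclass
class Comparison:
    prompt_id: int
    model_a: str
    model_b: str
    votes_a: int
    votes_b: int
    votes_tie: int

def aggregate(comparisons: list[Comparison]) -> list[tuple[Comparison, str]]:
    """Return non-tied gold-standard pairs together with the winning side."""
    gold = []
    for c in comparisons:
        if c.votes_a >= CONSENSUS_THRESHOLD:
            gold.append((c, "a"))
        elif c.votes_b >= CONSENSUS_THRESHOLD:
            gold.append((c, "b"))
        else:
            # ambiguous case: deferred to expert panel review (not modeled here)
            continue
    return gold
```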
4. TIT-Score: Text-to-Image-to-Text Consistency Metric
To address the inadequacy of existing metrics such as CLIP-Score and LMM-score on long prompts, LPG-Bench introduces TIT-Score. The approach is predicated on semantic consistency between the prompt and the image as interpreted by a vision-language model (VLM):
- Visual-to-Text Captioning: A selected VLM (e.g., BLIP2, Qwen-VL) generates a rich caption $T_{\text{gen}}$ describing the image $I$.
- Semantic Alignment Assessment:
  - Standard TIT-Score: An embedding model $E$ (e.g., Qwen3-Embedding) encodes both the original prompt $P$ and $T_{\text{gen}}$; cosine similarity computes the alignment: $\text{TIT-Score}(P, I) = \cos\big(E(P), E(T_{\text{gen}})\big)$.
  - TIT-Score-LLM: A powerful LLM (e.g., Gemini 2.5 Pro) compares $P$ and $T_{\text{gen}}$ directly for semantic similarity, without recourse to embedding compression.
Both variants are zero-shot methods, obviating the need for fine-tuning on annotated data.
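A minimal sketch of the standard TIT-Score computation under the notation above, assuming hypothetical `caption(image)` and `embed(text)` helpers that wrap a VLM and an embedding model respectively; the helpers and model choices are placeholders, not the released implementation.

```python
import numpy as np

def caption(image) -> str:
    """Hypothetical VLM call (e.g., a BLIP2- or Qwen-VL-style captioner) producing T_gen."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Hypothetical text-embedding call (e.g., a Qwen3-Embedding-style encoder)."""
    raise NotImplementedError

def tit_score(prompt: str, image) -> float:
    t_gen = caption(image)                  # image -> descriptive caption T_gen
    e_p, e_t = embed(prompt), embed(t_gen)  # encode prompt P and caption T_gen
    # cosine similarity between the two embeddings
    return float(e_p @ e_t / (np.linalg.norm(e_p) * np.linalg.norm(e_t)))

# TIT-Score-LLM would instead pass (prompt, t_gen) to a strong LLM judge and
# ask it to rate semantic similarity directly, skipping the embedding step.
```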
5. Comparative Experimental Evaluation and Metric Analysis
LPG-Bench's empirical protocol encompasses the following components:
- Pairwise Accuracy: The proportion of image pairs correctly rank-ordered versus human annotator consensus.
- Correlation Metrics: Spearman's rank correlation coefficient (SRCC), Kendall's rank correlation coefficient (KRCC), and normalized discounted cumulative gain (nDCG) for agreement with the human-derived model leaderboard. A minimal computation sketch follows this list.
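The sketch below shows how these agreement measures could be computed against the human gold standard using SciPy's rank-correlation functions; the score dictionary and ranking-list layouts are assumptions, and nDCG is omitted for brevity.

```python
from scipy.stats import spearmanr, kendalltau

def pairwise_accuracy(metric_scores: dict, gold_pairs: list[tuple]) -> float:
    """Fraction of human-preferred pairs that the metric orders the same way.
    gold_pairs holds (winner_image_id, loser_image_id) tuples from annotator consensus."""
    correct = sum(metric_scores[winner] > metric_scores[loser] for winner, loser in gold_pairs)
    return correct / len(gold_pairs)

def leaderboard_agreement(metric_ranking: list[float], human_ranking: list[float]) -> tuple[float, float]:
    srcc, _ = spearmanr(metric_ranking, human_ranking)   # Spearman rank correlation (SRCC)
    krcc, _ = kendalltau(metric_ranking, human_ranking)  # Kendall's tau (KRCC)
    return srcc, krcc
```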
Key findings include:
- Existing automated metrics (CLIP-Score, BLIPv2-Score) demonstrate poor consistency with human judgment in the long-prompt regime.
- The best baseline, LMM4LMM, is outperformed by TIT-Score-LLM with a 7.31% absolute improvement in pairwise accuracy (TIT-Score-LLM achieves 66.51% pairwise accuracy).
- TIT-Score correlates strongly with human leaderboards (SRCC ≈ 0.929), evidencing superior stability and fidelity on complex inputs.
6. Implications for Future Text-to-Image Systems and Benchmarking
The enhanced alignment capabilities of TIT-Score-LLM and the broad coverage of LPG-Bench have immediate and long-term implications:
- Decoupled visual and textual alignment evaluation is validated as an effective approach, especially as prompt length and compositional complexity increase.
- T2I models persistently struggle with long-form instruction adherence, motivating further research into both prompt-parsing and image synthesis mechanisms.
- LPG-Bench and TIT-Score provide reliable public resources for iterative model evaluation, targeted fine-tuning, and rapid progress tracking within the multimodal domain.
7. Resource Availability and Benchmarking Utility
All data—including the long-prompt corpus, image sets from multiple architectures, annotator rankings, and TIT-Score implementations—are designated for public release. This enables reproducibility and comparability across new T2I system designs and supports the evolution of alignment-sensitive multimodal benchmarks.
LPG-Bench sets a new technical standard for evaluating and comparing model performance on rich, realistic, and instructionally challenging input, with its associated TIT-Score methodology offering precise, scalable, and human-consistent assessment in the contemporary research landscape (Wang et al., 3 Oct 2025).