LPG-Bench: Long-Prompt Benchmark for T2I Models
- LPG-Bench is a long-prompt benchmark that assesses text-to-image models using 200 prompts of at least 250 words each, paired with detailed human annotations.
- It employs a multi-phase methodology to generate diverse, coherent, and compositionally rich prompts from 500 candidate themes refined for narrative consistency.
- The framework introduces TIT-Score, a novel metric measuring semantic alignment between long textual inputs and generated images, significantly outperforming prior metrics.
LPG-Bench is a long-prompt benchmark and evaluation framework designed to rigorously assess how well state-of-the-art text-to-image (T2I) models align with rich, detailed, and complex input instructions. Developed to address the limitations of existing short-prompt-centric benchmarks, LPG-Bench comprises a curated dataset of 200 prompts, each at least 250 words long, together with comprehensive human annotations and a novel evaluation metric, TIT-Score. The methods and results underpinning LPG-Bench reveal critical insights into the fidelity of image generation relative to extended textual inputs and offer meaningful advances in measurement methodology for the multimodal research community (Wang et al., 3 Oct 2025).
1. Construction of Long-Prompt Benchmark Dataset
LPG-Bench's dataset generation process employs a multi-phase methodology focused on maximizing prompt diversity and richness of detail:
- An initial set of 500 thematic candidates is generated via an LLM, such as Gemini 2.5 Pro.
- Manual screening selects 200 core themes to ensure coverage of distinct subjects.
- Each theme is expanded to satisfy a strict minimum word count (≥ 250 words), leveraging LLM-driven instructions for visual, stylistic, and compositional detail.
- Final prompts are manually refined to maintain narrative coherence and logical consistency.
This pipeline ensures that the prompts approach the input-length limits of commercial T2I systems and comprehensively stress-test the models' ability to resolve multi-faceted, compositional instructions.
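A minimal sketch of how such a multi-phase pipeline could be orchestrated is given below, assuming a hypothetical `generate(instruction)` helper that wraps an LLM API (e.g., Gemini 2.5 Pro); the helper, the instruction wording, and the screening stub are illustrative assumptions rather than the authors' released pipeline.

```python
# Illustrative sketch of the multi-phase prompt-construction pipeline.
# `generate` is a hypothetical wrapper around an LLM API; the exact
# instructions and selection logic used by LPG-Bench are not reproduced here.

MIN_WORDS = 250  # strict minimum word count per final prompt

def generate(instruction: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError

def build_prompt_set(n_candidates: int = 500, n_core: int = 200) -> list[str]:
    # Phase 1: propose candidate themes with an LLM.
    themes = [generate(f"Propose a distinct visual theme ({i})") for i in range(n_candidates)]

    # Phase 2: manual screening down to n_core themes (stubbed as a slice here).
    core_themes = themes[:n_core]

    long_prompts = []
    for theme in core_themes:
        # Phase 3: expand each theme with visual, stylistic, and compositional
        # detail until it satisfies the minimum word count.
        prompt = generate(f"Expand into a detailed T2I prompt (>= {MIN_WORDS} words): {theme}")
        while len(prompt.split()) < MIN_WORDS:
            prompt = generate(f"Add more visual and compositional detail: {prompt}")
        # Phase 4: manual refinement for narrative coherence happens outside this script.
        long_prompts.append(prompt)
    return long_prompts
```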
2. Image Generation Protocol Across Multiple Model Architectures
Using the LPG-Bench prompt suite, the benchmark generates 2,600 images with 13 distinct state-of-the-art T2I models (200 prompts × 13 models). Represented architectures include diffusion, autoregressive, flow-based, and unified multimodal systems, spanning both closed-source commercial and open-source frameworks. This cross-section enables broad generalization of results and thorough coverage of contemporary T2I design paradigms.
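A compact sketch of this generation loop, assuming a uniform `generate_image(model_name, prompt)` adapter over the heterogeneous back-ends; the adapter and file layout are illustrative assumptions.

```python
from pathlib import Path

# Hypothetical uniform adapter over the 13 heterogeneous T2I back-ends
# (diffusion, autoregressive, flow-based, and unified multimodal systems).
def generate_image(model_name: str, prompt: str) -> bytes:
    raise NotImplementedError  # dispatch to the relevant API or checkpoint

def run_benchmark(models: list[str], prompts: list[str], out_dir: str = "lpg_bench_images") -> None:
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for model in models:
        for i, prompt in enumerate(prompts):
            image_bytes = generate_image(model, prompt)  # one image per (model, prompt) pair
            (out / f"{model}_{i:03d}.png").write_bytes(image_bytes)
    # 13 models x 200 prompts = 2,600 images in total
```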
3. Human Annotation and Consensus Aggregation
To establish an authoritative performance reference, LPG-Bench incorporates a large-scale human ranking protocol:
- 15 annotators apply detailed guidelines for pairwise model comparison on prompt adherence, with the option to mark ties.
- Clear consensus (e.g., ≥10 votes) determines the superior generated image; ambiguous cases invoke expert panel review.
- From the 200 prompts and the 2,600 associated images, 12,832 non-tied comparisons are compiled as the “gold standard”.
The systematic aggregation of expert judgments provides robust ground truth for evaluation metric benchmarking and reliably captures the nuanced aspects of long-prompt compliance.
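A minimal sketch of the consensus rule described above; the ≥10-vote cutoff follows the example in the text, while the per-comparison data layout is an assumption for illustration.

```python
from dataclasses import dataclass

CONSENSUS_THRESHOLD = 10  # clear-majority cutoff among the 15 annotators

@dataclass
class Comparison:
    prompt_id: int
    model_a: str
    model_b: str
    votes_a: int
    votes_b: int
    votes_tie: int

def aggregate(comparisons: list[Comparison]) -> list[tuple[Comparison, str]]:
    """Return non-tied gold-standard pairs together with the winning side."""
    gold = []
    for c in comparisons:
        if c.votes_a >= CONSENSUS_THRESHOLD:
            gold.append((c, "a"))
        elif c.votes_b >= CONSENSUS_THRESHOLD:
            gold.append((c, "b"))
        else:
            # ambiguous case: deferred to expert panel review (not modeled here)
            continue
    return gold
```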
4. TIT-Score: Text-to-Image-to-Text Consistency Metric
To address the inadequacy of existing metrics such as CLIP-Score and LMM-score on long prompts, LPG-Bench introduces TIT-Score. The approach is predicated on semantic consistency between the prompt and the image as interpreted by a vision-language model (VLM):
- Visual-to-Text Captioning: A selected VLM (e.g., BLIP2, Qwen-VL) generates a rich caption $T_{\text{gen}}$ describing the image $I$.
- Semantic Alignment Assessment:
  - Standard TIT-Score: An embedding model $E$ (e.g., Qwen3-Embedding) encodes both the original prompt $P$ and $T_{\text{gen}}$; cosine similarity computes the alignment: $\text{TIT-Score}(P, I) = \cos\big(E(P), E(T_{\text{gen}})\big)$.
  - TIT-Score-LLM: A powerful LLM (e.g., Gemini 2.5 Pro) compares $P$ and $T_{\text{gen}}$ directly for semantic similarity, without recourse to embedding compression.
Both variants are zero-shot methods, obviating the need for fine-tuning on annotated data.
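A minimal sketch of the standard TIT-Score computation under the notation above, assuming hypothetical `caption(image)` and `embed(text)` helpers that wrap a VLM and an embedding model respectively; the helpers and model choices are placeholders, not the released implementation.

```python
import numpy as np

def caption(image) -> str:
    """Hypothetical VLM call (e.g., a BLIP2- or Qwen-VL-style captioner) producing T_gen."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Hypothetical text-embedding call (e.g., a Qwen3-Embedding-style encoder)."""
    raise NotImplementedError

def tit_score(prompt: str, image) -> float:
    t_gen = caption(image)                  # image -> descriptive caption T_gen
    e_p, e_t = embed(prompt), embed(t_gen)  # encode prompt P and caption T_gen
    # cosine similarity between the two embeddings
    return float(e_p @ e_t / (np.linalg.norm(e_p) * np.linalg.norm(e_t)))

# TIT-Score-LLM would instead pass (prompt, t_gen) to a strong LLM judge and
# ask it to rate semantic similarity directly, skipping the embedding step.
```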
5. Comparative Experimental Evaluation and Metric Analysis
LPG-Bench's empirical protocol encompasses the following components:
- Pairwise Accuracy: The proportion of image pairs correctly rank-ordered versus human annotator consensus.
- Correlation Metrics: Spearman's rank correlation coefficient (SRCC), Kendall's rank correlation coefficient (KRCC), and normalized discounted cumulative gain (nDCG) for agreement with the human-derived model leaderboard. A minimal computation sketch follows this list.
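The sketch below shows how these agreement measures could be computed against the human gold standard using SciPy's rank-correlation functions; the score dictionary and ranking-list layouts are assumptions, and nDCG is omitted for brevity.

```python
from scipy.stats import spearmanr, kendalltau

def pairwise_accuracy(metric_scores: dict, gold_pairs: list[tuple]) -> float:
    """Fraction of human-preferred pairs that the metric orders the same way.
    gold_pairs holds (winner_image_id, loser_image_id) tuples from annotator consensus."""
    correct = sum(metric_scores[winner] > metric_scores[loser] for winner, loser in gold_pairs)
    return correct / len(gold_pairs)

def leaderboard_agreement(metric_ranking: list[float], human_ranking: list[float]) -> tuple[float, float]:
    srcc, _ = spearmanr(metric_ranking, human_ranking)   # Spearman rank correlation (SRCC)
    krcc, _ = kendalltau(metric_ranking, human_ranking)  # Kendall's tau (KRCC)
    return srcc, krcc
```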
Key findings include:
- Existing automated metrics (CLIP-Score, BLIPv2-Score) demonstrate poor consistency with human judgment in the long-prompt regime.
- The best baseline, LMM4LMM, is outperformed by TIT-Score-LLM with a 7.31% absolute improvement in pairwise accuracy (TIT-Score-LLM achieves 66.51% pairwise accuracy).
- TIT-Score correlates strongly with human leaderboards (SRCC ≈ 0.929), evidencing superior stability and fidelity on complex inputs.
6. Implications for Future Text-to-Image Systems and Benchmarking
The enhanced alignment capabilities of TIT-Score-LLM and the broad coverage of LPG-Bench have immediate and long-term implications:
- Decoupled visual and textual alignment evaluation is validated as an effective approach, especially as prompt length and compositional complexity increase.
- T2I models persistently struggle with long-form instruction adherence, motivating further research into both prompt-parsing and image synthesis mechanisms.
- LPG-Bench and TIT-Score provide reliable public resources for iterative model evaluation, targeted fine-tuning, and rapid progress tracking within the multimodal domain.
7. Resource Availability and Benchmarking Utility
All data—including the long-prompt corpus, image sets from multiple architectures, annotator rankings, and TIT-Score implementations—are designated for public release. This enables reproducibility and comparability across new T2I system designs and supports the evolution of alignment-sensitive multimodal benchmarks.
LPG-Bench sets a new technical standard for evaluating and comparing model performance on rich, realistic, and instructionally challenging input, with its associated TIT-Score methodology offering precise, scalable, and human-consistent assessment in the contemporary research landscape (Wang et al., 3 Oct 2025).