LPG-Bench: Long-Prompt Benchmark for T2I Models

Updated 6 October 2025
  • LPG-Bench is a long-prompt benchmark that assesses text-to-image models using 200 prompts of over 250 words each, paired with detailed human annotations.
  • It employs a multi-phase methodology to generate diverse, coherent, and compositionally rich prompts from 500 candidate themes refined for narrative consistency.
  • The framework introduces TIT-Score, a novel metric measuring semantic alignment between long textual inputs and generated images, significantly outperforming prior metrics.

LPG-Bench is a long-prompt benchmark and evaluation framework designed to rigorously assess how well state-of-the-art text-to-image (T2I) models align with rich, detailed, and complex input instructions. Developed to address the limitations of existing short-prompt-centric benchmarks, LPG-Bench comprises a curated dataset of 200 prompts, each exceeding 250 words, paired with comprehensive human annotations and a novel evaluation metric, TIT-Score. The methods and results underpinning LPG-Bench reveal critical insights into the fidelity of image generation under extended textual inputs and offer meaningful advances in measurement methodology for the multimodal research community (Wang et al., 3 Oct 2025).

1. Construction of Long-Prompt Benchmark Dataset

LPG-Bench's dataset generation process employs a multi-phase methodology focused on maximizing prompt diversity and richness of detail:

  • An initial set of 500 thematic candidates is generated via an LLM, such as Gemini 2.5 Pro.
  • Manual screening selects 200 core themes to ensure coverage of distinct subjects.
  • Each theme is expanded to satisfy a strict minimum word count (≥ 250 words), leveraging LLM-driven instructions for visual, stylistic, and compositional detail.
  • Final prompts are manually refined to maintain narrative coherence and logical consistency.

This pipeline guarantees that prompts closely approach the input length constraints of commercial T2I systems and comprehensively stress-test the models’ ability to resolve multi-faceted, compositional instructions.
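
The following minimal sketch illustrates this pipeline under stated assumptions: `llm_generate` is a hypothetical wrapper for an LLM API (the paper uses Gemini 2.5 Pro), and the manual curation phases, which are human steps in the actual benchmark, are represented here by placeholders.

```python
# Minimal sketch of the LPG-Bench prompt-construction pipeline.
# `llm_generate` is a hypothetical helper standing in for an LLM API call;
# manual screening and refinement are human steps, shown as placeholders.

def llm_generate(instruction: str) -> str:
    """Hypothetical wrapper around an LLM API (e.g., Gemini 2.5 Pro)."""
    raise NotImplementedError

def build_prompts(n_themes: int = 500, n_selected: int = 200,
                  min_words: int = 250) -> list[str]:
    # Phase 1: generate a large pool of candidate themes.
    themes = [llm_generate("Propose one distinct visual theme.")
              for _ in range(n_themes)]

    # Phase 2: manual screening selects 200 core themes covering distinct
    # subjects (a human curation step, not a simple slice as shown here).
    core_themes = themes[:n_selected]

    # Phase 3: expand each theme until it meets the minimum word count,
    # asking for visual, stylistic, and compositional detail.
    prompts = []
    for theme in core_themes:
        prompt = llm_generate(
            f"Expand the theme '{theme}' into one coherent image "
            f"description of at least {min_words} words, covering "
            "subjects, style, lighting, and composition.")
        while len(prompt.split()) < min_words:
            prompt = llm_generate(f"Add more visual detail: {prompt}")
        # Phase 4: manual refinement for narrative coherence (human step).
        prompts.append(prompt)
    return prompts
```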

2. Image Generation Protocol Across Multiple Model Architectures

Using the LPG-Bench prompt suite, each of 13 distinct state-of-the-art T2I models renders all 200 prompts, yielding 2,600 images in total. Represented architectures include diffusion, autoregressive, flow-based, and unified multimodal systems, spanning both closed-source commercial and open-source frameworks. This cross-section enables broad generalization of results and thorough coverage of contemporary T2I design paradigms.
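
A brief sketch of the generation loop clarifies the bookkeeping (13 models × 200 prompts = 2,600 images). The `models` mapping and the per-model `generate` callables are hypothetical stand-ins, not the paper's actual harness.

```python
# Sketch of the image-generation protocol: each of the 13 T2I models renders
# every one of the 200 long prompts, giving 13 * 200 = 2,600 images.
# `models` maps a model name to a hypothetical callable: prompt -> image
# object with a PIL-style .save() method.

from pathlib import Path

def generate_all(models: dict, prompts: list[str], out_dir: str = "images"):
    out = Path(out_dir)
    for model_name, generate in models.items():
        model_dir = out / model_name
        model_dir.mkdir(parents=True, exist_ok=True)
        for i, prompt in enumerate(prompts):
            image = generate(prompt)  # one T2I inference call
            image.save(model_dir / f"prompt_{i:03d}.png")
```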

3. Human Annotation and Consensus Aggregation

To establish an authoritative performance reference, LPG-Bench incorporates a large-scale human ranking protocol:

  • 15 annotators apply detailed guidelines for pairwise model comparison on prompt adherence, with the option to mark ties.
  • Clear consensus (e.g., ≥10 votes) determines the superior generated image; ambiguous cases invoke expert panel review.
  • From the 200 prompts and 2,600 associated images, 12,832 non-tied comparisons are compiled as the “gold standard”.

The systematic aggregation of expert judgments provides robust ground truth for evaluation metric benchmarking and reliably captures the nuanced aspects of long-prompt compliance.
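
The consensus rule can be made concrete with a short sketch. The vote labels, the `threshold` default, and the handling of consensus ties are assumptions consistent with the protocol described above, not the authors' released code.

```python
# Sketch of consensus aggregation over pairwise annotations. Each comparison
# collects 15 votes: "a", "b", or "tie". A clear majority (>= 10 votes for
# one image) yields a gold-standard pair; consensus ties are excluded (only
# non-tied comparisons enter the gold standard); everything else is
# escalated to the expert panel.

from collections import Counter

def aggregate(comparisons, threshold: int = 10):
    gold, escalated = [], []
    for pair_id, votes in comparisons:  # votes: list of 15 labels
        winner, top = Counter(votes).most_common(1)[0]
        if top >= threshold:
            if winner == "tie":
                continue                 # consensus tie: excluded from gold
            gold.append((pair_id, winner))
        else:
            escalated.append(pair_id)    # ambiguous: expert panel review
    return gold, escalated
```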

4. TIT-Score: Text-to-Image-to-Text Consistency Metric

To address the inadequacy of existing metrics such as CLIP-score and LMM-score on long prompts, LPG-Bench introduces TIT-Score. The approach is predicated on semantic consistency between the prompt and the image as interpreted by a vision-language model (VLM):

  1. Visual-to-Text Captioning: A selected VLM (e.g., BLIP2, Qwen-VL) generates a rich caption $C_\text{caption}$ describing the image $I$.
  2. Semantic Alignment Assessment:

    • Standard TIT-Score: Embedding models (e.g., Qwen3-Embedding) encode the original prompt $P$ and the caption $C_\text{caption}$ into vectors $\mathbf{V}_{\text{prompt}}$ and $\mathbf{V}_{\text{caption}}$. Cosine similarity computes the alignment:

      $$\text{TIT-Score} = \cos\theta = \frac{\mathbf{V}_{\text{prompt}} \cdot \mathbf{V}_{\text{caption}}}{\|\mathbf{V}_{\text{prompt}}\| \, \|\mathbf{V}_{\text{caption}}\|}$$

    • TIT-Score-LLM: A powerful LLM (e.g., Gemini 2.5 Pro) compares $P$ and $C_\text{caption}$ directly for semantic similarity, without recourse to embedding compression.

Both variants are zero-shot methods, obviating the need for fine-tuning on annotated data.
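
A minimal sketch of both variants follows. Here `caption_image`, `embed_text`, and `llm_similarity` are hypothetical wrappers standing in for the VLM captioner (e.g., Qwen-VL), the embedding model (e.g., Qwen3-Embedding), and the judging LLM (e.g., Gemini 2.5 Pro), respectively.

```python
# Sketch of the two TIT-Score variants. All three model calls are
# hypothetical wrappers; the actual models are swappable components.

import numpy as np

def caption_image(image) -> str:
    """Hypothetical VLM call producing a detailed caption of the image."""
    raise NotImplementedError

def embed_text(text: str) -> np.ndarray:
    """Hypothetical text-embedding call (e.g., Qwen3-Embedding)."""
    raise NotImplementedError

def llm_similarity(prompt: str, caption: str) -> float:
    """Hypothetical LLM-as-judge call (e.g., Gemini 2.5 Pro) rating the
    semantic similarity of the two texts."""
    raise NotImplementedError

def tit_score(prompt: str, image) -> float:
    """Standard TIT-Score: cosine similarity of prompt/caption embeddings."""
    caption = caption_image(image)  # visual-to-text step
    v_p, v_c = embed_text(prompt), embed_text(caption)
    return float(v_p @ v_c / (np.linalg.norm(v_p) * np.linalg.norm(v_c)))

def tit_score_llm(prompt: str, image) -> float:
    """TIT-Score-LLM: an LLM judges prompt-caption similarity directly,
    avoiding lossy embedding compression."""
    return llm_similarity(prompt, caption_image(image))
```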

5. Comparative Experimental Evaluation and Metric Analysis

LPG-Bench's empirical protocol encompasses the following components:

  • Pairwise Accuracy: The proportion of image pairs correctly rank-ordered versus human annotator consensus.
  • Correlation Metrics: Spearman's rank correlation (SRCC), Kendall's tau (KRCC), and nDCG for leaderboard alignment; a sketch of these computations appears below.
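
The sketch below shows how these quantities might be computed, assuming higher metric scores indicate better images. The data layouts (`gold_pairs`, `metric_scores`) are illustrative; the correlations use standard `scipy.stats` functions.

```python
# Sketch of the metric-evaluation protocol: pairwise accuracy against human
# consensus, plus rank correlations (SRCC/KRCC) between a metric's model
# leaderboard and the human leaderboard. nDCG is omitted for brevity
# (sklearn.metrics.ndcg_score provides an implementation).

from scipy.stats import spearmanr, kendalltau

def pairwise_accuracy(metric_scores: dict, gold_pairs: list) -> float:
    """gold_pairs: list of (img_a, img_b, winner) with winner in {'a', 'b'};
    metric_scores: dict mapping image id -> metric score."""
    correct = sum(
        (metric_scores[a] > metric_scores[b]) == (winner == "a")
        for a, b, winner in gold_pairs)
    return correct / len(gold_pairs)

def rank_correlations(metric_scores_by_model, human_scores_by_model):
    """Both arguments: per-model scores in the same model order."""
    srcc, _ = spearmanr(metric_scores_by_model, human_scores_by_model)
    krcc, _ = kendalltau(metric_scores_by_model, human_scores_by_model)
    return srcc, krcc
```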

Key findings include:

  • Existing automated metrics (CLIP-Score, BLIPv2-Score) demonstrate poor consistency with human judgment in the long-prompt regime.
  • TIT-Score-LLM outperforms the best baseline, LMM4LMM, by 7.31% absolute in pairwise accuracy, reaching 66.51%.
  • TIT-Score correlates strongly with human leaderboards (SRCC ≈ 0.929), evidencing superior stability and fidelity on complex inputs.

6. Implications for Future Text-to-Image Systems and Benchmarking

The enhanced alignment capabilities of TIT-Score-LLM and the broad coverage of LPG-Bench have immediate and long-term implications:

  • Decoupled visual and textual alignment evaluation is validated as an effective approach, especially as prompt length and compositional complexity increase.
  • T2I models persistently struggle with long-form instruction adherence, motivating further research into both prompt-parsing and image synthesis mechanisms.
  • LPG-Bench and TIT-Score provide reliable public resources for iterative model evaluation, targeted fine-tuning, and rapid progress tracking within the multimodal domain.

7. Resource Availability and Benchmarking Utility

All data—including the long-prompt corpus, image sets from multiple architectures, annotator rankings, and TIT-Score implementations—are designated for public release. This enables reproducibility and comparability across new T2I system designs and supports the evolution of alignment-sensitive multimodal benchmarks.

LPG-Bench sets a new technical standard for evaluating and comparing model performance on rich, realistic, and instructionally challenging input, with its associated TIT-Score methodology offering precise, scalable, and human-consistent assessment in the contemporary research landscape (Wang et al., 3 Oct 2025).
