LGBench: Multi-Goal Eval for Gen Models

Updated 29 December 2025
  • LGBench is a large-scale evaluation suite featuring 2,000 tasks with 18-22 interdependent goals per prompt, designed to test real-world multi-goal design challenges.
  • It exposes that even advanced generative models satisfy fewer than 72% of precise objectives, highlighting critical weaknesses in handling localized edits like text and object placement.
  • The benchmark integrates automated prompt curation, structured annotation, and a closed-loop refinement agent (VisionDirector) with GRPO, yielding measurable performance gains.

Long Goal Bench (LGBench) is a comprehensive, large-scale evaluation suite specifically designed to expose limitations in generative models when subjected to long, tightly coupled, multi-goal design prompts such as those issued by professional artists and designers. It contains 2,000 tasks—1,000 text-to-image (T2I) and 1,000 image-to-image (I2I)—each with complex instructions averaging 18 to 22 interdependent goals that span global layout, local object editing, typography, logo placement, and fine-grained visual fidelity. Even the most advanced models consistently satisfy fewer than 72% of these goals, with typical failures concentrated on precise or localized edits, demonstrating the brittleness of current one-shot pipelines (Chu et al., 22 Dec 2025).

1. Motivation and Problem Definition

State-of-the-art diffusion-based generative models excel at photorealism and aesthetics but are fundamentally limited when processing real-world, multi-part design briefs. Existing benchmarks such as DrawBench, TIFA, and MagicBrush include at most one or two goals per prompt, obscuring model weaknesses in handling instruction sequences with high goal density and mutual dependence. LGBench systematically raises task complexity, with professional-style prompts containing 10–23 detailed objectives—spanning requirements for composition, color harmonization, precise text, effect overlays, lighting conditions, logo integration, and complex pose arrangements. Data analysis on LGBench reveals that leading models miss localized goals at a high rate (goal coverage ≤72%), particularly for object placement, text, and lighting (Chu et al., 22 Dec 2025).

2. Construction of LGBench

LGBench construction combines automated, LLM-based prompt curation with structured annotation. The T2I subset comprises 1,000 prompts drawn from 200 high-level categories and 418 subcategories, with an average of 18.0 goals per prompt (18,035 total goals). The I2I subset includes 1,000 Flux-Krea reference images spanning 29 coarse classes and 710 subcategories, with an average of 11.2 edit directives per image (11,217 total goals). Prompts are composed via Claude 4.5, and each goal is annotated with an explicit type and strength. For evaluation, Qwen3-VL-32B serves as an automated verifier with a minimum confidence threshold of 0.81 (Chu et al., 22 Dec 2025).

Key statistics:

| Subset | #Tasks | Avg. Goals/Task | Total Goals | Categories/Subcategories |
|--------|--------|-----------------|-------------|--------------------------|
| T2I    | 1,000  | 18.0            | 18,035      | 200/418                  |
| I2I    | 1,000  | 11.2            | 11,217      | 29/710                   |

Tasks require coordinated fulfillment of global directives (e.g., composition, lighting) and detailed local objectives (e.g., “Add ‘SALE’ in bold sans-serif at bottom right”, “Position logo below lantern with 10 px margin”), mirroring real-world design workflows.
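
The paper's exact annotation format is not reproduced here; as a purely illustrative sketch, a task record with typed, strength-tagged goals might be represented as follows (all class and field names are assumptions, not the benchmark's actual schema).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Goal:
    """One annotated objective within a task (hypothetical schema)."""
    description: str       # e.g. "Add 'SALE' in bold sans-serif at bottom right"
    goal_type: str         # e.g. "typography", "object_placement", "lighting"
    strength: str          # annotated goal strength, e.g. "strict" or "soft"
    is_local: bool = True  # localized edit vs. global directive

@dataclass
class Task:
    """One T2I or I2I task: a prompt plus its interdependent goals."""
    task_id: str
    subset: str                            # "T2I" or "I2I"
    prompt: str
    goals: List[Goal] = field(default_factory=list)
    reference_image: Optional[str] = None  # path to a Flux-Krea reference (I2I only)

# Hypothetical example built from the directives quoted above.
example = Task(
    task_id="t2i_0001",
    subset="T2I",
    prompt="Night-market poster with a lantern, a logo, and a SALE banner ...",
    goals=[
        Goal("Add 'SALE' in bold sans-serif at bottom right", "typography", "strict"),
        Goal("Position logo below lantern with 10 px margin", "object_placement", "strict"),
        Goal("Warm tungsten lighting across the scene", "lighting", "soft", is_local=False),
    ],
)
```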

3. Evaluation Metrics and Benchmarking Protocol

LGBench adopts a suite of granular success metrics reflecting both per-goal and per-task performance (a minimal scoring sketch follows the list):

  • Per-goal success rate: Fraction of goals per task verified as "pass" by the automated verifier.
  • Task-level Finish: Fraction of tasks in which at least 80% of goals are satisfied.
  • GenEval (T2I): Aggregates six submetrics—single-object accuracy, two-object composition, counting, color fidelity, positional alignment, and attribute binding.
  • ImgEdit (I2I): Averages qualitative scores (on a 1–5 scale) from GPT-4.1 over nine image-editing primitives.
  • Efficiency: Mean number of edit rounds (diffusion calls) required until the task terminates.
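
As a minimal sketch of how these statistics can be aggregated from per-goal verifier verdicts (assuming each verdict is a boolean; the function names and data layout are illustrative, not the benchmark's code):

```python
from statistics import mean
from typing import Dict, List

def per_goal_success(verdicts: List[bool]) -> float:
    """Fraction of a task's goals that the automated verifier marks as 'pass'."""
    return sum(verdicts) / len(verdicts)

def task_finish(verdicts: List[bool], threshold: float = 0.80) -> bool:
    """Task-level Finish: at least 80% of the task's goals are satisfied."""
    return per_goal_success(verdicts) >= threshold

def benchmark_summary(results: Dict[str, List[bool]],
                      edit_rounds: Dict[str, int]) -> Dict[str, float]:
    """Aggregate LGBench-style metrics.

    `results` maps task_id -> per-goal pass/fail verdicts;
    `edit_rounds` maps task_id -> number of diffusion calls used.
    """
    return {
        "goal_coverage": mean(per_goal_success(v) for v in results.values()),
        "finish_rate": mean(task_finish(v) for v in results.values()),
        "mean_edit_rounds": mean(edit_rounds[t] for t in results),
    }
```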

Peer comparison on LGBench, GenEval, and ImgEdit shows that even the best baseline models plateau below 72% goal coverage, confirming persistent performance bottlenecks in handling coupled directives. VisionDirector, a modular, closed-loop refinement agent, raises goal satisfaction by 3–22% across backbones and yields especially large improvements in text, additive object, and lighting goal categories (+7.5–28.7%, +2.5–25.3%, +1.5–19.4%, respectively) (Chu et al., 22 Dec 2025).

4. VisionDirector: Architecture and Algorithmic Components

Built to address the brittleness exposed by LGBench, VisionDirector introduces a closed-loop, modular agent architecture that layers vision–language-driven planning on top of pretrained diffusion models. It is structured as follows:

  1. Structured Goal Extraction: Uses Qwen3-VL-8B to convert instructions into a pending set of goals, each tagged by type, conflict potential, and estimated one-shot feasibility.
  2. Dynamic Decision-Making: The planner selects between
    • One-shot (joint) generation (if summed feasibility is high and area impact limited), or
    • Staged micro-edits by grouping 1–2 goals, proceeding from global to local.
  3. Micro-Grid Sampling with Semantic Verification and Rollback:
    • For each batch, N candidates are generated with the editor.
    • A verification VLM scores candidates for goal satisfaction, selecting the best image.
    • If net alignment degrades, the previous best image is restored and conflicting goals are rescheduled.
  4. Goal-Level Reward Logging: Each iteration logs binary or graded rewards per goal, facilitating policy optimization and detailed auditing.

Pseudocode for the top-level controller is provided in (Chu et al., 22 Dec 2025); the numbered steps above give the stepwise breakdown, and a hedged sketch of such a loop follows.
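
The sketch below illustrates the shape of a plan-edit-verify-rollback loop of this kind. It is not the authors' implementation: the planner, editor, and verifier are passed in as opaque callables, and all names and signatures are assumptions.

```python
from typing import Callable, Dict, List

def vision_director_loop(
    instruction: str,
    image,
    extract_goals: Callable,  # VLM planner: instruction -> list of goal strings
    plan_batch: Callable,     # planner: unmet goals -> next 1-2 goals to edit
    edit: Callable,           # diffusion editor: (image, goal batch) -> candidate image
    verify: Callable,         # VLM verifier: (image, goals) -> {goal: bool}
    n_candidates: int = 4,
    max_rounds: int = 8,
):
    """Closed-loop refinement sketch: plan, micro-edit, verify, roll back on regression."""
    goals = extract_goals(instruction)
    best_image = image
    best_passed: Dict[str, bool] = verify(best_image, goals)
    reward_log: List[Dict[str, bool]] = []

    for _ in range(max_rounds):
        unmet = [g for g, ok in best_passed.items() if not ok]
        if not unmet:
            break                                    # every goal satisfied
        batch = plan_batch(unmet)                    # group 1-2 goals, global before local

        # Micro-grid sampling: several candidates for this small goal batch.
        candidates = [edit(best_image, batch) for _ in range(n_candidates)]
        scored = [(verify(c, goals), c) for c in candidates]
        passed, candidate = max(scored, key=lambda sc: sum(sc[0].values()))

        if sum(passed.values()) >= sum(best_passed.values()):
            best_image, best_passed = candidate, passed   # accept the improvement
        # else: implicit rollback -- keep the previous best; the goals stay unmet

        reward_log.append({g: passed[g] for g in batch})  # goal-level reward logging
    return best_image, reward_log
```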

5. Group Relative Policy Optimization (GRPO)

To optimize the planner for shorter edit trajectories and higher fidelity, VisionDirector leverages Group Relative Policy Optimization (GRPO), a PPO-style objective adapted for multi-goal refinement:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{x,\{y^{(i)}\}_{i=1}^{G}}\Biggl[\frac{1}{G}\sum_{i=1}^{G} \frac{1}{\sum_{t} I(y_t^{(i)})} \sum_{t} I(y_t^{(i)})\, \mathcal{L}_{\mathrm{clip}}\bigl(\rho_t^{(i)},\hat{A}_t^{(i)}\bigr) \;-\; \beta\,\mathrm{KL}\bigl(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\bigr)\Biggr]$$

Here, $y^{(i)}$ is the $i$-th sampled action sequence, $I(y_t)$ masks out non-planner tokens, $\rho_t^{(i)}$ is the importance ratio, $\hat{A}_t^{(i)}$ is the group-normalized advantage, and $\beta$ regularizes toward the reference policy.
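
A hedged PyTorch rendering of this objective (a sketch under assumed tensor shapes, not the authors' training code) might look as follows; `logp_new`, `logp_old`, and `logp_ref` are per-token log-probabilities under the current, behavior, and reference policies, and `planner_mask` plays the role of $I(y_t)$.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, planner_mask,
              clip_eps=0.2, beta=0.01):
    """Sketch of the GRPO objective for one prompt (shapes are assumptions).

    logp_new / logp_old / logp_ref and planner_mask: [G, T] over G sampled
    trajectories and T tokens; rewards: [G], one aggregated VLM-alignment
    reward per trajectory.
    """
    # Group-normalized advantage: each trajectory is compared against its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # [G]
    adv = adv.unsqueeze(1)                                      # broadcast over tokens

    # Importance ratio rho_t and the PPO-style clipped surrogate L_clip.
    ratio = torch.exp(logp_new - logp_old)                      # [G, T]
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.minimum(ratio * adv, clipped * adv)       # [G, T]

    # Token masking I(y_t): average only over planner-emitted tokens.
    mask = planner_mask.float()
    per_traj = (surrogate * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # Simple per-token KL estimate toward the reference policy.
    kl = ((logp_new - logp_ref) * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)

    # The objective is maximized; return its negative as a loss to minimize.
    return -(per_traj.mean() - beta * kl.mean())
```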

Advantages versus standard PPO:

  • Group normalization—encourages concise, high-reward edit policies by comparing across multiple trajectories.
  • Token masking—ensures gradients are constrained to planner outputs only.
  • Dense VLM-alignment reward—provides per-goal feedback with a 0–5 scale for superior supervision.

Fine-tuning with GRPO reduces the median edit rounds by 26% (4.2 to 3.1), increases per-task goal coverage (from 0.74 to 0.78), and reduces mean diffusion calls (3.3 to 2.5) (Chu et al., 22 Dec 2025).

6. Results, Insights, and Limitations

Quantitative Gains

  • GenEval: VisionDirector improves overall score to 0.94 (vs. 0.87), with marked advances in relative position (0.76 → 0.88) and attribute binding (0.77 → 0.95).
  • ImgEdit: Improves to 4.35 from 4.27, outperforming all open-source baselines and matching most closed-source ones on typography, hybrid, and action edits.
  • On LGBench: Raises goal coverage by up to 22% across several backbones.

Qualitative Improvements

  • Accurate typography on complex backgrounds, precise multi-object scene management, and fine-grained pose edits.

Limitations

  • Verifier reliability: Depends on the VLM's ability to parse subtle cues; goal verdicts with confidence below 0.81 are dropped, possibly missing rare failures.
  • Latency: Micro-grid sampling and VLM verification introduce runtime overhead versus single-shot pipelines.
  • Human-in-the-loop: Subjective or stylistic judgments may still require human override, despite logging full goal-level traces.

Potential Extensions

  • Verifier ensembles combining VLMs with OCR or geometry checkers (illustrated in the sketch after this list).
  • Human–AI collaborative loops by allowing mid-process interjection.
  • Extension to temporal (video) and spatial (3D) asset editing, introducing challenges in consistency across frames and space.
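
As a purely hypothetical illustration of the first extension, an ensemble verifier could gate the VLM verdict with a specialist check for goal types the VLM handles poorly; every name below is illustrative, and `vlm_check` and `ocr_check` are assumed callables.

```python
def ensemble_verify(image, goal, vlm_check, ocr_check, confidence_floor=0.81):
    """Hypothetical ensemble verifier: a VLM verdict cross-checked by a specialist.

    vlm_check(image, goal) -> (passed: bool, confidence: float)
    ocr_check(image, goal) -> bool, consulted only for typography goals.
    Returns True/False, or None when the verdict is dropped for low confidence
    (mirroring the 0.81 confidence floor used for the benchmark's verifier).
    """
    passed, confidence = vlm_check(image, goal)
    if confidence < confidence_floor:
        return None                                  # dropped: too uncertain to score
    if goal.goal_type == "typography":
        return passed and ocr_check(image, goal)     # both checks must agree
    return passed
```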

7. Implications and Research Directions

LGBench exposes a critical gap between current generative models' capabilities and the requirements for real-world, professional multi-goal synthesis. VisionDirector’s modular, VLM-driven supervisory architecture—augmented with closed-loop micro-editing, rollback, and granular reward signals—demonstrates that bridging this gap is possible by integrating structured planning and hierarchical policy optimization on top of pretrained diffusion editors.

A plausible implication is that future benchmarks and model architectures must simultaneously address scaling to many interdependent goals, granular semantic verification, and efficient trajectory optimization in complex, high-dimensional editing spaces. The LGBench benchmark serves as a standard for comprehensive model evaluation in this domain, and any claims of real-world generative intelligence should demonstrate robust performance on such multi-goal task suites (Chu et al., 22 Dec 2025).
