Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

Published 4 Jun 2026 in cs.CV | (2606.06113v1)

Abstract: Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-LLM (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.


Authors (10)

Summary

The paper presents a novel instance-level structured defect grounding method that represents text-to-image failures as precise quartets of location, type, reason, and importance.
It introduces the SDG-30K dataset and a dedicated evaluation protocol, achieving competitive performance in defect localization and semantic alignment compared to traditional heatmap methods.
The framework enhances downstream applications by enabling refined reward shaping in diffusion models and guiding defect-specific image improvements.

Structured Defect Grounding for Text-to-Image Feedback: Technical Overview and Implications

Motivation and Context

Text-to-image (T2I) models have advanced in photorealism, yet they still produce localized, subtle, and semantically entangled defects that global or pixel-wise supervision cannot adequately characterize. Prevailing scalar and dense-heatmap feedback paradigms, such as those in RichHF and ImageDoctor, reduce complex defects to coarse spatial or aggregate signals and do not tie a specific error type to its semantic context and precise location. This disconnect impedes principled model alignment, granular evaluation, and actionable refinement of generative systems.

The "Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback" paper (SDG) (2606.06113) addresses this representational bottleneck by reframing T2I defect diagnosis as instance-level structured set prediction, whereby each defect is expressed as a quadruple: location (bounding box), type (artifact or misalignment), natural-language reason, and importance score.

Structured Defect Grounding Formulation

SDG introduces a unified representation for T2I failures, encoding each defect $i$ as a tuple $(b_i, t_i, r_i, s_i)$, where $b_i$ is a quantized bounding box, $t_i$ is a categorical defect type (artifact or misalignment), $r_i$ is a free-form natural-language reason, and $s_i$ is an integer importance score reflecting perceptual and semantic impact. Because the number of tuples varies per image, the diagnosis is a set rather than a fixed-size map, which enables both fine-grained attribution of failures and prioritized, context-aware evaluation of complex generative errors.
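The tuple above can be written down as a small data structure. The following is a minimal sketch, not the paper's schema: the 0..1000 quantization grid, the 1..5 importance range, and the names `Defect` and `quantize_box` are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

GRID = 1000  # assumed quantization grid; the source only says "quantized"
DEFECT_TYPES = ("artifact", "misalignment")  # the two types named in the text
MAX_IMPORTANCE = 5  # assumed upper end of the integer importance scale


@dataclass(frozen=True)
class Defect:
    """One (location, type, reason, importance) tuple."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) on the 0..GRID grid
    type: str
    reason: str
    importance: int

    def __post_init__(self) -> None:
        x1, y1, x2, y2 = self.box
        if not (0 <= x1 < x2 <= GRID and 0 <= y1 < y2 <= GRID):
            raise ValueError(f"box out of range or degenerate: {self.box}")
        if self.type not in DEFECT_TYPES:
            raise ValueError(f"unknown defect type: {self.type}")
        if not 1 <= self.importance <= MAX_IMPORTANCE:
            raise ValueError(f"importance out of range: {self.importance}")


def quantize_box(box_px, width, height):
    """Map a pixel-space box to integer coordinates on the 0..GRID grid."""
    x1, y1, x2, y2 = box_px
    return (
        round(x1 / width * GRID),
        round(y1 / height * GRID),
        round(x2 / width * GRID),
        round(y2 / height * GRID),
    )


# A diagnosis is a variable-length list; an empty list means a clean image.
diagnosis = [
    Defect(
        box=quantize_box((128, 64, 256, 512), width=512, height=512),
        type="artifact",
        reason="The left hand has six fingers.",
        importance=3,
    )
]
```

Keeping the box in resolution-independent integer coordinates means the same tuple can be serialized as text for a language-model decoder and later rescaled to any image or latent size.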

Figure 1: Qualitative comparison between heatmap-based and SDG-style structured feedback, contrasting coarse artifact/misalignment maps with instance-level bounding boxes, explicit types, chain-of-thought traces, and importance scores.

SDG-30K Dataset and Evaluation Protocol

To enable systematic training and evaluation, the authors construct SDG-30K, a 30K-image dataset with human box-level defect annotations across four modern T2I generators. Each defect instance is labeled with a bounding box, a type, and a concise description; the descriptions are post-processed with Gemini 3 Pro to produce detailed English reasons and importance scores. The annotations cover both artifact and misalignment failures.

A dedicated evaluation protocol (SDG-Eval) is established:

Image-level metrics: Detection F1 for presence of each defect type, clean-image accuracy.
Defect-level metrics: Class-aware Hungarian matching between predicted and ground-truth instances yields [email protected]/0.5 (localization), DescCos (description similarity via Qwen embeddings), and ImpAcc (importance score accuracy).
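The class-aware matching step can be illustrated as follows. This is a sketch under stated assumptions: it finds the optimal one-to-one assignment (the result the Hungarian algorithm computes) by brute force over permutations, which is only practical for small defect sets, and it uses total IoU as the matching objective. The function names and the thresholded hit count are illustrative, not the SDG-Eval implementation.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_defects(preds, gts):
    """One-to-one matching that maximizes total IoU.

    preds and gts are lists of (box, type). Pairs with different types score 0,
    so they are never reported as matches (the "class-aware" part).
    Returns a list of (pred_index, gt_index, iou).
    """
    n, m = len(preds), len(gts)
    score = [[iou(p[0], g[0]) if p[1] == g[1] else 0.0 for g in gts] for p in preds]
    best_total, best_pairs = -1.0, []
    if n <= m:
        candidates = ([(i, perm[i]) for i in range(n)] for perm in permutations(range(m), n))
    else:
        candidates = ([(perm[j], j) for j in range(m)] for perm in permutations(range(n), m))
    for pairs in candidates:
        total = sum(score[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return [(i, j, score[i][j]) for i, j in best_pairs if score[i][j] > 0.0]


def hits_at(matches, threshold):
    """Number of matched pairs whose IoU reaches the threshold."""
    return sum(1 for _, _, value in matches if value >= threshold)


preds = [((0, 0, 10, 10), "artifact"), ((20, 20, 40, 40), "misalignment")]
gts = [((20, 20, 40, 30), "misalignment"), ((0, 0, 10, 20), "artifact")]
matches = match_defects(preds, gts)
```

Once pairs are fixed, the description and importance metrics can be computed only over matched pairs, so a prediction is never credited for describing a defect it did not localize.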

The dataset reveals nontrivial rates of both artifact and misalignment defects, underscoring the insufficiency of scalar feedback for characterizing present-day T2I failures.

Figure 2: SDG framework: dataset construction pipeline, two-stage detector training (SFT + GRPO), and downstream applications including diffusion reward shaping and defect-guided image refinement.

SDG Detector: Model Architecture and Optimization

The SDG detector operationalizes the framework by treating T2I defect localization as structured vision-language generation. The base model (Qwen3-VL-4B-Instruct) is fine-tuned in two stages:

Supervised Fine-Tuning (SFT): Trains the model to produce valid structured output, augmented with coordinate jitter for robustness.
Group Relative Policy Optimization (GRPO): Optimizes localization, description alignment, and importance estimation via a composite reward, format gating, and clipped likelihood objectives.
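The GRPO stage described above combines three ingredients: a composite reward, a format gate, and a clipped likelihood-ratio objective. The sketch below shows how these typically fit together. The reward weights, the hard zero for malformed outputs, and the clip range of 0.2 are assumptions chosen for illustration; the source does not state the actual values.

```python
from statistics import mean, pstdev


def composite_reward(format_ok, loc, desc, imp, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of localization, description, and importance scores.

    The format gate zeroes the reward when the output is not valid structured
    text. The weights are illustrative assumptions.
    """
    if not format_ok:
        return 0.0
    w_loc, w_desc, w_imp = weights
    return w_loc * loc + w_desc * desc + w_imp * imp


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_objective(ratio, advantage, clip=0.2):
    """PPO-style clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip), 1.0 + clip)
    return min(ratio * advantage, clipped_ratio * advantage)


# Four sampled outputs for one image; the third fails the format check.
rewards = [
    composite_reward(True, 0.6, 0.9, 1.0),
    composite_reward(True, 0.2, 0.8, 0.0),
    composite_reward(False, 0.9, 0.9, 1.0),
    composite_reward(True, 0.4, 0.9, 1.0),
]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, no separate value network is needed; a malformed sample simply ends up with the lowest advantage in its group.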

Crucially, the detector outputs both a reasoning trace (chain-of-thought for interpretability) and a defect set in JSON, natively compatible with autoregressive VLM decoders.
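To make the output contract concrete, the following sketch parses a response that contains a reasoning trace followed by a JSON defect list. The `<think>...</think>` wrapper, the key names, and the function `parse_detector_output` are hypothetical; the source only states that the detector emits a reasoning trace and a defect set in JSON.

```python
import json
import re

REQUIRED_KEYS = {"box", "type", "reason", "importance"}  # assumed key names
DEFECT_TYPES = ("artifact", "misalignment")
# Assumed layout: reasoning inside <think> tags, then a JSON list of defects.
PATTERN = re.compile(r"\s*<think>(.*?)</think>\s*(\[.*\])\s*", flags=re.DOTALL)


def parse_detector_output(text):
    """Return (reasoning, defects) or None when the output is malformed."""
    match = PATTERN.fullmatch(text)
    if match is None:
        return None
    try:
        defects = json.loads(match.group(2))
    except json.JSONDecodeError:
        return None
    if not isinstance(defects, list):
        return None
    for defect in defects:
        if not isinstance(defect, dict) or set(defect) != REQUIRED_KEYS:
            return None
        if defect["type"] not in DEFECT_TYPES:
            return None
        box = defect["box"]
        if not (isinstance(box, list) and len(box) == 4
                and all(isinstance(v, int) for v in box)):
            return None
        if not isinstance(defect["importance"], int):
            return None
    return match.group(1).strip(), defects


sample = (
    "<think>The prompt asks for a red car, but the car is blue.</think>\n"
    '[{"box": [100, 400, 700, 900], "type": "misalignment", '
    '"reason": "Car is blue instead of red.", "importance": 4}]'
)
parsed = parse_detector_output(sample)
```

A `None` result is the kind of event a format check can penalize during reinforcement learning, while an empty list after a valid reasoning trace represents a clean image.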

Quantitative and Qualitative Results

Numerical Performance: On SDG-30K, the SDG detector with GRPO achieves artifact/misalignment [email protected] scores of 0.263/0.387, close to the human upper bounds of 0.278/0.409 (roughly 95% of each). Description cosine and importance metrics are consistently high ($>0.88$). Notably, zero-shot GPT-5.4 and Gemini 3 Pro fall short by large margins in localization and misalignment recall, which supports the case for in-domain, structured supervision and optimization.

Cross-Dataset Generalization: When evaluated zero-shot on RichHF-18K, SDG yields misalignment F1 nearly 3x that of ImageDoctor, demonstrating superior transferability in capturing prompt-conditioned failures versus architectures centered on heatmap regression.

Figure 3: Qualitative comparison on SDG-30K between SDG, ImageDoctor, and ground-truth, illustrating precise instance-level grounding and type attribution.

Figure 4: Extended evaluation showing the fidelity of SDG in complex, heterogeneous defect scenarios.

Downstream Applications

Diffusion Model Alignment via BoxFlow-GRPO

SDG outputs serve as structured, importance-weighted spatial penalization maps within the BoxFlow-GRPO framework for diffusion model RL. Unlike previous approaches that modulate scalar rewards by predicted defect heatmaps, BoxFlow-GRPO constructs, for each latent location, reward signals proportional to the detected importance of artifact/misalignment boxes.
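One way to picture the box-derived spatial signal is to rasterize each predicted box onto the latent grid and accumulate its importance. The sketch below does this in plain Python. It is not the BoxFlow-GRPO implementation: the 0..1000 box grid, summing overlapping boxes, the per-type weights, and the linear `scale` factor are all assumptions made for illustration.

```python
def penalty_map(defects, latent_h, latent_w, grid=1000, type_weight=None):
    """Rasterize importance-weighted boxes onto a latent_h x latent_w grid.

    defects: list of (box, type, importance), box = (x1, y1, x2, y2) on 0..grid.
    A latent cell is penalized if the box overlaps it at all.
    """
    weights = type_weight or {"artifact": 1.0, "misalignment": 1.0}
    pmap = [[0.0] * latent_w for _ in range(latent_h)]
    for (x1, y1, x2, y2), dtype, importance in defects:
        col_start = x1 * latent_w // grid
        col_end = min(latent_w, -(-x2 * latent_w // grid))  # ceiling division
        row_start = y1 * latent_h // grid
        row_end = min(latent_h, -(-y2 * latent_h // grid))
        for r in range(row_start, row_end):
            for c in range(col_start, col_end):
                # Overlapping boxes add up (an assumption; max is another option).
                pmap[r][c] += weights[dtype] * importance
    return pmap


def spatial_reward(base_reward, pmap, scale=0.1):
    """Per-location reward: a scalar base reward minus the scaled penalty."""
    return [[base_reward - scale * p for p in row] for row in pmap]


defects = [((0, 0, 500, 500), "artifact", 4)]
pmap = penalty_map(defects, latent_h=4, latent_w=4)
reward = spatial_reward(1.0, pmap)
```

In this example only the top-left quadrant of the latent grid receives a reduced reward, so the update pressure concentrates on the defective region instead of being spread across the whole image.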

Empirical Results: BoxFlow-GRPO achieves the highest average relative improvement (+2.4%) across preference and quality benchmarks. It is also the only method that increases real-image likelihood, a dimension where scalar-masked and heatmap-driven baselines regress, which suggests that it mitigates reward hacking while preserving photographic realism.

Figure 5: Visual comparison illustrating the benefit of structured spatial reward on output fidelity and prompt alignment.

Figure 6: Extended evaluation of BoxFlow-GRPO on challenging prompts, demonstrating robust compositional and attribute faithfulness.

Beyond alignment, SDG supports semantically meaningful, localized image correction: the structured feedback is passed to GPT-Image-1.5 as editing guidance, and the resulting edits surpass both caption-only and heatmap/text-based editing in human GSB (Good/Same/Bad) preference rates.
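As an illustration of how a defect set might be turned into editing guidance, the sketch below composes a text instruction from the most important defects. The template wording, the `top_k` cutoff, and the function `build_edit_instruction` are hypothetical; the source does not describe the exact prompt given to the editing model, and no editing API is called here.

```python
def build_edit_instruction(prompt, defects, top_k=3):
    """Compose a localized editing instruction from predicted defects.

    defects: list of dicts with "box", "type", "reason", "importance".
    Returns None for a clean image (nothing to fix).
    """
    ranked = sorted(defects, key=lambda d: d["importance"], reverse=True)[:top_k]
    if not ranked:
        return None
    lines = [
        f'Original prompt: "{prompt}"',
        "Fix only the regions listed below and leave the rest of the image unchanged:",
    ]
    for index, defect in enumerate(ranked, start=1):
        x1, y1, x2, y2 = defect["box"]
        lines.append(
            f"{index}. [{defect['type']}] region ({x1}, {y1})-({x2}, {y2}): {defect['reason']}"
        )
    return "\n".join(lines)


defects = [
    {"box": [600, 300, 750, 450], "type": "artifact",
     "reason": "The hand has six fingers.", "importance": 2},
    {"box": [100, 400, 700, 900], "type": "misalignment",
     "reason": "The car is blue instead of red.", "importance": 4},
]
instruction = build_edit_instruction("a red car parked by a lake", defects)
```

Sorting by importance puts the most damaging failure first, which is one simple way to use the importance field when the editor can only act on a few regions per pass.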

Figure 7: Attribute-corrective and artifact-removal effectiveness of SDG-guided image editing, particularly in semantically subtle prompt-conditioned mismatches.

Theoretical and Practical Implications

This work establishes structure-aware, instance-level feedback as a general interface for evaluating and aligning generative models beyond the reach of global metrics or coarse pixel-wise feedback. The SDG formalism allows fine-grained reward shaping in diffusion model RL, exposing reward hacking that escapes scalar detectors and providing actionable signals for downstream editing.

Practically, this advances interpretability and failure diagnosis in T2I, supports modular system design (decoupling detector and generator), and serves as a foundation for extensible feedback taxonomies (future additions: safety, aesthetics, etc.). Theoretically, SDG highlights the unique alignment leverage of structured, compositional error representation and advances the interface between vision-language modeling and RL in generative alignment.

Future Directions

Subsequent work could exploit the modularity of SDG tuples by expanding the set of defect types, introducing richer spatial primitives (e.g., masks, relation graphs), or incorporating subjective attributes (aesthetics, safety) into structured feedback. Integration with active learning and adversarial detection frameworks appears promising for bootstrapping the annotation of failure cases. Additionally, using synthetic or web-scale sources to scale structured feedback, and thereby adapt SDG to more diverse domains, remains an open research question.

Conclusion

Structured Defect Grounding furnishes a unified, extensible instance-level protocol for T2I diagnosis, evaluation, and alignment. Through both rigorous empirical validation and modular downstream application, it advances the granularity, generalizability, and interpretability of feedback in modern generative modeling (2606.06113).
