- The paper shows that selective aggregation of attention maps significantly improves segmentation accuracy, achieving a mean IoU increase from 0.7540 to 0.7765.
- The methodology employs a Head Relevance Vector to identify and average only the top 20–25% of attention heads that align with human-interpretable concepts.
- The approach aids in debugging and prompt disambiguation by clearly distinguishing semantic focus regions, enhancing model interpretability.
Selective Aggregation of Attention Maps for Improved Diffusion-Based Visual Interpretation
Introduction
This paper addresses the challenge of interpreting and controlling text-to-image (T2I) generative models, specifically diffusion models such as Stable Diffusion, by analyzing and selectively aggregating cross-attention maps. Previous approaches, most notably DAAM [tang2022daam], have relied on aggregating attention maps across all heads to visualize semantic correspondences between prompt tokens and generated image regions. However, these methods overlook the heterogeneous functionality across attention heads, as established in recent work [park2024cross]. The authors demonstrate that restricting aggregation to the most concept-relevant heads, determined via the Head Relevance Vector (HRV) framework, significantly enhances both interpretability and segmentation performance compared to indiscriminate averaging.
Methodology
The foundation of this work is the observation that cross-attention heads in T2I models specialize for distinct semantic concepts and that this specialization can be leveraged for improved analysis. The methodology comprises:
- Attention Head Relevance Estimation: HRV quantifies the alignment between each head and human-interpretable concepts using concept-word embedding activations, yielding a relevance vector for each semantic category.
- Selective Aggregation: For a given concept, only the subset (top 20–25%) of attention heads with highest HRV scores are used to average attention maps, as opposed to DAAM's aggregation across all heads.
- Benchmarking and Thresholding: The alignment of aggregated attention maps with ground-truth segmentations is evaluated via mean Intersection over Union (IoU) using the Grounded-SAM model [ren2024grounded] as reference.
Empirical Results
The experimental evaluation uses Stable Diffusion v1.4 and focuses on the 'Animals' concept with a standardized prompt set. Key findings include:
- Quantitative Segmentation Accuracy: Selective aggregation outperforms DAAM at all confidence thresholds (e.g., $0.7765$ vs $0.7540$ mean IoU at a $0.4$ threshold), demonstrating the advantage of head selection.
- Ablation on Relevance: Aggregating only the 30 most relevant heads yields significantly higher IoU than aggregating the 30 least relevant heads, supporting the claim that concept specificity is critical for interpretability.
Figure 1: Qualitative comparison of attention-based object segmentation using DAAM (all heads) versus selective head aggregation, with red circles marking regions of suboptimal attention focus in the DAAM method.
- Qualitative Segmentation: Figure 1 illustrates that selective head aggregation focuses more precisely on animal regions, whereas DAAM frequently mis-allocates attention, as visually highlighted.
Figure 2: Visualization comparing attention aggregation from the 30 most versus 30 least concept-relevant heads, showing critical differences in the spatial accuracy and specificity of focus regions.
- Disambiguation and Error Diagnosis: The selective aggregation technique is also effective at diagnosing prompt misinterpretations due to lexical ambiguity. For the ambiguous prompt "A mouse and other office objects are on the desk," concept-specific aggregation splits attention across object meanings—animal versus electronics—unlike DAAM, which averages both, obscuring the underlying disambiguation.
Figure 3: Analysis of the ambiguous prompt 'mouse,' where separate aggregations over animal-relevant and electronics-relevant heads visually disentangle dual interpretations; in contrast, DAAM conflates both.
- Head Count Sensitivity: An ablation study reveals optimal performance when 30 of 128 heads (≈23%) are selected for aggregation, denoting that excessive inclusion dilutes concept specificity, while too few heads diminishes spatial coverage.
Analysis
The core findings establish several insights with direct implications:
- Selective Aggregation: The most relevant heads concentrate semantic attention spatially and semantically, increasing both quantitative and qualitative interpretability metrics. This provides a more faithful representation of model concept binding than all-head averaging.
- Head Specialization: There exists a pronounced variance in head utility for different concepts; indiscriminate adaptation or guidance techniques that fail to respect head specialization are empirically suboptimal.
- Interpretability and Debugging: The approach enables concrete diagnosis of model confusion and prompt ambiguity by orthogonally visualizing different semantic head groups, supporting actionable model analysis and prompt engineering.
Implications and Future Directions
Practical Implications: The enhanced interpretability facilitates improved segmentation, image editing, and model debugging workflows. Since attention maps are critical primitives in the growing literature on controllable T2I generation [hertz2022prompt, chefer2023attend, ahn2025fine], selective aggregation generalizes as a tool for more reliable concept binding and visual explanation.
Theoretical Impact: This work reinforces the compositional nature of cross-attention representations and the importance of head specialization, motivating future analyses of architectural selection, head re-weighting, or pruning for improved T2I controllability and transparency.
Extension: While this study benchmarks only on 'Animals' and Stable Diffusion v1.4, the paradigm is applicable to other architectures (e.g. SDXL [podell2024sdxl], Imagen [saharia2022photorealistic]) and concepts, as supported by existing evidence on the universal specialization of attention mechanisms [ahn2025fine, park2024cross].
Future work should extend the evaluation across more diverse concepts and architectures, as well as explore joint optimization schemes for concept-specific head selection and aggregation.
Conclusion
This study demonstrates that selective aggregation of cross-attention maps, guided by relevance analysis, enhances diffusion-based visual interpretation in T2I models. The approach achieves higher alignment with semantic segmentation ground-truths and affords improved qualitative diagnosis of model behavior. These findings support a shift from undifferentiated attention map aggregation towards concept-aware interpretability workflows, with implications for more precise, controllable, and transparent visual generation.