Rectified Cross-Attention in Diffusion Models
- RCA is a modification to cross-attention that uses semantic masks to enforce strict token-to-region binding in layout-to-image synthesis.
- It rectifies attention maps via hard or soft masking, ensuring that text tokens modulate only the designated image regions.
- Empirical evaluations on datasets like COCO-Stuff and ADE20K demonstrate enhanced compositional control, improved image quality, and increased diversity compared to baseline methods.
Rectified Cross-Attention (RCA) is a structural modification to the cross-attention mechanism in diffusion models, specifically devised to enforce a precise spatial binding between text token semantics and designated image regions during layout-to-image synthesis. RCA operates as a zero-parameter "plug-in" that systematically rectifies the attention maps in every cross-attention layer, ensuring each text token can only modulate pixels within its allocated region. This mechanism enables the generation of images grounded in arbitrary spatial layouts, supporting both in-distribution classes and the composition of "freestyle" scenes with unseen semantics—capabilities central to freestyle layout-to-image synthesis (FLIS) and evaluated in tasks such as those on COCO-Stuff and ADE20K (Xue et al., 2023).
1. Standard Cross-Attention in Diffusion U-Nets
In the baseline diffusion U-Net architecture for layout-to-image synthesis, cross-attention mediates the integration of visual and linguistic representations. At a hidden layer, the flattened image feature map $\phi(z_t) \in \mathbb{R}^{hw \times d}$ is projected into queries $Q = \phi(z_t) W_Q$, while the $N$ text token embeddings $\tau(y)$ are projected to keys $K = \tau(y) W_K$ and values $V = \tau(y) W_V$ using learned linear maps. The unnormalized attention score for token $n$ at spatial position $i$ is

$$S_{i,n} = \frac{Q_i K_n^\top}{\sqrt{d}}.$$

Attention weights are obtained by a softmax over tokens at each spatial position, $A_{i,n} = \operatorname{softmax}_n(S_{i,n})$, and the output is formed by reprojecting the attended values, $O_i = \sum_n A_{i,n} V_n$. This formulation permits free flow of information between all text tokens and all pixels, potentially conflating semantics across disparate regions.
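For concreteness, here is a minimal PyTorch sketch of this baseline cross-attention block. Module and dimension names are illustrative, not FreestyleNet's actual code:

```python
# Minimal sketch of standard cross-attention in a diffusion U-Net block.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_img: int, d_txt: int, d_attn: int):
        super().__init__()
        self.to_q = nn.Linear(d_img, d_attn, bias=False)  # W_Q
        self.to_k = nn.Linear(d_txt, d_attn, bias=False)  # W_K
        self.to_v = nn.Linear(d_txt, d_attn, bias=False)  # W_V
        self.to_out = nn.Linear(d_attn, d_img)            # output reprojection
        self.scale = d_attn ** -0.5

    def forward(self, x, tokens):
        # x:      (B, h*w, d_img)  flattened image features
        # tokens: (B, N, d_txt)    text token embeddings
        q, k, v = self.to_q(x), self.to_k(tokens), self.to_v(tokens)
        scores = torch.einsum("bid,bnd->bin", q, k) * self.scale  # S_{i,n}
        attn = scores.softmax(dim=-1)       # normalize over tokens per pixel
        out = torch.einsum("bin,bnd->bid", attn, v)
        return self.to_out(out)
```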
2. RCA Formulation and Mechanism
Rectified Cross-Attention modifies the standard attention computation by enforcing strict or relaxed masking between tokens and spatial positions according to a prescribed semantic layout. For $C$ semantic regions, each with a binary mask $M_c \in \{0,1\}^{hw}$, a text token $n$ aligned with concept $c$ is associated with the mask $M_n = M_c$. The pre-softmax attention logit is rectified:
- Hard mask: $\tilde{S}_{i,n} = S_{i,n}$ where $M_n(i) = 1$, and $\tilde{S}_{i,n} = -\infty$ where $M_n(i) = 0$.
- Soft mask: $\tilde{S}_{i,n} = S_{i,n} - \lambda\,(1 - M_n(i))$ for a finite bias $\lambda > 0$, permitting controlled leakage outside the region.
Rectified scores are then normalized, $A_{i,n} = \operatorname{softmax}_n(\tilde{S}_{i,n})$, and the cross-attended output is $O_i = \sum_n A_{i,n} V_n$ (Xue et al., 2023).
The sketch below reflects these operations, enforcing explicit token-to-region correspondence at every cross-attention layer.
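This is a minimal PyTorch rendering of the rectification step, not the authors' code; logits are assumed to have shape (B, hw, N), with binary per-token masks aligned as described in Section 3:

```python
# Rectification of pre-softmax cross-attention logits (hard and soft variants).
import torch

def rectified_attention(scores, token_masks, soft_bias=None):
    """scores:      (B, hw, N) pre-softmax logits S_{i,n}
       token_masks: (B, hw, N) binary masks M_n(i); global/padding tokens
                    carry all-ones masks, so no pixel is fully masked out
       soft_bias:   None for the hard mask, or a finite lambda > 0"""
    if soft_bias is None:
        # Hard mask: forbid attention outside each token's region.
        scores = scores.masked_fill(token_masks == 0, float("-inf"))
    else:
        # Soft mask: a finite penalty permits controlled cross-region leakage.
        scores = scores - soft_bias * (1.0 - token_masks)
    return scores.softmax(dim=-1)  # renormalize over tokens at each pixel
```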
3. Semantic Mask Representation and Alignment
For each (image, layout, text) triplet:
- The semantic layout is given as a one-hot map $L \in \{0,1\}^{h \times w \times C}$, expanded to $C$ binary masks $\{M_c\}_{c=1}^{C}$.
- Text concepts corresponding to the $C$ semantic regions are concatenated into a sentence, tokenized, and encoded (e.g., via a frozen CLIP text encoder), yielding $N$ token embeddings.
- Tokens are aligned: the first $C$ tokens correspond to the $C$ mask channels; supplementary, special, or padding tokens are assigned an all-ones mask so that they act globally.
- At each cross-attention layer, the scores are masked before the softmax, enforcing nonzero attention for each token only at pixels within its designated region (see the alignment sketch after this list).
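A hedged sketch of this alignment, assuming a one-to-one mapping between the first C tokens and the C layout channels and nearest-neighbor downsampling to each layer's feature resolution (both implementation details not specified in the source; the helper name is hypothetical):

```python
# Build per-token masks for one cross-attention layer from the layout.
import torch
import torch.nn.functional as F

def build_token_masks(layout_onehot, n_tokens, feat_h, feat_w):
    """layout_onehot: (B, C, H, W) one-hot semantic layout
       returns:       (B, feat_h*feat_w, n_tokens) per-token binary masks"""
    B, C, _, _ = layout_onehot.shape
    # Downsample the masks to this layer's spatial resolution.
    masks = F.interpolate(layout_onehot.float(), size=(feat_h, feat_w),
                          mode="nearest")
    masks = masks.flatten(2).transpose(1, 2)            # (B, hw, C)
    # Special / padding tokens beyond the first C get all-ones (global) masks.
    globals_ = torch.ones(B, feat_h * feat_w, n_tokens - C,
                          device=masks.device, dtype=masks.dtype)
    return torch.cat([masks, globals_], dim=-1)         # (B, hw, N)
```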
4. Training Objective and Optimization
No additional loss terms are introduced by RCA. The system is trained end-to-end using the standard latent-diffusion denoising loss

$$\mathcal{L} = \mathbb{E}_{z_t,\, y,\, l,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\, \lVert \epsilon - \epsilon_\theta(z_t, t, \tau(y), l) \rVert_2^2 \,\right],$$

where $z_t$ is the noisy latent at timestep $t$, $y$ is the input text, $l$ the layout, $\tau$ a frozen text encoder (CLIP), and $\epsilon_\theta$ the U-Net (Xue et al., 2023).
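A schematic training step under this objective; `unet`, `add_noise`, and the argument layout are placeholders rather than the authors' API:

```python
# One denoising training step; RCA adds no extra loss terms.
import torch
import torch.nn.functional as F

def training_step(unet, z0, text_tokens, token_masks, num_timesteps=1000):
    # z0: clean latents (B, C, H, W); token_masks are routed to every RCA layer.
    t = torch.randint(0, num_timesteps, (z0.size(0),), device=z0.device)
    noise = torch.randn_like(z0)
    z_t = add_noise(z0, noise, t)  # forward diffusion q(z_t | z_0); assumed helper
    pred = unet(z_t, t, text_tokens, token_masks)  # RCA consumes the masks
    return F.mse_loss(pred, noise)  # || eps - eps_theta ||_2^2
```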
5. Empirical Evaluation and Results
RCA, integrated into FreestyleNet, demonstrates state-of-the-art performance on COCO-Stuff and ADE20K across established metrics:
| Method | COCO FID ↓ | COCO mIoU ↑ | ADE20K FID ↓ | ADE20K mIoU ↑ |
|---|---|---|---|---|
| Pix2PixHD | 111.5 | 14.6 | 81.8 | 20.3 |
| SPADE | 22.6 | 37.4 | 33.9 | 38.5 |
| CC-FPSE | 19.2 | 41.6 | 31.7 | 43.7 |
| OASIS | 17.0 | 44.1 | 28.3 | 48.8 |
| SC-GAN | 18.1 | 42.0 | 29.3 | 45.2 |
| PITI | 16.1 | 34.1 | 27.9 | 29.4 |
| FreestyleNet (RCA) | 14.4 | 40.7 | 25.0 | 41.9 |
FreestyleNet with RCA achieves the best FID (14.4 on COCO, 25.0 on ADE20K) and competitive mIoU (40.7 on COCO, 41.9 on ADE20K). Diversity, measured by LPIPS, is also enhanced: 0.592 (COCO) and 0.591 (ADE20K), outperforming prior methods such as PITI, OASIS, and CC-FPSE.
Ablation experiments reveal that swapping rectified attention maps between tokens allows local style transfer (e.g., region-wise application of "Van Gogh" style or object appearance exchange), demonstrating the efficacy and granularity of the region-token coupling.
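A conceptual sketch of this swap, assuming the rectified attention tensor has shape (B, hw, N) as in the earlier sketches (the helper is hypothetical): exchanging two tokens' attention columns routes each token's value vector into the other's region.

```python
# Exchange two tokens' rectified attention maps for regional style transfer.
def swap_attention_maps(attn, tok_a, tok_b):
    """attn: (B, hw, N) rectified attention weights; tok_a, tok_b: indices."""
    swapped = attn.clone()
    swapped[..., tok_a] = attn[..., tok_b]  # token a now modulates region b
    swapped[..., tok_b] = attn[..., tok_a]  # token b now modulates region a
    return swapped
```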
6. Limitations and Prospective Extensions
- Infrequent or counterfactual scenes (e.g., "a marshmallow in the sky") may not be synthesized realistically, owing to limited priors from pre-training or to overfitting during fine-tuning.
- Strict binary masking may be overly rigid. A relaxation is possible by replacing the $-\infty$ logit with a large finite negative bias, or by adopting soft masks $M_n(i) \in [0,1]$, allowing controlled cross-region leakage.
- Current alignment requires a predetermined set of categories; further work could explore automatic mapping of arbitrary phrases to region tokens or prompt-tuning to obviate manual token–mask correspondence.
- Employing more advanced pre-trained diffusion priors, such as Imagen, and optimizing fine-tuning strategies that preserve generative capacity may further improve in-distribution and compositional generalization performance.
7. Comparative Context and Broader Significance
The RCA principle is one concrete instantiation of the more general "rectification" of cross-attention, which is also reflected in other mechanisms such as Dynamic Cross-Attention (DCA) for multi-modal systems (Praveen et al., 28 Mar 2024). While RCA enforces region-token binding via explicit spatial masking in the generative U-Net, related rectification strategies dynamically modulate cross-attention strength according to the complementarity of modalities, as in the gating procedure for robust audio-visual fusion. This conceptual and practical alignment suggests the potential universality of rectified attention paradigms for fine-grained, context-sensitive information fusion across tasks in vision, language, and beyond.
RCA enables controlled freestyle generation by binding semantics to layout through structurally masked cross-attention, providing a foundation for compositional image synthesis, regional style transfer, and flexible semantic manipulation in generative modeling (Xue et al., 2023).