InfSplign: Inference-Time Spatial Alignment
- InfSplign is an inference-time method that enhances spatial alignment in text-to-image diffusion models through cross-attention-derived loss functions.
- It computes spatial, presence, and balance losses from cross-attention maps to guide the denoising process, ensuring objects are placed according to textual prompts.
- Empirical evaluations show significant improvements in spatial accuracy over baselines, though the method is limited to four spatial relations and requires initial object presence.
InfSplign is an inference-time method for enhancing spatial alignment in text-to-image (T2I) diffusion models. It addresses the persistent challenge where diffusion models—such as Stable Diffusion—fail to render objects in spatial configurations explicitly dictated by textual prompts. By leveraging cross-attention maps within pretrained models, InfSplign introduces a compound loss to guide image generation during sampling, yielding improvements in spatial accuracy without requiring retraining or additional supervision (Rastegar et al., 19 Dec 2025).
1. Motivation and Challenges in Spatial Alignment
Diffusion-based T2I models excel at photorealism but often produce images violating spatial relationships specified in prompts, such as rendering “A cat to the left of a motorcycle” as “a cat to the right of a motorcycle,” or by co-locating the objects or omitting one entirely. These failures are consequential in domains where spatial correctness is critical, including robotic scene synthesis, AR content placement, and graphical layout automation.
Underlying causes include: (i) training regimes lacking fine-grained spatial annotations (CLIP-style encoders supervise global image-text matching, not relative layouts); (ii) textual embeddings poorly separating spatial semantics (“left” vs. “right,” etc.); and (iii) the reliance of existing solutions on either computationally expensive fine-tuning (e.g., CoMPaSS, SPRIGHT) or inference-time systems that require auxiliary inputs or complex scene understanding pipelines (e.g., conditional layout maps, LLM-based guidance).
2. Methodological Framework
InfSplign is a training-free wrapper for any pretrained T2I diffusion backbone (including Stable Diffusion v1.4, v2.1, SDXL) and operates exclusively at inference. At every reverse diffusion step $t$, it performs the following pipeline:
- Given latent $z_t$ and a prompt $P = \langle A, R, B \rangle$ (where $A$ and $B$ are object tokens and $R$ is a spatial relation: left, right, above, below), a U-Net forward pass yields:
  - The predicted noise terms $\epsilon_{\text{uncond}}$ and $\epsilon_{\text{cond}}$ under classifier-free guidance.
  - Cross-attention maps $A_t^{(l)}$, $B_t^{(l)}$ for tokens $A$ and $B$ at decoder layers $l$.
- Three loss terms—spatial, presence, and balance—are computed from the attention maps.
- The total InfSplign loss $\mathcal{L}$ is backpropagated to $z_t$ to obtain a gradient $\nabla_{z_t}\mathcal{L}$.
- The predicted noise is adjusted as $\epsilon_t = \epsilon_{\text{uncond}} + \gamma\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}) + \eta\,\nabla_{z_t}\mathcal{L}$, with CFG scale $\gamma$ (e.g., 7.5) and InfSplign guidance weight $\eta$ (e.g., 1000).
- The latent is updated as $z_{t-1} = z_t - s_t\,\epsilon_t$, where $s_t$ is the sampler's step size.
Gradients from the compound loss guide the next denoising step, directly steering object placement without modifying model weights. This process exploits the correlation between cross-attention localization and image regions corresponding to tokenized objects.
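The two update rules above can be written compactly in PyTorch. The snippet below is a minimal illustrative sketch, not the authors' implementation: it assumes `loss` is the scalar compound loss obtained from a grad-enabled forward pass in which `z_t` has `requires_grad=True`, and `s_t` stands in for a generic sampler step size.

```python
import torch

def guided_noise(eps_uncond, eps_cond, loss, z_t, gamma=7.5, eta=1000.0):
    """Classifier-free guidance plus the InfSplign gradient term (sketch)."""
    # Backpropagate the compound loss to the current latent z_t.
    grad = torch.autograd.grad(loss, z_t)[0]
    # eps_t = eps_uncond + gamma * (eps_cond - eps_uncond) + eta * grad
    return eps_uncond + gamma * (eps_cond - eps_uncond) + eta * grad

def latent_update(z_t, eps_t, s_t):
    """Generic latent update z_{t-1} = z_t - s_t * eps_t (real samplers use a scheduler)."""
    return z_t.detach() - s_t * eps_t
```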
3. Compound Loss Construction
InfSplign’s loss function integrates spatial placement, object presence, and balance constraints derived from cross-attention statistics. Cross-attention maps are parsed at coarse (decoder block 1, layers 1–3) and mid-level (decoder block 2, layers 1–3) decoder layers.
- Centroid and Variance Computation: For each token $X \in \{A, B\}$ and layer $l$, the attention map $X_t^{(l)}$ yields an attention-weighted centroid $c_X^{(l)} = \sum_{p} p\, X_t^{(l)}(p) \,/\, \sum_{p} X_t^{(l)}(p)$ and spatial variance $\sigma_X^{2,(l)} = \sum_{p} \lVert p - c_X^{(l)} \rVert^2\, X_t^{(l)}(p) \,/\, \sum_{p} X_t^{(l)}(p)$, where $p$ ranges over the map's spatial positions (a code sketch of these statistics follows this section).
- Spatial Alignment Loss: For the axis dictated by $R$, use the signed centroid difference $\Delta$ (positive when the relation is satisfied). With margin $m$ and steepness $\alpha$: $\mathcal{L}_{\text{spatial}} = f\big(\alpha\,(m - \Delta)\big)$, where $f$ is GeLU.
- Object Presence Loss: $\mathcal{L}_{\text{presence}} = \sum_{l \in \text{coarse}} \big(\sigma_A^{2,(l)} + \sigma_B^{2,(l)}\big)$.
- Representation Balance Loss: $\mathcal{L}_{\text{balance}} = \sum_{l \in \text{mid}} \big|\sigma_A^{2,(l)} - \sigma_B^{2,(l)}\big|$.
- Total Loss: $\mathcal{L} = \lambda_s \mathcal{L}_{\text{spatial}} + \lambda_p \mathcal{L}_{\text{presence}} + \lambda_b \mathcal{L}_{\text{balance}}$.
The loss weights $\lambda_s$, $\lambda_p$, and $\lambda_b$ are tuned separately for SD v1.4 and SD v2.1.
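The loss terms can be sketched directly from these attention statistics. The snippet below is illustrative only: the function names, the default margin, steepness, and $\lambda$ weights, the per-layer averaging of the spatial term, and the normalized image-coordinate convention (x to the right, y downward, both in $[0, 1]$) are assumptions, not values or interfaces from the paper.

```python
import torch
import torch.nn.functional as F

def centroid_and_variance(attn):
    """Attention-weighted centroid (x, y) and spatial variance of an (H, W) map."""
    H, W = attn.shape
    w = attn / (attn.sum() + 1e-8)                        # normalize to a distribution
    ys = torch.arange(H, dtype=attn.dtype, device=attn.device) / max(H - 1, 1)
    xs = torch.arange(W, dtype=attn.dtype, device=attn.device) / max(W - 1, 1)
    cy = (w.sum(dim=1) * ys).sum()                        # weighted row centroid
    cx = (w.sum(dim=0) * xs).sum()                        # weighted column centroid
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    var = (w * ((yy - cy) ** 2 + (xx - cx) ** 2)).sum()   # weighted spatial variance
    return torch.stack([cx, cy]), var

def infsplign_loss(maps_A, maps_B, relation, coarse, mid,
                   m=0.1, alpha=10.0, lam_s=1.0, lam_p=1.0, lam_b=1.0):
    """Compound loss from per-layer attention maps {layer: (H, W) tensor}.

    relation is one of "left", "right", "above", "below"; m, alpha, and the
    lambda weights are illustrative defaults, not the paper's tuned values.
    """
    # Spatial loss: signed centroid difference along the relation's axis,
    # penalized with GeLU(alpha * (m - delta)) and averaged over layers.
    spatial = 0.0
    for l in maps_A:
        c_a, _ = centroid_and_variance(maps_A[l])
        c_b, _ = centroid_and_variance(maps_B[l])
        if relation == "left":
            delta = c_b[0] - c_a[0]     # A left of B: x_A < x_B
        elif relation == "right":
            delta = c_a[0] - c_b[0]
        elif relation == "above":
            delta = c_b[1] - c_a[1]     # y grows downward
        else:                           # "below"
            delta = c_a[1] - c_b[1]
        spatial = spatial + F.gelu(alpha * (m - delta))
    spatial = spatial / len(maps_A)

    # Presence loss over coarse layers; balance loss over mid-level layers.
    presence = sum(centroid_and_variance(maps_A[l])[1] +
                   centroid_and_variance(maps_B[l])[1] for l in coarse)
    balance = sum((centroid_and_variance(maps_A[l])[1] -
                   centroid_and_variance(maps_B[l])[1]).abs() for l in mid)

    return lam_s * spatial + lam_p * presence + lam_b * balance
```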
4. Algorithmic and Implementation Details
InfSplign requires only three additional modules—an AttentionExtractor (for cross-attention maps), LossComputer (for $\mathcal{L}$), and NoiseAdjuster (for noise vector modification)—to wrap the standard denoising loop. Sampling uses 50 inference steps, CFG scale $\gamma = 7.5$, and guidance weight $\eta = 1000$. Computational overhead is an additional 10–15% sampling time, dominated by the backward passes for $\nabla_{z_t}\mathcal{L}$.
Pseudocode (for a single reverse step):
```
Input: prompt P = <A, R, B>, latent z_t, model ε_θ, U-Net yielding attention maps

 1. (ε_uncond, ε_cond, {A_t^{(l)}, B_t^{(l)}}) ← U-Net(z_t; t, P)
 2. Compute centroids c_A, c_B (attention-weighted, Section 3)
 3. Compute variances σ_A², σ_B² (Section 3)
 4. Δ ← difference(c_A, c_B, R)
 5. L_spatial ← f_spatial(α (m − Δ))
 6. L_presence ← Σ_coarse (σ_A² + σ_B²)
 7. L_balance ← Σ_mid |σ_A² − σ_B²|
 8. L ← λ_s·L_spatial + λ_p·L_presence + λ_b·L_balance
 9. ε_t ← ε_uncond + γ·(ε_cond − ε_uncond) + η·∇_{z_t} L
10. z_{t−1} ← z_t − s_t·ε_t

return z_{t−1}
```
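For orientation, this pseudocode maps onto a per-step wrapper roughly as follows. The sketch reuses the illustrative `infsplign_loss` and `guided_noise` helpers shown earlier; the grad-enabled `unet` call returning noise predictions plus per-token attention maps, and the diffusers-style `scheduler.step(...)` interface, are assumptions about the surrounding code rather than the paper's released API.

```python
def infsplign_step(unet, scheduler, z_t, t, prompt_embeds, relation, coarse, mid,
                   gamma=7.5, eta=1000.0):
    """One InfSplign-guided reverse-diffusion step (illustrative sketch)."""
    z_t = z_t.detach().requires_grad_(True)
    # Assumed interface: grad-enabled forward pass returning unconditional and
    # conditional noise predictions plus cross-attention maps for tokens A, B
    # (in practice collected by hooking the U-Net's cross-attention layers).
    eps_uncond, eps_cond, maps_A, maps_B = unet(z_t, t, prompt_embeds)
    loss = infsplign_loss(maps_A, maps_B, relation, coarse, mid)
    eps_t = guided_noise(eps_uncond, eps_cond, loss, z_t, gamma, eta)
    # Standard scheduler update to z_{t-1} (diffusers-style scheduler assumed).
    return scheduler.step(eps_t.detach(), t, z_t.detach()).prev_sample
```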
5. Empirical Evaluation
Quantitative results on two primary spatial T2I benchmarks demonstrate substantial improvements.
VISOR Benchmark (MS-COCO Spatial Pairs)
| Model | OA (Object Accuracy) | VISOR₄ |
|---|---|---|
| SD v1.4 Baseline | 29.86% | 1.63% |
| STORM (v1.4, best prior) | 61.01% | 25.70% |
| InfSplign (v1.4) | 67.36% | 36.54% |
| SD v2.1 Baseline | 47.83% | 4.70% |
| STORM (v2.1) | 62.55% | 25.42% |
| InfSplign (v2.1) | 77.28% | 50.23% |
T2I-CompBench (Compositional Spatial Subtask)
| Model | Mean Spatial Accuracy |
|---|---|
| SD v1.4 Baseline | 0.1246 |
| STORM (v1.4) | 0.1613 |
| InfSplign (v1.4) | 0.3771 |
| SD v2.1 Baseline | 0.1342 |
| STORM (v2.1) | 0.1981 |
| InfSplign (v2.1) | 0.4172 |
On T2I-CompBench, InfSplign (v1.4) is +21.58 percentage points over STORM and +3.7 points over fine-tuned CoMPaSS; InfSplign (v2.1) is +21.91 points over STORM and +9.7 points over CoMPaSS (Rastegar et al., 19 Dec 2025). Qualitative examples show InfSplign correctly resolving many spatial misalignments that persist in both vanilla and baseline-enhanced outputs.
6. Limitations and Scope
InfSplign presupposes that the base diffusion model generates both specified objects; if an object is entirely missing, spatial alignment cannot be enforced. Its mechanism explicitly addresses four spatial relations (left, right, above, below). More intricate or multi-object spatial relations (“between,” “inside,” complex layouts) remain open challenges.
A plausible implication is that extension to a broader relationship set may require additional loss engineering or architectural support.
7. Broader Implications and Future Directions
InfSplign demonstrates that state-of-the-art spatial control in T2I diffusion can be achieved at inference by manipulating cross-attention-derived losses, obviating the need for retraining or auxiliary conditioning. The results suggest a general paradigm: pretrained diffusion models encode rich spatial information in cross-attention; inference-time guidance can unlock this capability for compositional image synthesis.
Future directions include generalizing the approach to Transformer backbones, handling richer relational and multi-object prompts, and exploring hybrid schemes—a combination of InfSplign with weak fine-tuning or LLM-based prompt transformation—for even more robust spatial control (Rastegar et al., 19 Dec 2025).