InfSplign: Inference-Time Spatial Alignment

Updated 26 December 2025
  • InfSplign is an inference-time method that enhances spatial alignment in text-to-image diffusion models through cross-attention-derived loss functions.
  • It computes spatial, presence, and balance losses from cross-attention maps to guide the denoising process, ensuring objects are placed according to textual prompts.
  • Empirical evaluations show significant improvements in spatial accuracy over baselines, though the method handles only four spatial relations and requires that the base model already generate both objects.

InfSplign is an inference-time method for enhancing spatial alignment in text-to-image (T2I) diffusion models. It addresses the persistent challenge where diffusion models—such as Stable Diffusion—fail to render objects in spatial configurations explicitly dictated by textual prompts. By leveraging cross-attention maps within pretrained models, InfSplign introduces a compound loss to guide image generation during sampling, yielding improvements in spatial accuracy without requiring retraining or additional supervision (Rastegar et al., 19 Dec 2025).

1. Motivation and Challenges in Spatial Alignment

Diffusion-based T2I models excel at photorealism but often produce images violating spatial relationships specified in prompts, such as rendering “A cat to the left of a motorcycle” as “a cat to the right of a motorcycle,” or co-locating or omitting objects entirely. These failures are consequential in domains where spatial correctness is critical, including robotic scene synthesis, AR content placement, and graphical layout automation.

Underlying causes include: (i) training regimes lacking fine-grained spatial annotations (CLIP-style encoders supervise global image-text matching, not relative layouts); (ii) textual embeddings poorly separating spatial semantics (“left” vs. “right,” etc.); and (iii) the reliance of existing solutions on either computationally expensive fine-tuning (e.g., CoMPaSS, SPRIGHT) or inference-time systems that require auxiliary inputs or complex scene understanding pipelines (e.g., conditional layout maps, LLM-based guidance).

2. Methodological Framework

InfSplign is a training-free wrapper for any pretrained T2I diffusion backbone (including Stable Diffusion v1.4, v2.1, SDXL) and operates exclusively at inference. At every reverse diffusion step $t$, it performs the following pipeline:

  • Given latent $z_t$ and a prompt $\mathcal{P} = \langle A, R, B\rangle$ (where $A$ and $B$ are object tokens and $R$ is a spatial relation: left, right, above, below), a U-Net forward pass yields:
    • The predicted noise term $\epsilon_\theta(z_t; t, y)$ under classifier-free guidance.
    • Cross-attention maps $\mathcal{A}_t^{(l)}$, $\mathcal{B}_t^{(l)}$ for tokens $A$ and $B$ at decoder layers $l$.
  • Three loss terms—spatial, presence, and balance—are computed from the attention maps.
  • The total InfSplign loss $\mathcal{L}_\text{InfSplign}$ is backpropagated to $z_t$ to obtain a gradient $\nabla_{z_t}\mathcal{L}_\text{InfSplign}$.
  • The predicted noise is adjusted using:

$$\epsilon_t \gets \epsilon_\theta(z_t; t) + \gamma\bigl(\epsilon_\theta(z_t; t, y) - \epsilon_\theta(z_t; t)\bigr) + \eta\,\nabla_{z_t}\mathcal{L}_\text{InfSplign}$$

with CFG scale $\gamma$ (e.g., 7.5) and InfSplign guidance weight $\eta$ (e.g., 1000).

  • The latent is updated as $z_{t-1} = z_t - s_t\,\epsilon_t$.

Gradients from the compound loss guide the next denoising step, directly steering object placement without modifying model weights. This process exploits the correlation between cross-attention localization and image regions corresponding to tokenized objects.
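
As a concrete illustration of this step, the following PyTorch sketch shows one way the adjusted noise estimate and latent update could be implemented. The `unet` call signature (returning both the noise prediction and the token cross-attention maps), the `compute_infsplign_loss` helper, and the scalar step size `s_t` are illustrative assumptions, not the authors' released interface.

```python
import torch

def guided_denoising_step(unet, z_t, t, cond_emb, uncond_emb,
                          compute_infsplign_loss,
                          gamma=7.5, eta=1000.0, s_t=1.0):
    """One reverse step: classifier-free guidance plus InfSplign guidance.

    Assumes `unet(z, t, emb)` returns (noise_prediction, cross_attention_maps)
    and that `compute_infsplign_loss` maps the attention maps to a scalar.
    """
    z_t = z_t.detach().requires_grad_(True)

    # Conditional pass (the only one whose attention maps are needed) and
    # unconditional pass for classifier-free guidance.
    eps_cond, attn_maps = unet(z_t, t, cond_emb)
    with torch.no_grad():
        eps_uncond, _ = unet(z_t, t, uncond_emb)

    # Compound InfSplign loss and its gradient with respect to the latent.
    loss = compute_infsplign_loss(attn_maps)
    grad = torch.autograd.grad(loss, z_t)[0]

    # Adjusted noise estimate: CFG term plus the InfSplign guidance term.
    eps_t = eps_uncond + gamma * (eps_cond - eps_uncond) + eta * grad

    # Latent update z_{t-1} = z_t - s_t * eps_t.
    return (z_t - s_t * eps_t).detach()
```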

3. Compound Loss Construction

InfSplign’s loss function integrates spatial placement, object presence, and balance constraints derived from cross-attention statistics. Cross-attention maps are taken from coarse (decoder block 1, layers 1–3) and mid-level (decoder block 2, layers 1–3) decoder layers. A consolidated code sketch of the computation follows the list below.

  • Centroid and Variance Computation: For each token and layer, the map $\mathcal{A}_t^{(l)} \in \mathbb{R}^{H \times W}$ yields a centroid:

$$c_A^{(l)} = (x_A, y_A) = \left(\frac{\sum_{h,w} \mathcal{A}_t^{(l)}[h,w]\, x_w}{\sum_{h,w} \mathcal{A}_t^{(l)}[h,w]},\ \frac{\sum_{h,w} \mathcal{A}_t^{(l)}[h,w]\, y_h}{\sum_{h,w} \mathcal{A}_t^{(l)}[h,w]}\right)$$

and spatial variance:

$$\sigma_A^{2\,(l)} = \frac{\sum_{h,w} \mathcal{A}_t^{(l)}[h,w]\, \bigl\| (x_w, y_h) - c_A^{(l)} \bigr\|^2}{\sum_{h,w} \mathcal{A}_t^{(l)}[h,w]}$$

  • Spatial Alignment Loss: For the axis dictated by $R$, use the signed centroid difference $\Delta$. With margin $m$ and steepness $\alpha$:

$$\mathcal{L}_\text{spatial} = f_\text{spatial}\bigl(\alpha (m - \Delta)\bigr)$$

where $f_\text{spatial}$ is the GeLU function.

  • Object Presence Loss:

$$\mathcal{L}_\text{presence} = \sum_{X \in \{A, B\}} \sum_{l \in \text{coarse}} \sigma_X^{2\,(l)}$$

  • Representation Balance Loss:

$$\mathcal{L}_\text{balance} = \sum_{l \in \text{mid}} \bigl| \sigma_A^{2\,(l)} - \sigma_B^{2\,(l)} \bigr|$$

  • Total Loss:

$$\mathcal{L}_\text{InfSplign} = \lambda_s\,\mathcal{L}_\text{spatial} + \lambda_p\,\mathcal{L}_\text{presence} + \lambda_b\,\mathcal{L}_\text{balance}$$

Typical weights: $\lambda_s = 0.5$, $\lambda_p = 1.0$, $\lambda_b = 0.5$ for SD v1.4; $\lambda_b = 1.0$ for SD v2.1.
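
The PyTorch sketch below assembles the compound loss from per-layer attention maps. The sign convention for $\Delta$, the choice of mid-level maps for the spatial term, the default values of $m$ and $\alpha$, and all function and argument names are illustrative assumptions; only the formulas mirror the definitions above.

```python
import torch
import torch.nn.functional as F

def centroid_and_variance(attn):
    """Attention-weighted centroid and spatial variance of an H x W map."""
    H, W = attn.shape
    ys = torch.arange(H, dtype=attn.dtype, device=attn.device)
    xs = torch.arange(W, dtype=attn.dtype, device=attn.device)
    total = attn.sum()
    x_c = (attn * xs[None, :]).sum() / total
    y_c = (attn * ys[:, None]).sum() / total
    # Attention-weighted squared distance of every cell from the centroid.
    dist2 = (xs[None, :] - x_c) ** 2 + (ys[:, None] - y_c) ** 2
    var = (attn * dist2).sum() / total
    return (x_c, y_c), var

def infsplign_loss(maps_A, maps_B, relation,
                   lambda_s=0.5, lambda_p=1.0, lambda_b=0.5,
                   margin=1.0, alpha=1.0):
    """Compound loss; maps_A / maps_B are dicts with 'coarse' and 'mid' lists
    of H x W cross-attention maps for tokens A and B."""
    # Spatial loss: signed centroid difference along the axis given by R.
    l_spatial = 0.0
    for a, b in zip(maps_A['mid'], maps_B['mid']):
        (xa, ya), _ = centroid_and_variance(a)
        (xb, yb), _ = centroid_and_variance(b)
        if relation == 'left':        # A should lie to the left of B
            delta = xb - xa
        elif relation == 'right':
            delta = xa - xb
        elif relation == 'above':     # image y grows downward
            delta = yb - ya
        else:                         # 'below'
            delta = ya - yb
        l_spatial = l_spatial + F.gelu(alpha * (margin - delta))

    # Presence loss: sum of attention variances over the coarse layers.
    l_presence = 0.0
    for a, b in zip(maps_A['coarse'], maps_B['coarse']):
        _, var_a = centroid_and_variance(a)
        _, var_b = centroid_and_variance(b)
        l_presence = l_presence + var_a + var_b

    # Balance loss: variance mismatch between the two tokens on mid layers.
    l_balance = 0.0
    for a, b in zip(maps_A['mid'], maps_B['mid']):
        _, var_a = centroid_and_variance(a)
        _, var_b = centroid_and_variance(b)
        l_balance = l_balance + (var_a - var_b).abs()

    return lambda_s * l_spatial + lambda_p * l_presence + lambda_b * l_balance
```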

4. Algorithmic and Implementation Details

InfSplign requires only three additional modules (an AttentionExtractor for cross-attention maps, a LossComputer for $\mathcal{L}_\text{InfSplign}$, and a NoiseAdjuster for noise-vector modification) to wrap the standard denoising loop. Sampling uses 50 inference steps, CFG scale $\gamma = 7.5$, and $\eta = 1000$. Computational overhead is an additional 10–15% of sampling time, dominated by the backward passes for $\nabla_{z_t}\mathcal{L}_\text{InfSplign}$.

Pseudocode (for a single reverse step):

```
Input: prompt P = <A, R, B>, latent z_t, model ε_θ, U-Net yielding attention maps
1.  (ε_uncond, ε_cond, {A_t^{(l)}, B_t^{(l)}}) ← U-Net(z_t; t, P)
2.  Compute centroids c_A, c_B (centroid formula, Section 3)
3.  Compute variances σ_A², σ_B² (variance formula, Section 3)
4.  Δ ← signed centroid difference between c_A and c_B along the axis given by R
5.  L_spatial ← f_spatial(α(m − Δ))
6.  L_presence ← Σ_coarse (σ_A² + σ_B²)
7.  L_balance ← Σ_mid |σ_A² − σ_B²|
8.  L ← λ_s·L_spatial + λ_p·L_presence + λ_b·L_balance
9.  ε_t ← ε_uncond + γ·(ε_cond − ε_uncond) + η·∇_{z_t} L
10. z_{t−1} ← z_t − s_t·ε_t
return z_{t−1}
```

The pipeline does not alter pretrained weights and is compatible with any text-conditioned U-Net.
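
To make the module decomposition concrete, the schematic sketch below wires the three wrapper components around a plain sampling loop. The class and method names (`attention_extractor.collect()`, `loss_computer(...)`, `noise_adjuster(...)`) and the per-step sizes are illustrative placeholders; only the control flow mirrors the pseudocode above.

```python
import torch

def sample_with_infsplign(unet, z_T, timesteps, step_sizes, cond_emb, uncond_emb,
                          attention_extractor, loss_computer, noise_adjuster):
    """Outer loop wrapping a pretrained U-Net with the three InfSplign modules."""
    z = z_T
    for i, t in enumerate(timesteps):                    # e.g. 50 reverse steps
        z = z.detach().requires_grad_(True)
        eps_cond = unet(z, t, cond_emb)                  # conditional pass
        with torch.no_grad():
            eps_uncond = unet(z, t, uncond_emb)          # unconditional pass
        attn = attention_extractor.collect()             # A_t^(l), B_t^(l) maps
        loss = loss_computer(attn)                       # L_InfSplign
        grad = torch.autograd.grad(loss, z)[0]           # gradient w.r.t. latent
        eps = noise_adjuster(eps_uncond, eps_cond, grad) # CFG + InfSplign term
        z = (z - step_sizes[i] * eps).detach()           # z_{t-1} = z_t - s_t*eps_t
    return z
```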

5. Empirical Evaluation

Quantitative results on two primary spatial T2I benchmarks demonstrate substantial improvements.

VISOR Benchmark (MS-COCO Spatial Pairs)

| Model | OA (object accuracy) | VISOR₄ |
| --- | --- | --- |
| SD v1.4 Baseline | 29.86% | 1.63% |
| STORM (best prior) | 61.01% | 25.70% |
| InfSplign (v1.4) | 67.36% | 36.54% |
| SD v2.1 Baseline | 47.83% | 4.70% |
| STORM (v2.1) | 62.55% | 25.42% |
| InfSplign (v2.1) | 77.28% | 50.23% |

T2I-CompBench (Compositional Spatial Subtask)

| Model | Mean spatial accuracy |
| --- | --- |
| SD v1.4 Baseline | 0.1246 |
| STORM | 0.1613 |
| InfSplign (v1.4) | 0.3771 |
| SD v2.1 Baseline | 0.1342 |
| STORM | 0.1981 |
| InfSplign (v2.1) | 0.4172 |

On T2I-CompBench, InfSplign improves mean spatial accuracy over STORM by 21.58 points (v1.4) and 21.91 points (v2.1), and exceeds fine-tuned CoMPaSS by 3.7 and 9.7 points, respectively (Rastegar et al., 19 Dec 2025). Qualitative examples show InfSplign correctly resolving many spatial misalignments that persist in both vanilla and baseline-enhanced outputs.

6. Limitations and Scope

InfSplign presupposes that the base diffusion model generates both specified objects; if an object is entirely missing, spatial alignment cannot be enforced. Its mechanism explicitly addresses four spatial relations (left, right, above, below). More intricate or multi-object spatial relations (“between,” “inside,” complex layouts) remain open challenges.

A plausible implication is that extension to a broader relationship set may require additional loss engineering or architectural support.

7. Broader Implications and Future Directions

InfSplign demonstrates that state-of-the-art spatial control in T2I diffusion can be achieved at inference by manipulating cross-attention-derived losses, obviating the need for retraining or auxiliary conditioning. The results suggest a general paradigm: pretrained diffusion models encode rich spatial information in cross-attention; inference-time guidance can unlock this capability for compositional image synthesis.

Future directions include generalizing the approach to Transformer backbones, handling richer relational and multi-object prompts, and exploring hybrid schemes—a combination of InfSplign with weak fine-tuning or LLM-based prompt transformation—for even more robust spatial control (Rastegar et al., 19 Dec 2025).
