InfSplign: Inference-Time Spatial Alignment
- InfSplign is an inference-time method that enhances spatial alignment in text-to-image diffusion models through cross-attention-derived loss functions.
- It computes spatial, presence, and balance losses from cross-attention maps to guide the denoising process, ensuring objects are placed according to textual prompts.
- Empirical evaluations show significant improvements in spatial accuracy over baselines, though the method is limited to four spatial relations and requires initial object presence.
InfSplign is an inference-time method for enhancing spatial alignment in text-to-image (T2I) diffusion models. It addresses the persistent challenge where diffusion models—such as Stable Diffusion—fail to render objects in spatial configurations explicitly dictated by textual prompts. By leveraging cross-attention maps within pretrained models, InfSplign introduces a compound loss to guide image generation during sampling, yielding improvements in spatial accuracy without requiring retraining or additional supervision (Rastegar et al., 19 Dec 2025).
1. Motivation and Challenges in Spatial Alignment
Diffusion-based T2I models excel at photorealism but often produce images violating spatial relationships specified in prompts, such as rendering “A cat to the left of a motorcycle” as “a cat to the right of a motorcycle,” or by co-locating the objects or omitting one entirely. These failures are consequential in domains where spatial correctness is critical, including robotic scene synthesis, AR content placement, and graphical layout automation.
Underlying causes include: (i) training regimes lacking fine-grained spatial annotations (CLIP-style encoders supervise global image-text matching, not relative layouts); (ii) textual embeddings poorly separating spatial semantics (“left” vs. “right,” etc.); and (iii) the reliance of existing solutions on either computationally expensive fine-tuning (e.g., CoMPaSS, SPRIGHT) or inference-time systems that require auxiliary inputs or complex scene understanding pipelines (e.g., conditional layout maps, LLM-based guidance).
2. Methodological Framework
InfSplign is a training-free wrapper for any pretrained T2I diffusion backbone (including Stable Diffusion v1.4, v2.1, SDXL) and operates exclusively at inference. At every reverse diffusion step $t$, it performs the following pipeline:
- Given latent $z_t$ and a prompt $P = \langle A, R, B \rangle$ (where $A$ and $B$ are object tokens and $R$ is a spatial relation: left, right, above, below), a U-Net forward pass yields:
  - The predicted noise terms $\epsilon_{\text{uncond}}$ and $\epsilon_{\text{cond}}$ under classifier-free guidance.
  - Cross-attention maps $A_t^{(l)}$, $B_t^{(l)}$ for tokens $A$ and $B$ at decoder layers $l$.
- Three loss terms—spatial, presence, and balance—are computed from the attention maps.
- The total InfSplign loss $\mathcal{L}$ is backpropagated to $z_t$ to obtain a gradient $\nabla_{z_t}\mathcal{L}$.
- The predicted noise is adjusted as $\epsilon_t = \epsilon_{\text{uncond}} + \gamma\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}}) + \eta\,\nabla_{z_t}\mathcal{L}$, with CFG scale $\gamma$ (e.g., 7.5) and InfSplign guidance weight $\eta$ (e.g., 1000).
- The latent is updated as $z_{t-1} = z_t - s_t\,\epsilon_t$, where $s_t$ is the sampler's step size.
Gradients from the compound loss guide the next denoising step, directly steering object placement without modifying model weights. This process exploits the correlation between cross-attention localization and image regions corresponding to tokenized objects.
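The two update rules above can be written compactly in PyTorch. The snippet below is a minimal illustrative sketch, not the authors' implementation: it assumes `loss` is the scalar compound loss obtained from a grad-enabled forward pass in which `z_t` has `requires_grad=True`, and `s_t` stands in for a generic sampler step size.

```python
import torch

def guided_noise(eps_uncond, eps_cond, loss, z_t, gamma=7.5, eta=1000.0):
    """Classifier-free guidance plus the InfSplign gradient term (sketch)."""
    # Backpropagate the compound loss to the current latent z_t.
    grad = torch.autograd.grad(loss, z_t)[0]
    # eps_t = eps_uncond + gamma * (eps_cond - eps_uncond) + eta * grad
    return eps_uncond + gamma * (eps_cond - eps_uncond) + eta * grad

def latent_update(z_t, eps_t, s_t):
    """Generic latent update z_{t-1} = z_t - s_t * eps_t (real samplers use a scheduler)."""
    return z_t.detach() - s_t * eps_t
```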
3. Compound Loss Construction
InfSplign’s loss function integrates spatial placement, object presence, and balance constraints derived from cross-attention statistics. Cross-attention maps are parsed at coarse (decoder block 1, layers 1–3) and mid-level (decoder block 2, layers 1–3) decoder layers.
- Centroid and Variance Computation: For each token $X \in \{A, B\}$ and layer $l$, the attention map $X_t^{(l)}$ yields an attention-weighted centroid $c_X^{(l)} = \sum_{p} p\, X_t^{(l)}(p) \,/\, \sum_{p} X_t^{(l)}(p)$ and spatial variance $\sigma_X^{2,(l)} = \sum_{p} \lVert p - c_X^{(l)} \rVert^2\, X_t^{(l)}(p) \,/\, \sum_{p} X_t^{(l)}(p)$, where $p$ ranges over the map's spatial positions (a code sketch of these statistics follows this section).
- Spatial Alignment Loss: For the axis dictated by $R$, use the signed centroid difference $\Delta$ (positive when the relation is satisfied). With margin $m$ and steepness $\alpha$: $\mathcal{L}_{\text{spatial}} = f\big(\alpha\,(m - \Delta)\big)$, where $f$ is GeLU.
- Object Presence Loss: $\mathcal{L}_{\text{presence}} = \sum_{l \in \text{coarse}} \big(\sigma_A^{2,(l)} + \sigma_B^{2,(l)}\big)$.
- Representation Balance Loss: $\mathcal{L}_{\text{balance}} = \sum_{l \in \text{mid}} \big|\sigma_A^{2,(l)} - \sigma_B^{2,(l)}\big|$.
- Total Loss: $\mathcal{L} = \lambda_s \mathcal{L}_{\text{spatial}} + \lambda_p \mathcal{L}_{\text{presence}} + \lambda_b \mathcal{L}_{\text{balance}}$.
The loss weights $\lambda_s$, $\lambda_p$, and $\lambda_b$ are tuned separately for SD v1.4 and SD v2.1.
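The loss terms can be sketched directly from these attention statistics. The snippet below is illustrative only: the function names, the default margin, steepness, and $\lambda$ weights, the per-layer averaging of the spatial term, and the normalized image-coordinate convention (x to the right, y downward, both in $[0, 1]$) are assumptions, not values or interfaces from the paper.

```python
import torch
import torch.nn.functional as F

def centroid_and_variance(attn):
    """Attention-weighted centroid (x, y) and spatial variance of an (H, W) map."""
    H, W = attn.shape
    w = attn / (attn.sum() + 1e-8)                        # normalize to a distribution
    ys = torch.arange(H, dtype=attn.dtype, device=attn.device) / max(H - 1, 1)
    xs = torch.arange(W, dtype=attn.dtype, device=attn.device) / max(W - 1, 1)
    cy = (w.sum(dim=1) * ys).sum()                        # weighted row centroid
    cx = (w.sum(dim=0) * xs).sum()                        # weighted column centroid
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    var = (w * ((yy - cy) ** 2 + (xx - cx) ** 2)).sum()   # weighted spatial variance
    return torch.stack([cx, cy]), var

def infsplign_loss(maps_A, maps_B, relation, coarse, mid,
                   m=0.1, alpha=10.0, lam_s=1.0, lam_p=1.0, lam_b=1.0):
    """Compound loss from per-layer attention maps {layer: (H, W) tensor}.

    relation is one of "left", "right", "above", "below"; m, alpha, and the
    lambda weights are illustrative defaults, not the paper's tuned values.
    """
    # Spatial loss: signed centroid difference along the relation's axis,
    # penalized with GeLU(alpha * (m - delta)) and averaged over layers.
    spatial = 0.0
    for l in maps_A:
        c_a, _ = centroid_and_variance(maps_A[l])
        c_b, _ = centroid_and_variance(maps_B[l])
        if relation == "left":
            delta = c_b[0] - c_a[0]     # A left of B: x_A < x_B
        elif relation == "right":
            delta = c_a[0] - c_b[0]
        elif relation == "above":
            delta = c_b[1] - c_a[1]     # y grows downward
        else:                           # "below"
            delta = c_a[1] - c_b[1]
        spatial = spatial + F.gelu(alpha * (m - delta))
    spatial = spatial / len(maps_A)

    # Presence loss over coarse layers; balance loss over mid-level layers.
    presence = sum(centroid_and_variance(maps_A[l])[1] +
                   centroid_and_variance(maps_B[l])[1] for l in coarse)
    balance = sum((centroid_and_variance(maps_A[l])[1] -
                   centroid_and_variance(maps_B[l])[1]).abs() for l in mid)

    return lam_s * spatial + lam_p * presence + lam_b * balance
```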
4. Algorithmic and Implementation Details
InfSplign requires only three additional modules—an AttentionExtractor (for cross-attention maps), LossComputer (for $\mathcal{L}$), and NoiseAdjuster (for noise vector modification)—to wrap the standard denoising loop. Sampling uses 50 inference steps, CFG scale $\gamma = 7.5$, and guidance weight $\eta = 1000$. Computational overhead is an additional 10–15% sampling time, dominated by the backward passes for $\nabla_{z_t}\mathcal{L}$.
Pseudocode (for a single reverse step):
```
Input: prompt P = <A, R, B>, latent z_t, model ε_θ, U-Net yielding attention maps

 1. (ε_uncond, ε_cond, {A_t^{(l)}, B_t^{(l)}}) ← U-Net(z_t; t, P)
 2. Compute centroids c_A, c_B (attention-weighted, Section 3)
 3. Compute variances σ_A², σ_B² (Section 3)
 4. Δ ← difference(c_A, c_B, R)
 5. L_spatial ← f_spatial(α (m − Δ))
 6. L_presence ← Σ_coarse (σ_A² + σ_B²)
 7. L_balance ← Σ_mid |σ_A² − σ_B²|
 8. L ← λ_s·L_spatial + λ_p·L_presence + λ_b·L_balance
 9. ε_t ← ε_uncond + γ·(ε_cond − ε_uncond) + η·∇_{z_t} L
10. z_{t−1} ← z_t − s_t·ε_t

return z_{t−1}
```
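For orientation, this pseudocode maps onto a per-step wrapper roughly as follows. The sketch reuses the illustrative `infsplign_loss` and `guided_noise` helpers shown earlier; the grad-enabled `unet` call returning noise predictions plus per-token attention maps, and the diffusers-style `scheduler.step(...)` interface, are assumptions about the surrounding code rather than the paper's released API.

```python
def infsplign_step(unet, scheduler, z_t, t, prompt_embeds, relation, coarse, mid,
                   gamma=7.5, eta=1000.0):
    """One InfSplign-guided reverse-diffusion step (illustrative sketch)."""
    z_t = z_t.detach().requires_grad_(True)
    # Assumed interface: grad-enabled forward pass returning unconditional and
    # conditional noise predictions plus cross-attention maps for tokens A, B
    # (in practice collected by hooking the U-Net's cross-attention layers).
    eps_uncond, eps_cond, maps_A, maps_B = unet(z_t, t, prompt_embeds)
    loss = infsplign_loss(maps_A, maps_B, relation, coarse, mid)
    eps_t = guided_noise(eps_uncond, eps_cond, loss, z_t, gamma, eta)
    # Standard scheduler update to z_{t-1} (diffusers-style scheduler assumed).
    return scheduler.step(eps_t.detach(), t, z_t.detach()).prev_sample
```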
5. Empirical Evaluation
Quantitative results on two primary spatial T2I benchmarks demonstrate substantial improvements.
VISOR Benchmark (MS-COCO Spatial Pairs)
| Model | OA (Object Accuracy) | VISOR₄ |
|---|---|---|
| SD v1.4 Baseline | 29.86% | 1.63% |
| STORM (v1.4, best prior) | 61.01% | 25.70% |
| InfSplign (v1.4) | 67.36% | 36.54% |
| SD v2.1 Baseline | 47.83% | 4.70% |
| STORM (v2.1) | 62.55% | 25.42% |
| InfSplign (v2.1) | 77.28% | 50.23% |
T2I-CompBench (Compositional Spatial Subtask)
| Model | Mean Spatial Accuracy |
|---|---|
| SD v1.4 Baseline | 0.1246 |
| STORM (v1.4) | 0.1613 |
| InfSplign (v1.4) | 0.3771 |
| SD v2.1 Baseline | 0.1342 |
| STORM (v2.1) | 0.1981 |
| InfSplign (v2.1) | 0.4172 |
On T2I-CompBench, InfSplign (v1.4) is +21.58 percentage points over STORM and +3.7 points over fine-tuned CoMPaSS; InfSplign (v2.1) is +21.91 points over STORM and +9.7 points over CoMPaSS (Rastegar et al., 19 Dec 2025). Qualitative examples show InfSplign correctly resolving many spatial misalignments that persist in both vanilla and baseline-enhanced outputs.
6. Limitations and Scope
InfSplign presupposes that the base diffusion model generates both specified objects; if an object is entirely missing, spatial alignment cannot be enforced. Its mechanism explicitly addresses four spatial relations (left, right, above, below). More intricate or multi-object spatial relations (“between,” “inside,” complex layouts) remain open challenges.
A plausible implication is that extension to a broader relationship set may require additional loss engineering or architectural support.
7. Broader Implications and Future Directions
InfSplign demonstrates that state-of-the-art spatial control in T2I diffusion can be achieved at inference by manipulating cross-attention-derived losses, obviating the need for retraining or auxiliary conditioning. The results suggest a general paradigm: pretrained diffusion models encode rich spatial information in cross-attention; inference-time guidance can unlock this capability for compositional image synthesis.
Future directions include generalizing the approach to Transformer backbones, handling richer relational and multi-object prompts, and exploring hybrid schemes—a combination of InfSplign with weak fine-tuning or LLM-based prompt transformation—for even more robust spatial control (Rastegar et al., 19 Dec 2025).