SpotFusion: Dynamic Feature Fusion in SpotEdit
- SpotFusion is a dynamic feature fusion mechanism in SpotEdit that adaptively merges cached and condition image features to preserve global coherence during selective image editing.
- It employs a cosine-squared fusion coefficient to interpolate key and value vectors, keeping regenerated tokens aligned with static regions and minimizing boundary artifacts.
- Empirical results demonstrate that SpotFusion improves image fidelity and speeds up processing (1.7×–1.9×) compared to naïve skipping or static fusion methods.
SpotFusion is a dynamic feature fusion mechanism central to SpotEdit, a framework designed for selective region editing in diffusion transformer models. SpotFusion enables efficient and high-fidelity editing of images by adaptively blending features from condition images and evolving edited tokens, preserving both the fidelity of unedited regions and global contextual coherence throughout the transformer denoising process (Qin et al., 26 Dec 2025).
1. Role Within SpotEdit and Editing Pipeline Segmentation
SpotEdit structures its editing pipeline in three sequential phases:
- Initial Full-Image Denoising (Initial Steps): The model denoises the entire image, caching Key/Value (KV) pairs for both the current latent and condition image streams across all transformer blocks.
- Selective Denoising (Spot Steps): SpotSelector, a perceptual similarity-based classifier, splits spatial tokens into non-edited regions (to be skipped) and regions to be regenerated. Only the latter are actively updated. Direct token omission, however, would break the cross-token attention required for coherent outputs.
- Final Token Replacement: Before decoding, the system copies non-edited latent tokens back from the condition image, ensuring perfect reconstruction in those areas.
SpotFusion operates in the Selective Denoising phase, solving the critical problem of restoring contextual signals for regenerated tokens by reintroducing and adaptively updating the KV features of non-edited regions. This maintains transformer attention coherence and prevents artifacts at the boundary between edited and static regions (Qin et al., 26 Dec 2025).
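The three-phase pipeline above can be condensed into a simple step scheduler. This is a minimal sketch under stated assumptions: the function name `schedule_phases` and the idea of labeling each step are illustrative, not from the paper.

```python
def schedule_phases(total_steps: int, initial_full_steps: int) -> list[str]:
    """Label each denoising step with its SpotEdit phase:
    full-image denoising first, then selective 'spot' steps,
    with a final token-replacement pass before decoding."""
    labels = ["full" if s < initial_full_steps else "spot"
              for s in range(total_steps)]
    labels.append("replace")  # copy non-edited latents back from the condition image
    return labels
```

For example, `schedule_phases(4, 2)` yields two full-image steps, two selective steps, and the final replacement pass.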
2. Mathematical Formulation of the SpotFusion Mechanism
For each timestep $t$ (normalized so that $t = 0$ corresponds to pure noise and $t = 1$ to full denoising), SpotSelector divides tokens into non-edited indices $\mathcal{S}$ and regenerated indices $\mathcal{R}$. For every transformer block $\ell$, SpotFusion computes fused KV representations for non-edited tokens $i \in \mathcal{S}$ using:

$$K_i^{\ell,\mathrm{fused}} = \alpha(t)\,K_i^{\ell,\mathrm{cache}} + \bigl(1 - \alpha(t)\bigr)\,K_i^{\ell,\mathrm{cond}}, \qquad V_i^{\ell,\mathrm{fused}} = \alpha(t)\,V_i^{\ell,\mathrm{cache}} + \bigl(1 - \alpha(t)\bigr)\,V_i^{\ell,\mathrm{cond}},$$

where $K_i^{\ell,\mathrm{cache}}$ and $V_i^{\ell,\mathrm{cache}}$ are cached KV vectors from the previous timestep for non-edited tokens, and $K_i^{\ell,\mathrm{cond}}$, $V_i^{\ell,\mathrm{cond}}$ are from the pure condition image.

The fusion coefficient is scheduled as:

$$\alpha(t) = \cos^2\!\left(\frac{\pi t}{2}\right),$$

so that $\alpha(0) = 1$ and $\alpha(1) = 0$. As $t \to 1$, the emphasis shifts from historical KVs toward condition image KVs, ensuring temporal alignment between evolving and static regions. During multi-headed self-attention, queries for active (regenerated) tokens are formed as usual, while keys and values for skipped regions are constructed from the dynamically fused KVs, preserving their contextual influence (Qin et al., 26 Dec 2025).
3. Stepwise Application at Diffusion Timesteps
The per-timestep SpotFusion workflow within SpotEdit proceeds as follows:
- Token Partitioning: SpotSelector receives the latest reconstructed latent, the condition image latent, and a similarity threshold $\tau$, and outputs non-edited ($\mathcal{S}$) and regenerated ($\mathcal{R}$) token sets.
- KV Fusion and Caching: for each transformer block $\ell$ and each non-edited token $i \in \mathcal{S}$:
  - Retrieve the previous cached KV vectors and the pure-condition KVs.
  - Compute fused KVs via $\alpha(t)$-weighted interpolation.
  - Store the fused KVs in the cache for this step.
- Attention Construction:
- $Q$ = concatenated queries for the prompt and regenerated tokens.
- $K$, $V$ = concatenated keys and values from the prompt, active, fused non-edited, and condition ($y$) branches.
- Apply multi-head attention for active queries only.
- ODE Update: Only regenerated token slots are updated via the diffusion ODE step.
This stepwise approach ensures continual contextual input from static regions while limiting computation to only the necessary tokens.
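The per-timestep workflow above can be condensed into a toy sketch. Scalars stand in for latent features and KV vectors, and the `spot_step` name, the absolute-difference partitioning rule, and the relaxation update in step 3-4 are illustrative stand-ins for SpotSelector's perceptual similarity test and the actual diffusion ODE step.

```python
import math

def spot_step(latent, cond_latent, kv_cache, kv_cond, t, threshold=0.1):
    """One selective denoising step (illustrative sketch).

    latent / cond_latent: per-token scalars standing in for latent features.
    kv_cache / kv_cond: per-token (k, v) pairs standing in for per-block KVs.
    """
    # 1. Token partitioning: tokens close to the condition latent are skipped.
    regen = [i for i, (x, y) in enumerate(zip(latent, cond_latent))
             if abs(x - y) > threshold]
    skip = [i for i in range(len(latent)) if i not in regen]

    # 2. KV fusion and caching for non-edited tokens (cosine-squared schedule).
    a = math.cos(math.pi * t / 2.0) ** 2
    for i in skip:
        kc, vc = kv_cache[i]
        ky, vy = kv_cond[i]
        kv_cache[i] = (a * kc + (1 - a) * ky, a * vc + (1 - a) * vy)

    # 3-4. Attention + ODE update: only regenerated slots change
    #      (here a toy relaxation toward the condition value).
    for i in regen:
        latent[i] = latent[i] + 0.5 * (cond_latent[i] - latent[i])
    return regen, skip
```

The key invariant is visible in the sketch: skipped tokens never receive ODE updates, yet their (fused) KVs stay in the attention cache so regenerated tokens still attend to them.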
4. Contextual Coherence and Empirical Results
Experimental ablation in (Qin et al., 26 Dec 2025) demonstrates that:
- Naïve skipping of non-edited tokens (without KV reuse) disconnects the global attention, resulting in edge artifacts and visible discontinuities at edit region boundaries.
- Static KV reuse (no interpolation) causes contextual drift, producing blur and color mismatch artifacts over iterative steps.
- Dynamic SpotFusion resolves both issues; it preserves edge continuity by providing historically-evolved context early in denoising, but gracefully ensures strict fidelity to the condition image in non-edited regions as denoising concludes.
Quantitative results (Table S3) show that omitting SpotFusion reduces SSIM by approximately 0.10 and PSNR by around 2 dB. Static fusion (no refreshing) worsens DISTS and introduces artifacts. In contrast, SpotFusion recovers or slightly exceeds full-denoise baseline fidelity and delivers substantial speedup (1.7×–1.9×) (Qin et al., 26 Dec 2025).
5. Hyperparameters, Engineering Details, and Practical Guidance
Key SpotFusion parameters and practical considerations include:
| Parameter | Setting | Notes |
|---|---|---|
| Total diffusion steps | full, selective | Split between full-image and selective phases |
| SpotSelector threshold | $\tau$ | Perceptual-similarity cutoff for token partitioning |
| Fusion schedule | $\alpha(t) = \cos^2(\pi t/2)$ | Dynamically interpolates cached and condition KVs |
| KV caching | Both evolving (non-edited) and reference KVs | Maintains context and ground-truth fidelity |
| Periodic condition reset | Every 10 selective steps | Prevents numerical drift, avoids PSNR collapse (~1.6 dB loss if omitted) |
| Implementation environment | NVIDIA H200, CUDA 12.8, PyTorch 2.9 | Image resolution 1024×1024, random seed 42 |
Omitting the periodic reset mechanism yields a higher raw speedup (2.25× vs. 1.95×), but at the cost of non-trivial fidelity degradation (~1.6 dB PSNR).
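The periodic condition reset can be sketched as a cache overwrite on a fixed cadence. This is a hypothetical helper: the paper specifies only the every-10-selective-steps cadence, and the function name and in-place overwrite are assumptions.

```python
def maybe_reset_condition_cache(step, kv_cache, kv_cond, every=10):
    """Every `every` selective steps, overwrite the evolving KV cache
    with the pure condition-image KVs to stop numerical drift from
    accumulating across interpolation steps."""
    if step > 0 and step % every == 0:
        for i in range(len(kv_cache)):
            kv_cache[i] = kv_cond[i]
    return kv_cache
```

The design intuition: repeated α-weighted interpolation compounds floating-point error in the cached KVs, so periodically snapping the cache back to the reference stream bounds the drift.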
6. Significance and Impact in Diffusion-Based Editing
SpotFusion enables selective, region-wise editing without sacrificing the global context required for semantically and visually coherent image synthesis. By maintaining dynamic compatibility between evolving edited token states and stable, reference-grounded condition streams, SpotFusion eliminates boundary artifacts and preserves high-frequency detail in unmodified regions. This underpins both the empirical speedup and quality retention of SpotEdit relative to conventional uniform denoising, and establishes a principled method for efficient localized editing in diffusion transformer architectures (Qin et al., 26 Dec 2025).