Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

Published 29 Apr 2026 in cs.CV | (2604.26503v1)

Abstract: Diffusion models have achieved remarkable success in synthesizing complex static and temporal visuals, a breakthrough largely driven by Classifier-Free Guidance (CFG). However, despite its pivotal role in aligning generated content with textual prompts, standard CFG relies on a globally uniform scalar. This homogeneous amplification traps models in a well-documented "detail-artifact dilemma": low guidance scales fail to inject intricate semantics, while high scales inevitably cause structural degradation, color over-saturation, and temporal inconsistencies in videos. In this paper, we expose the physical root of this flaw through the lens of differential geometry. By analyzing Tweedie's Formula, we reveal that CFG intrinsically performs a tangential linear extrapolation. Because the natural data manifold is highly curved, this uniform linear step introduces a severe orthogonal deviation. To keep the generation trajectory safely bounded, we formulate a theoretical upper bound for spatial and adaptive guidance. Based on these geometric insights, we propose Spatial Adaptive Multi Guidance (SAMG), a training-free and virtually zero-cost sampling algorithm. SAMG dynamically computes point-wise conditional guidance energy, applying a conservative minimum scale to high-energy boundary regions to preserve delicate micro-textures, while deploying an aggressive maximum scale in low-energy regions to maximize semantic injection. Extensive experiments across diverse image (SD 1.5, SDXL, SD3.5 Medium) and video (CogVideoX, ModelScope) architectures demonstrate that SAMG effectively resolves the detail-artifact dilemma, achieving superior semantic alignment, structural integrity, and temporal smoothness without any computational overhead.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper presents SAMG, a novel algorithm that resolves the detail-artifact dilemma by adaptively modulating guidance based on pixel-wise energy levels.
It employs a negative-slope affine mapping for spatial energy normalization, ensuring semantic consistency and preservation of fine structural details.
Empirical evaluations across image and video synthesis tasks demonstrate superior performance in metrics like CLIPScore, FID, and SSIM compared to standard CFG.

Spatial Adaptive Multi Guidance: Resolving the Detail-Artifact Dilemma in Diffusion Sampling

Introduction

The proliferation of latent diffusion models (LDMs) and flow-based generative frameworks has precipitated a rapid advancement in high-fidelity image and video synthesis. A key enabler of these capabilities is Classifier-Free Guidance (CFG), which linearly extrapolates between unconditional and conditional predictions to amplify semantic alignment. However, the canonical global scaling of CFG introduces a persistent "detail-artifact dilemma": insufficient scale fails to inject intricate semantics, while excessive scale triggers structural collapse, color over-saturation, and temporal inconsistencies. The paper "Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models" (2604.26503) rigorously interrogates this fundamental flaw, offering a geometric analysis and proposing Spatial Adaptive Multi Guidance (SAMG)—a training-free, computationally trivial sampling algorithm that achieves point-wise adaptive modulation to resolve these trade-offs.

Geometric Analysis of CFG and Manifold Deviation

CFG is analytically construed as a tangential linear extrapolation on the data manifold, as dictated by the Tweedie formula and score-matching frameworks. The non-linear, highly curved structure of natural image manifolds means that a globally uniform guidance scale induces significant orthogonal deviation, particularly in high-energy (high-frequency, structurally complex) regions. The paper formalizes the magnitude of this local guidance via a spatial energy map $E_t(x)$ , computed as the pixel-wise squared delta score between conditional and unconditional predictions. This energy directly modulates manifold deviation, and the authors derive a theoretical upper bound for safe pixel-wise guidance:

$\omega_{ideal}(x) \le \sqrt{\frac{2\delta}{\kappa(x) E_t(x)}}$

Here, $\kappa(x)$ denotes local manifold curvature; regions of high curvature (micro-structures, texture boundaries) must be strictly bounded to prevent generation artifacts.

Figure 1: Visualization of the generated image and spatial guidance energy evolution, highlighting SAMG’s ability to accurately identify and preserve high-frequency micro-textures.

SAMG: A Practical, Zero-Cost Adaptive Modulation Scheme

Given the intractability of high-dimensional curvature estimation, SAMG relaxes the theoretical bound to a negative-slope affine mapping between normalized spatial energy and guidance scale within predefined safety bounds $[\omega_{\min}, \omega_{\max}]$ :

$\boldsymbol{\Omega}_{map}(x) = \omega_{\max} - \hat{E}_t(x) (\omega_{\max} - \omega_{\min})$

where $\hat{E}_t(x)$ is the normalized energy map. High-energy regions (structural boundaries) are assigned conservative minimum scales to rigorously preserve details, while low-energy areas receive aggressive maximum guidance to optimize semantic injection. Critically, spatial smoothing is explicitly precluded—pixel-wise independence is mathematically proven necessary to prevent energy leakage and violation of manifold bounds.

Figure 2: Decoupled latent energy evolution in SDXL channels, demonstrating functional separation of structural and color components and the sub-optimality of uniform global guidance.

Empirical Evaluation and Numerical Results

Comprehensive experiments are conducted on multiple model architectures (SD 1.5, SDXL, SD3.5 Medium, CogVideoX, ModelScope), leveraging diverse datasets (Pick-a-Pic, DrawBench, GenEval, COCO, ChronoMagic-Bench-150). SAMG consistently outperforms standard CFG and other adaptive guidance strategies (e.g., PAG, CFG++, CFG-Zero) according to quantitative metrics:

Image generation: SAMG yields improved CLIPScore, FID, Top- $k$ accuracy, and human preference scores, breaking the performance ceiling observed with globally tuned CFG.
Video generation: SAMG enhances semantic/temporal alignment (CLIP SIM, MTScore CLIP), frame visual quality (LPIPS, SSIM), and motion smoothness (CHScore Flow).
Figure 3: Qualitative comparison in video synthesis: SAMG simultaneously achieves sharp, temporally consistent subjects and preserves background micro-details, overcoming the CFG trade-off.

SAMG is shown to be complementary to advanced guidance schemes (e.g., combining with CFG++ recovers human preference lost to rigid manifold constraints), and generalizes seamlessly across architectural paradigms. Ablation confirms that kernel smoothing dramatically degrades performance due to energy leakage; strict pixel-wise mapping is indispensable.

Theoretical and Practical Implications

The geometric lens adopted in this work provides principled insight into the manifold-aware design of guidance strategies. SAMG’s pixel-wise energy bounding fundamentally prevents exponential global manifold divergence, ensuring high geometric fidelity through the entire generative trajectory. The analysis also demonstrates functional decoupling across latent channels, laying the theoretical foundation for future channel-wise adaptive guidance.

Practically, SAMG is essentially plug-and-play—training-free, computational costless, and architecture-agnostic. It enables robust semantic injection without compromising fine details, and its transferability to video synthesis tasks is empirically validated.

Limitations and Future Directions

SAMG’s reliance on latent spatial energy presents challenges in handling dense semantic overlaps, as multiple micro-structures may merge into a single high-energy region in low-resolution latent representations. In such cases, SAMG preserves overall structure but fails to semantically disentangle overlapping objects. The authors propose future integration of cross-attention priors or higher-resolution latent mappings to address this bottleneck.

Figure 4: Limitation—failure in dense semantic overlap: SAMG maintains structural integrity but cannot disentangle closely overlapping features, e.g., an earring fused with hair strands.

Furthermore, channel-wise adaptive modulation is a promising extension, based on empirical evidence of distinct energy dynamics in structural and color channels. Such granular guidance could further optimize the semantic-structure Pareto frontier.

Conclusion

This work establishes a rigorous geometric framework for understanding and bounding guidance in diffusion models, demonstrating both the necessity and efficacy of pixel-wise adaptive modulation. SAMG is computationally trivial, architecturally universal, and empirically superior for both text-to-image and text-to-video tasks. Its theoretical underpinnings and practical results not only eliminate the classical detail-artifact trade-off but also set a foundation for future research in channel-wise and cross-semantic adaptive guidance. Integration into distillation and teacher-free frameworks has the potential to elevate the expressive power and fidelity of next-generation generative models.