SageBwd: Backward Sampling & INT8 Optimization

Updated 16 January 2026
  • SageBwd is a name shared by two implementations: backward-only sampling for diffusion image editing in SAGE, and an INT8-quantized backward attention kernel for efficient training in SageAttention3.
  • In SAGE, backward sampling uses DDIM inversion with self-attention guidance and a gradient-based loss correction to preserve structural fidelity during targeted edits.
  • The INT8 attention kernel accelerates gradient computation via quantization, achieving up to a 1.15× end-to-end speedup with minimal memory overhead and accuracy loss.

SageBwd refers to two distinct but influential implementations in deep learning systems: the backward-only sampling procedure of SAGE (Self-Attention Guidance for image Editing) for diffusion-based image editing (Gomez-Trenado et al., 14 May 2025), and the low-bit backward-pass attention kernel in SageAttention3 for efficient attention gradient computation during model training (Zhang et al., 16 May 2025). Each manifestation applies specialized backward methodologies to optimize performance and address critical fidelity and efficiency concerns in its respective domain.

1. SageBwd in SAGE: Backward-Only Sampling for Diffusion Image Editing

The SAGE framework leverages DDIM (Denoising Diffusion Implicit Models) inversion to enable efficient, targeted editing of real images with pre-trained diffusion models. SageBwd implements the backward-only chain, sampling from the inverted trajectory under a new prompt while enforcing structural preservation via self-attention guidance.

The DDIM inversion pass encodes the input image $x_0$ to a latent $z_0 = \mathrm{Enc}(x_0)$ and computes a series of latents $\{ z_t^{\mathrm{inv}} \}$:

$$z_{t+1}^{\mathrm{inv}} = \frac{\sqrt{\alpha_{t+1}}}{\sqrt{\alpha_t}} \left(z_t^{\mathrm{inv}} - \sqrt{1-\alpha_t}\,\varepsilon_\theta^t(z_t^{\mathrm{inv}}, p_{\mathrm{in}})\right) + \sqrt{1-\alpha_{t+1}}\,\varepsilon_\theta^t(z_t^{\mathrm{inv}}, p_{\mathrm{in}})$$

Self-attention maps $S_{t,i}^{\mathrm{in}}$ are recorded at each block.
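The inversion recurrence can be sketched in a few lines. In this illustration, `eps_model` stands in for the pretrained noise predictor $\varepsilon_\theta$ and `alphas` is a simple placeholder schedule; both are assumptions for demonstration, not the SAGE implementation:

```python
import numpy as np

def ddim_invert(z0, eps_model, alphas, prompt):
    """Run the DDIM inversion recurrence, returning the latent trajectory."""
    traj = [z0]
    z = z0
    for t in range(len(alphas) - 1):
        a_t, a_next = alphas[t], alphas[t + 1]
        eps = eps_model(z, t, prompt)  # ε_θ^t(z_t^inv, p_in)
        # z_{t+1}^inv = sqrt(a_{t+1}/a_t)(z_t^inv - sqrt(1-a_t) ε) + sqrt(1-a_{t+1}) ε
        z = np.sqrt(a_next / a_t) * (z - np.sqrt(1 - a_t) * eps) \
            + np.sqrt(1 - a_next) * eps
        traj.append(z)
    return traj

# toy usage with a dummy noise predictor and a decreasing ᾱ_t schedule (placeholder)
alphas = np.linspace(0.9999, 0.05, 50)
dummy_eps = lambda z, t, p: 0.1 * z
traj = ddim_invert(np.ones((4, 4)), dummy_eps, alphas, "a photo of a cat")
```

In the real system, each call to the noise predictor would also record the self-attention maps $S_{t,i}^{\mathrm{in}}$ for later use in the guided backward pass.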

On the backward pass, SageBwd runs from $z_T$ with sampling guided by the edited prompt $p_{\mathrm{out}}$:

$$\varepsilon_{\theta}^{t,\mathrm{cfg}} = \varepsilon_\theta^t(z_t, p_{\mathrm{in}}) + w \left( \varepsilon_\theta^t(z_t, p_{\mathrm{out}}) - \varepsilon_\theta^t(z_t, p_{\mathrm{in}}) \right)$$

The DDIM backward step, with $\tilde\varepsilon_\theta^t$ denoting the guided noise estimate, is:

$$\hat z_{t-1} = \frac{\sqrt{\alpha_{t-1}}}{\sqrt{\alpha_t}} \left( z_t - \sqrt{1-\alpha_t}\,\tilde\varepsilon_\theta^t \right) + \sqrt{1-\alpha_{t-1}}\,\tilde\varepsilon_\theta^t$$

A self-attention loss is applied to enforce region fidelity:

$$\mathcal L_t^{\mathrm{self}} = \sum_{i=1}^N \left\lVert S_{t,i}^{\mathrm{out}} - S_{t,i}^{\mathrm{in}} \right\rVert_1$$

followed by a gradient correction:

$$z_t \leftarrow z_t - \lambda_t \nabla_{z_t}\mathcal L_t^{\mathrm{self}}$$

with a time-decaying scale $\lambda_t$.

Common settings and hyperparameters include $T=50$ timesteps, classifier-free guidance scale $w=7.5$, maximum self-attention guidance $\lambda_{\mathrm{max}} \approx 200$, attention resolutions (32×32 self-attention for 512×512 images, 24×24 for 768×768), and FP16 loss scaling by 500 for underflow avoidance (Gomez-Trenado et al., 14 May 2025).
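A single guided backward step can be sketched as follows. Here `eps_model` and `attn_grad` (the gradient of the self-attention L1 loss with respect to $z_t$) are hypothetical stand-ins for the real diffusion-model hooks, and the ordering (latent correction before the DDIM update) is one plausible reading of the procedure above:

```python
import numpy as np

def guided_backward_step(z_t, t, eps_model, alphas, p_in, p_out, w, lam_t, attn_grad):
    """One SAGE-style backward step: CFG noise, latent correction, DDIM update."""
    # classifier-free guidance: ε^cfg = ε(p_in) + w (ε(p_out) - ε(p_in))
    e_in = eps_model(z_t, t, p_in)
    e_out = eps_model(z_t, t, p_out)
    eps_cfg = e_in + w * (e_out - e_in)
    # gradient correction on the latent: z_t ← z_t - λ_t ∇ L_self
    z_t = z_t - lam_t * attn_grad(z_t, t)
    # DDIM backward step to ẑ_{t-1}
    a_t, a_prev = alphas[t], alphas[t - 1]
    return np.sqrt(a_prev / a_t) * (z_t - np.sqrt(1 - a_t) * eps_cfg) \
        + np.sqrt(1 - a_prev) * eps_cfg

# toy usage with dummy model hooks and a placeholder schedule
alphas = np.linspace(0.99, 0.1, 10)
eps = lambda z, t, p: (0.05 if p == "p_in" else 0.08) * z
g = lambda z, t: np.zeros_like(z)  # stand-in for ∇ L_self
z_prev = guided_backward_step(np.ones((2, 2)), 5, eps, alphas, "p_in", "p_out", 7.5, 0.0, g)
```

With $w=0$ and a zero correction, this reduces to a plain unconditional DDIM step, which is a useful sanity check when wiring up the guidance terms.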

2. SageBwd in SageAttention3: Low-Bit Backward Attention Kernel

SageBwd also denotes the backward kernel in SageAttention3, which applies INT8 quantization to the majority of attention matrix multiplications in the backward pass, dramatically accelerating gradient computation with minimal accuracy tradeoff.

For any block $X$ (FP16, shape $B\times D$), quantization is:

$$s_X = \frac{\max \left(|X|\right)}{127} \in \mathbb{R}_{\mathrm{FP32}}, \qquad \hat{X} = \left\lfloor X / s_X \right\rceil \in \mathbb{Z}_{[-127,127]}$$

Dequantization uses $X' = s_X \cdot \hat{X}$. The kernel applies this quantization to $Q$, $K$, $V$, the softmax outputs $P_{ij}$, the output gradients $dO_i$, and the intermediate gradients $dS_{ij}$. The only exception is the $dO\,V^\top$ matmul, which remains in FP16 for stability:
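The quantize/dequantize pair above is straightforward to sketch; this is a numpy illustration of the scheme, not the actual Triton kernel (the epsilon guard for all-zero blocks is an added assumption):

```python
import numpy as np

def quantize_int8(X):
    """Per-block symmetric INT8 quantization: s_X = max|X|/127, X̂ = round(X/s_X)."""
    s = max(np.abs(X).max() / 127.0, 1e-12)  # guard against an all-zero block
    Xq = np.clip(np.rint(X / s), -127, 127).astype(np.int8)
    return Xq, np.float32(s)

def dequantize(Xq, s):
    """Dequantization: X' = s_X · X̂."""
    return Xq.astype(np.float32) * s

X = np.random.default_rng(0).standard_normal((64, 128)).astype(np.float32)
Xq, s = quantize_int8(X)
err = np.abs(dequantize(Xq, s) - X).max()  # bounded by s/2 per element
```

Round-to-nearest keeps the per-element reconstruction error within half a quantization step, which is why a single FP32 scale per block is enough for most of the backward matmuls.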

  • Six out of seven backward matmuls use INT8×INT8→FP32 Tensor-Core instructions.
  • The critical $dO\,V^\top$ matmul stays in FP16, which improves the cosine similarity of $dQ$ from 97.47% to 99.77%.
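The cosine-similarity comparison used to justify this choice can be reproduced in miniature; the snippet below is a numpy simulation of an INT8×INT8→FP32 matmul against a full-precision reference, not the kernel itself:

```python
import numpy as np

def cosine_sim(a, b):
    """Flattened cosine similarity between two matrices."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def int8_matmul(A, B):
    """Simulate INT8×INT8→FP32: per-block scale, round, integer matmul, rescale."""
    sa, sb = np.abs(A).max() / 127.0, np.abs(B).max() / 127.0
    Aq = np.rint(A / sa).astype(np.int32)
    Bq = np.rint(B / sb).astype(np.int32)
    return (Aq @ Bq.T).astype(np.float32) * (sa * sb)

rng = np.random.default_rng(1)
dO = rng.standard_normal((128, 64)).astype(np.float32)
V = rng.standard_normal((128, 64)).astype(np.float32)
ref = dO @ V.T                             # full-precision reference
sim = cosine_sim(int8_matmul(dO, V), ref)  # close to, but below, 1.0
```

On well-conditioned random inputs the simulated INT8 product is already very close to the reference; the reported sensitivity of $dO\,V^\top$ arises from the badly scaled gradient distributions seen in real training, which this toy does not reproduce.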

Kernel architecture uses Triton for producer/consumer warp scheduling; softmax maxima are reused for efficient normalization; per-block max-reductions are fused with warp-shuffle patterns for reduced overhead. Ping-pong scheduling overlaps loads, matmuls, quantization, and store steps (Zhang et al., 16 May 2025).

3. Algorithmic Pseudocode and Implementation

Key steps in SageBwd for SAGE:

  1. Encode the input image and perform DDIM inversion, storing self-attention maps.
  2. On the backward pass, compute noise predictions under both prompts.
  3. Apply classifier-free guidance.
  4. Execute DDIM backward sampling.
  5. Extract self-attention maps and calculate self-attention reconstruction loss.
  6. Backpropagate the loss for a gradient step in latent space.
  7. Optionally blend unedited regions or replace tokens via masks and cross-attention mechanisms.

Principal workflow:

  1. Quantize forward activations and gradients per block using $\psi(\cdot)$.
  2. Initialize accumulators for $dQ, dK, dV$.
  3. Iterate over block pairs, reconstruct $S$ and $P$ in FP32, quantize $P$, and update $dV$ in INT8.
  4. Compute $dP, dS$ in FP16 for numerical stability.
  5. Quantize $dS$ and update $dQ, dK$ with INT8 matmuls.
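The workflow above can be simulated end to end for a single attention block. This is a numpy sketch of the precision assignments (INT8 for $dV$, $dQ$, $dK$; higher precision for $dP$, $dS$), with shapes and scaling deliberately simplified relative to the tiled kernel:

```python
import numpy as np

def q8(X):
    """Per-block INT8 quantization helper (ψ in the text)."""
    s = max(np.abs(X).max() / 127.0, 1e-12)
    return np.rint(X / s).astype(np.int32), s

def attn_backward_mixed(Q, K, V, dO):
    """Toy single-block attention backward mixing INT8 and FP16 matmuls."""
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)                   # reconstruct S in FP32
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)           # softmax probabilities

    Pq, sp = q8(P)
    dOq, so = q8(dO)
    dV = (Pq.T @ dOq) * (sp * so)                # INT8 matmul for dV

    dP = dO.astype(np.float16) @ V.T.astype(np.float16)    # dO V^T kept in FP16
    dS = P * (dP - (dP * P).sum(axis=-1, keepdims=True))   # softmax backward

    dSq, ss = q8(dS.astype(np.float32))
    Kq, sk = q8(K)
    Qq, sq = q8(Q)
    dQ = (dSq @ Kq) * (ss * sk) / np.sqrt(d)     # INT8 matmuls for dQ, dK
    dK = (dSq.T @ Qq) * (ss * sq) / np.sqrt(d)
    return dQ, dK, dV
```

Comparing the outputs against a full-precision reference shows only small relative errors on well-behaved inputs, consistent with the six-of-seven INT8 matmul design.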

4. Performance, Fidelity, and Empirical Metrics

In SAGE, SageBwd preserves unedited regions with high fidelity while enabling efficient region-targeted edits. It offers significant quantitative improvements, with SAGE ranking top in 7/10 benchmarking analyses and receiving unanimous user-study preference (Gomez-Trenado et al., 14 May 2025).

For attention kernels, SageBwd (8-bit) achieves:

  • Kernel throughput of ≈2.0 TFLOP/s on RTX 4090 (1.67× over FlashAttention2 FP16).
  • End-to-end latency reduction: 1.9 s/step for Llama 8k context vs. 2.1 s/step with BF16 FlashAttention2 (1.15× speedup).
  • Minimal memory overhead (+4 bytes/block for INT8 scale), <0.1% of activation memory (Zhang et al., 16 May 2025).

Numerical accuracy is lossless for fine-tuning: downstream metrics stay within ±0.5% of BF16. In full pretraining, SageBwd converges more slowly (~10–15%), primarily due to INT8 error accumulation over longer training runs (Zhang et al., 16 May 2025).

5. Critical Hyperparameters and Best Practices

SageBwd implementations depend on carefully tuned hyperparameters:

  • SAGE: $T=50$–$100$, $w=1$–$15$, $\lambda_{\rm max}=200$, $\lambda_t=\lambda_{\rm max}(t/T)$ (linear/quadratic decay), specific attention-map sizes per image resolution.
  • SageAttention3: Retain the $dO\,V^\top$ matmul in FP16; use per-block quantization for all matrices except the softmax output (per-token). Fuse quantization with the softmax maxima and reuse the forward-pass softmax maxima in the backward pass.
  • SAGE: Mask and replace cross-attention blocks for fine-grained, region-level control.
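The SAGE decay schedule $\lambda_t = \lambda_{\rm max}(t/T)$ is simple enough to state directly; the `mode` switch between linear and quadratic decay is an illustrative convenience, not an API from the paper:

```python
import numpy as np

def lambda_schedule(T=50, lam_max=200.0, mode="linear"):
    """Time-decaying guidance scale: λ_t = λ_max · (t/T), or its square."""
    frac = np.arange(T + 1) / T
    return lam_max * (frac if mode == "linear" else frac ** 2)

lam = lambda_schedule()
# λ_T = λ_max at the start of sampling (t = T), decaying to 0 as t → 0
```

Since sampling runs from $t=T$ down to $t=0$, the correction is strongest early, when the latent is still noisy, and vanishes near the final denoised image.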

Multi-GPU environments should interleave kernel execution with gradient accumulation for efficiency; quantization overhead is negligible in practice (Zhang et al., 16 May 2025).

6. Applications and Significance

SageBwd-augmented SAGE is specialized for high-fidelity, efficient real-image editing with pretrained diffusion models, enabling controlled, region-specific modifications that preserve non-edited content (Gomez-Trenado et al., 14 May 2025). SageBwd in SageAttention3 unlocks resource-efficient training for large models and fine-tuning scenarios, offering nearly lossless backward gradient propagation under aggressive quantization while reducing computational cost and latency (Zhang et al., 16 May 2025).

A plausible implication is that the separation of backward-only mechanisms in specialized workflows (structural guidance for editing and INT8 quantization for training) represents a key innovation in exploiting model invariances and computational hardware for domain-optimized deep learning pipelines.
