SageBwd: Backward Sampling & INT8 Optimization
- SageBwd is a name shared by two implementations: backward-only sampling for diffusion image editing in SAGE, and INT8-quantized backward attention for efficient training in SageAttention3.
- In SAGE, backward sampling uses DDIM inversion with self-attention guidance and loss correction to preserve structural fidelity during targeted modifications.
- The INT8 attention kernel accelerates gradient computation by leveraging quantization, achieving up to 1.15× speedup with minimal memory overhead and accuracy loss.
SageBwd refers to two distinct but influential implementations in deep learning systems: the backward-only sampling procedure of SAGE (Self-Attention Guidance for image Editing) for diffusion-based image editing (Gomez-Trenado et al., 14 May 2025), and the low-bit backward-pass attention kernel in SageAttention3 for efficient attention gradient computation during model training (Zhang et al., 16 May 2025). Each manifestation applies specialized backward methodologies to optimize performance and address critical fidelity and efficiency concerns in its respective domain.
1. SageBwd in SAGE: Backward-Only Sampling for Diffusion Image Editing
The SAGE framework leverages DDIM (Denoising Diffusion Implicit Models) inversion to enable efficient and targeted editing of real images using pre-trained diffusion models. SageBwd implements the backward-only chain, sampling from the inverted trajectory under a new prompt while enforcing structural preservation via self-attention guidance.
The DDIM inversion pass encodes the input image $x$ to a latent $z_0 = \mathcal{E}(x)$ and computes a series of latents $z_1, \dots, z_T$ by running the DDIM update in reverse under the source prompt. Self-attention maps are recorded at each block.

On the backward pass, SageBwd runs from $z_T$ with sampling guided by the edited prompt $c'$. The DDIM backward step is

$$z_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\frac{z_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(z_t, t, c')}{\sqrt{\bar\alpha_t}} + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(z_t, t, c').$$

A self-attention loss enforces region fidelity by comparing the current maps against those recorded during inversion,

$$\mathcal{L}_{\mathrm{SA}} = \sum_l \left\| A_{t,l} - A_{t,l}^{\mathrm{inv}} \right\|_2^2,$$

followed by a gradient correction on the latent,

$$z_{t-1} \leftarrow z_{t-1} - \lambda_t \, \nabla_{z_{t-1}} \mathcal{L}_{\mathrm{SA}},$$

with time-decaying scale $\lambda_t$.
Common settings and hyperparameters include the number of DDIM timesteps $T$, the classifier-free guidance scale, the maximum self-attention guidance weight $\lambda_{\max}$, attention resolutions (32×32 self-attention maps for 512×512 images, 24×24 for 768×768), and FP16 loss scaling by 500 to avoid underflow (Gomez-Trenado et al., 14 May 2025).
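The deterministic DDIM backward update can be sketched in a few lines. The following NumPy fragment is an illustration only; `eps_theta` is a hypothetical stub standing in for the U-Net noise predictor, and the schedule values are arbitrary:

```python
import numpy as np

def ddim_backward_step(z_t, t, alpha_bar, eps_theta):
    """One deterministic DDIM step z_t -> z_{t-1} (eta = 0)."""
    a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
    eps = eps_theta(z_t, t)
    # Predict the clean latent, then re-noise it to step t-1.
    z0_pred = (z_t - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
    return np.sqrt(a_prev) * z0_pred + np.sqrt(1.0 - a_prev) * eps

# Stub predictor: always predicts zero noise (illustration only).
alpha_bar = np.linspace(0.9999, 0.05, 50)
z = np.random.default_rng(0).standard_normal((4, 64))
z_prev = ddim_backward_step(z, 10, alpha_bar, lambda z, t: np.zeros_like(z))
```

With a zero-noise stub the step reduces to a pure rescaling of the latent, which makes the formula easy to check against the equation above.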
2. SageBwd in SageAttention3: Low-Bit Backward Attention Kernel
SageBwd also denotes the backward kernel in SageAttention3, which applies INT8 quantization to the majority of attention matrix multiplications in the backward pass, dramatically accelerating gradient computation with minimal accuracy tradeoff.
For any block $X$ (FP16), per-block quantization computes a scale $\delta_X = \max(|X|)/127$ and the INT8 values $\hat X = \mathrm{round}(X/\delta_X)$; dequantization utilizes $X \approx \hat X \cdot \delta_X$. The kernel applies this quantization to $Q$, $K$, $V$, the softmax output $P$, the output gradient $dO$, and the intermediate gradients. The only exception is the $dO\,V^\top$ matmul, which remains in FP16 for stability:
- Six of the seven backward matmuls use INT8×INT8→INT32 Tensor-Core instructions, with results dequantized to FP32.
- The critical $dO\,V^\top$ matmul stays in FP16, which improves the cosine similarity of the resulting attention gradients from 97.47% to 99.77%.
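The per-block quantize/dequantize scheme can be illustrated as follows (a minimal NumPy sketch of the arithmetic, not the Triton kernel itself):

```python
import numpy as np

def quantize_int8(block):
    """Per-block symmetric INT8 quantization: delta = max|X| / 127."""
    delta = np.abs(block).max() / 127.0
    q = np.clip(np.rint(block / delta), -127, 127).astype(np.int8)
    return q, delta

def dequantize(q, delta):
    """Reconstruct an FP32 approximation of the original block."""
    return q.astype(np.float32) * delta

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64)).astype(np.float32)
q, delta = quantize_int8(x)
x_hat = dequantize(q, delta)
# Round-to-nearest bounds the error by half a quantization step.
max_err = np.abs(x - x_hat).max()
```

The per-block scale costs a single FP32 value per block, consistent with the negligible memory overhead quoted below.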
Kernel architecture uses Triton for producer/consumer warp scheduling; softmax maxima are reused for efficient normalization; per-block max-reductions are fused with warp-shuffle patterns for reduced overhead. Ping-pong scheduling overlaps loads, matmuls, quantization, and store steps (Zhang et al., 16 May 2025).
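Reusing the forward-pass softmax statistics means the backward pass can reconstruct $P$ without a fresh max-reduction. A NumPy sketch of the idea, assuming the kernel stores per-row max and log-sum-exp in the style of FlashAttention backward passes:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.standard_normal((8, 16)).astype(np.float32)  # pre-softmax scores

# Forward pass saves per-row max and log-sum-exp instead of P itself.
m = S.max(axis=1, keepdims=True)
lse = m + np.log(np.exp(S - m).sum(axis=1, keepdims=True))

# Backward pass reconstructs P from S and the saved statistics,
# with no additional max-reduction over the row.
P = np.exp(S - lse)
```

Each reconstructed row is already normalized, so the backward kernel skips the normalization reduction entirely.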
3. Algorithmic Pseudocode and Implementation
SAGE Backward Sampling (Image Editing) (Gomez-Trenado et al., 14 May 2025)
Key steps in SageBwd for SAGE:
- Encode the input image and perform DDIM inversion, storing self-attention maps.
- On the backward pass, compute noise predictions under both prompts.
- Apply classifier-free guidance.
- Execute DDIM backward sampling.
- Extract self-attention maps and calculate self-attention reconstruction loss.
- Backpropagate the loss for a gradient step in latent space.
- Optionally blend unedited regions or replace tokens via masks and cross-attention mechanisms.
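The loop structure above can be sketched end-to-end with toy stand-ins: `attn_map` below is a hypothetical placeholder for the diffusion model's self-attention, the DDIM denoising update is omitted, and a finite-difference gradient substitutes for autograd, so this shows only the guidance-correction mechanics:

```python
import numpy as np

def attn_map(z):
    """Toy stand-in for a self-attention map (row-softmax of z z^T)."""
    s = z @ z.T / np.sqrt(z.shape[1])
    e = np.exp(s - s.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def sa_loss(z, a_ref):
    """Self-attention reconstruction loss against inversion maps."""
    return float(np.sum((attn_map(z) - a_ref) ** 2))

def num_grad(f, z, eps=1e-4):
    """Finite-difference gradient (autograd stand-in, small latents only)."""
    g = np.zeros_like(z)
    for i in np.ndindex(z.shape):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        g[i] = (f(zp) - f(zm)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8))                 # current latent
a_ref = attn_map(rng.standard_normal((4, 8)))   # maps recorded at inversion
T = 10
loss_start = sa_loss(z, a_ref)
for t in range(T, 0, -1):
    lam = 0.1 * t / T                           # time-decaying guidance scale
    # (The DDIM denoising update with eps_theta would go here.)
    z = z - lam * num_grad(lambda v: sa_loss(v, a_ref), z)
loss_end = sa_loss(z, a_ref)
```

Each iteration pulls the current attention maps toward those recorded during inversion, which is the structural-preservation mechanism of the backward chain.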
SageBwd INT8 Attention Gradient Kernel (Zhang et al., 16 May 2025)
Principal workflow:
- Quantize forward activations and gradients per block using $\delta_X = \max(|X|)/127$.
- Initialize FP32 accumulators for $dQ$, $dK$, and $dV$.
- Iterate over block pairs, reconstruct $S$ and $P$ in FP32, quantize $P$, and update $dV$ with INT8 matmuls.
- Compute $dP = dO\,V^\top$ in FP16 for numerical stability.
- Quantize $dS$ and update $dQ$ and $dK$ with INT8 matmuls.
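The INT8 update for $dV$ can be illustrated against an FP32 reference. This is a NumPy sketch of the arithmetic only; real kernels use Tensor-Core INT8 MMA with INT32 accumulation:

```python
import numpy as np

def quantize_int8(x):
    """Per-block symmetric INT8 quantization with scale max|x|/127."""
    delta = np.abs(x).max() / 127.0
    return np.clip(np.rint(x / delta), -127, 127).astype(np.int8), delta

rng = np.random.default_rng(1)
P = rng.random((64, 64)).astype(np.float32)        # softmax probabilities
dO = rng.standard_normal((64, 32)).astype(np.float32)

# FP32 reference: dV = P^T dO.
dV_ref = P.T @ dO

# INT8 path: quantize both inputs, accumulate in INT32, dequantize once
# with the product of the two scales.
qP, sP = quantize_int8(P)
qO, sO = quantize_int8(dO)
acc = qP.T.astype(np.int32) @ qO.astype(np.int32)
dV_int8 = acc.astype(np.float32) * (sP * sO)

cos = float((dV_ref * dV_int8).sum() /
            (np.linalg.norm(dV_ref) * np.linalg.norm(dV_int8)))
```

The high cosine similarity of the quantized result against the FP32 reference is the kind of agreement the accuracy figures above report.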
4. Performance, Fidelity, and Empirical Metrics
In SAGE, SageBwd preserves unedited regions with high fidelity while enabling efficient region-targeted edits. It offers significant quantitative improvements, with SAGE ranking top in 7/10 benchmarking analyses and receiving unanimous user-study preference (Gomez-Trenado et al., 14 May 2025).
For attention kernels, SageBwd (8-bit) achieves:
- Kernel throughput of ≈2.0 TFLOP/s on RTX 4090 (1.67× over FlashAttention2 FP16).
- End-to-end latency reduction: 1.9 s/step for Llama 8k context vs. 2.1 s/step with BF16 FlashAttention2 (1.15× speedup).
- Minimal memory overhead (+4 bytes/block for INT8 scale), <0.1% of activation memory (Zhang et al., 16 May 2025).
Numerical accuracy is effectively lossless for fine-tuning: downstream metrics fall within ±0.5% of BF16. In full pretraining, SageBwd converges roughly 10–15% slower, primarily due to INT8 error accumulation across longer training runs (Zhang et al., 16 May 2025).
5. Critical Hyperparameters and Best Practices
SageBwd implementations depend on carefully tuned hyperparameters:
- SAGE: DDIM timesteps (up to 100), classifier-free guidance scale (up to 15), maximum self-attention guidance weight with linear or quadratic time decay, attention-map sizes matched to image resolution, and mask-based blending with cross-attention token replacement for fine-grained control.
- SageAttention3: retain the $dO\,V^\top$ matmul in FP16; use per-block quantization for all matrices except the softmax output $P$, which is quantized per token. Fuse quantization with the softmax max-reduction and reuse forward-pass softmax maxima in the backward pass.
Multi-GPU environments should interleave kernel execution with gradient accumulation for efficiency; quantization overhead is negligible in practice (Zhang et al., 16 May 2025).
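The per-block versus per-token distinction matters most for the softmax output, where a single near-one-hot row can inflate a shared block scale. A NumPy sketch comparing the two schemes on a synthetic block with one peaked row:

```python
import numpy as np

def quantize_per_block(x):
    """One INT8 scale for the whole block."""
    delta = np.abs(x).max() / 127.0
    return np.clip(np.rint(x / delta), -127, 127).astype(np.int8), delta

def quantize_per_token(x):
    """One INT8 scale per row (token): better for skewed softmax rows."""
    delta = np.abs(x).max(axis=1, keepdims=True) / 127.0
    return np.clip(np.rint(x / delta), -127, 127).astype(np.int8), delta

# Build a softmax output with one nearly one-hot row and diffuse rows.
rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 64))
logits[0] *= 20.0                                  # peaked row
P = np.exp(logits - logits.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)

qb, sb = quantize_per_block(P)
qt, st = quantize_per_token(P)
err_block = np.abs(P - qb.astype(np.float32) * sb).max()
err_token = np.abs(P - qt.astype(np.float32) * st).max()
```

Per-token scales shrink the quantization step for the diffuse rows, so the worst-case reconstruction error is never larger than under a shared block scale.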
6. Applications and Significance
SageBwd-augmented SAGE is specialized for high-fidelity, efficient real-image editing with pretrained diffusion models, enabling controlled, region-specific modifications that preserve non-edited content (Gomez-Trenado et al., 14 May 2025). SageBwd in SageAttention3 unlocks resource-efficient training for large models and fine-tuning scenarios, offering nearly lossless backward gradient propagation under aggressive quantization while reducing computational cost and latency (Zhang et al., 16 May 2025).
A plausible implication is that the separation of backward-only mechanisms in specialized workflows (structural guidance for editing and INT8 quantization for training) represents a key innovation in exploiting model invariances and computational hardware for domain-optimized deep learning pipelines.