
Next-Patch Diffusion: Patchwise Generative Modeling

Updated 2 February 2026
  • The paper introduces a patchwise diffusion framework that reduces memory overhead by processing image patches independently while retaining competitive image quality.
  • It employs innovative position-wise and global content conditioning to ensure spatial coherence and semantic consistency across patch boundaries.
  • Empirical evaluations on CelebA and LSUN demonstrate significant GPU memory savings and acceptable FID trade-offs, highlighting its potential for resource-constrained environments.

Next-Patch Diffusion, as introduced in "Memory Efficient Diffusion Probabilistic Models via Patch-based Generation" (Arakawa et al., 2023), is a framework for generative modeling that reduces memory overhead by restructuring the diffusion process into a patchwise paradigm. The approach partitions high-resolution images into non-overlapping patches, leverages explicit position encoding and global content features, and applies the reverse denoising process to each patch independently during both training and inference. This methodology maintains competitive image fidelity while yielding substantial resource savings, making it particularly suitable for deployment on edge devices.

1. Patchwise Diffusion Process Overview

The core principle of Next-Patch Diffusion is to perform the reverse diffusion process patch-by-patch instead of over the entire image simultaneously. An image of size $H \times W$ is split into $N \times N$ patches, each of size $H' = H/N$ and $W' = W/N$. Patches are indexed by $s \in \{0, \ldots, N^2 - 1\}$, with spatial coordinates $(i, j)$ given by $i = \lfloor s/N \rfloor$ and $j = s \bmod N$. At every reverse diffusion step, each patch is cropped, conditioned, denoised, and then written back to the global image tensor.
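
Under these definitions, the index-to-coordinate mapping and patch cropping can be sketched in a few lines of NumPy (the function name and toy image below are illustrative, not from the paper):

```python
import numpy as np

def crop_patch(x, s, N):
    """Crop patch s from an image tensor x of shape (C, H, W) on an N x N grid."""
    C, H, W = x.shape
    Hp, Wp = H // N, W // N          # patch size H' = H/N, W' = W/N
    i, j = s // N, s % N             # i = floor(s/N), j = s mod N
    return x[:, i * Hp:(i + 1) * Hp, j * Wp:(j + 1) * Wp]

x = np.arange(16.0).reshape(1, 4, 4)   # toy one-channel 4x4 "image"
p = crop_patch(x, s=3, N=2)            # bottom-right patch (i = 1, j = 1)
print(p.shape)                         # (1, 2, 2)
```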

Crucially, the forward diffusion kernel remains unchanged, and restricting it to a patch amounts to extracting the corresponding subregion of $x_t$. The reverse process, however, must maintain spatial coherence and semantic consistency while each patch is processed independently, motivating novel conditioning schemes.

2. Conditioning Mechanisms

Next-Patch Diffusion utilizes two explicit conditioning strategies to address information fragmentation:

a) Position-wise Conditioning

Each patch index $s$ is represented by a one-hot vector $e_s \in \{0,1\}^{N^2}$, projected via a learned fully-connected layer to form a position embedding $h_s = W_\mathrm{pos} e_s + b_\mathrm{pos}$. This embedding is incorporated (added or concatenated) into every U-Net block alongside the time-step embedding $\tau_t$, which is constructed by a small MLP from a sinusoidal encoding of $t$. This gives the denoising network explicit spatial awareness.
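
The position embedding reduces to a one-hot lookup followed by a linear projection; a minimal sketch, with random weights standing in for the learned $W_\mathrm{pos}$ and $b_\mathrm{pos}$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                      # 4 x 4 patch grid
D = 64                                     # embedding width, per the paper
W_pos = rng.standard_normal((D, N * N))    # stands in for the learned projection
b_pos = np.zeros(D)

def position_embedding(s):
    e = np.zeros(N * N)
    e[s] = 1.0                             # one-hot patch index e_s
    return W_pos @ e + b_pos               # h_s = W_pos e_s + b_pos

h = position_embedding(5)
print(h.shape)                             # (64,)
```

Since $e_s$ is one-hot, the projection simply selects column $s$ of $W_\mathrm{pos}$; the explicit matrix product above mirrors the formula.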

b) Global Content Conditioning (GCC)

At each reverse diffusion step, the global noisy image $x_t \in \mathbb{R}^{C \times H \times W}$ is average-pooled with kernel and stride $(N, N)$ to form $g \in \mathbb{R}^{C \times H' \times W'}$. For each patch, $g$ is concatenated along the channel dimension, yielding a $2C \times H' \times W'$ input to the denoiser. This global context feature enforces coarse layout consistency across patch boundaries.
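
The GCC feature is plain strided average pooling; a minimal sketch (function name and toy shapes are illustrative):

```python
import numpy as np

def global_content(x_t, N):
    """Average-pool x_t (C, H, W) with kernel and stride (N, N) -> (C, H/N, W/N)."""
    C, H, W = x_t.shape
    # Group each N x N block into its own axes, then average over them.
    return x_t.reshape(C, H // N, N, W // N, N).mean(axis=(2, 4))

x_t = np.ones((3, 128, 128))               # toy global noisy image
g = global_content(x_t, N=4)
print(g.shape)                             # (3, 32, 32), i.e. (C, H', W') for N = 4
```

For each patch, `np.concatenate([patch, g], axis=0)` then yields the $(2C, H', W')$ denoiser input described above.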

3. Patch-level Training and Inference

The inference pipeline iterates over reversed diffusion steps $t = T, \ldots, 1$, each involving:

  • Extraction of global context gg via average pooling.
  • Iterative processing of each patch ss:

    • Cropping patch $x_t^{(s)}$.
    • Computing positional and temporal embeddings.
    • Concatenating $x_t^{(s)}$ and $g$, and feeding the result to the denoiser $\epsilon_\theta$.
    • Sampling the denoised patch using the DDPM update:

      $$\mu = \frac{1}{\sqrt{\alpha_t}} \left( x_t^{(s)} - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \, \hat{\epsilon} \right), \qquad x_{t-1}^{(s)} = \mu + \sigma_t z, \quad z \sim \mathcal{N}(0, I)$$

    • Writing the denoised patch back to its place in the spatial grid.
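
The per-patch DDPM update in the steps above can be sketched directly; here a zero placeholder stands in for the denoiser output $\hat{\epsilon}$, and the scalar schedule values are arbitrary:

```python
import numpy as np

def ddpm_patch_step(x_t_patch, eps_hat, alpha_t, alpha_bar_t, beta_t, sigma_t, rng):
    """One reverse DDPM update for a single patch."""
    # mu = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
    mu = (x_t_patch - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_t)
    z = rng.standard_normal(x_t_patch.shape)   # fresh Gaussian noise
    return mu + sigma_t * z

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 32, 32))
eps_hat = np.zeros_like(x)                     # placeholder for the denoiser output
x_prev = ddpm_patch_step(x, eps_hat, alpha_t=0.99, alpha_bar_t=0.5,
                         beta_t=0.01, sigma_t=0.0, rng=rng)
```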

The forward process on the full image follows the standard DDPM formulation:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} \, x_{t-1}, \beta_t I)$$

and the per-patch restriction similarly:

$$q(x_t^{(s)} \mid x_{t-1}^{(s)}) = \mathcal{N}(x_t^{(s)}; \sqrt{1-\beta_t} \, x_{t-1}^{(s)}, \beta_t I)$$

Training uses the standard DDPM noise prediction objective but applied to each patch:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon, s}\left[ \left\| \epsilon - \epsilon_\theta\left(\mathrm{concat}[\tilde{x}_t^{(s)}, g], \, t, \, h_s\right) \right\|^2 \right]$$

with $\tilde{x}_t^{(s)} = \sqrt{\bar{\alpha}_t} \, x_0^{(s)} + \sqrt{1-\bar{\alpha}_t} \, \epsilon$.
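
A sketch of this per-patch objective, with a placeholder function standing in for $\epsilon_\theta$ and the position-embedding argument $h_s$ omitted for brevity:

```python
import numpy as np

def patch_loss(x0_patch, g, t, alpha_bar_t, eps_theta, rng):
    """Noise one patch and compute the DDPM noise-prediction loss for it."""
    eps = rng.standard_normal(x0_patch.shape)
    # x_tilde = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    x_tilde = np.sqrt(alpha_bar_t) * x0_patch + np.sqrt(1.0 - alpha_bar_t) * eps
    inp = np.concatenate([x_tilde, g], axis=0)   # (2C, H', W') denoiser input
    return np.mean((eps - eps_theta(inp, t)) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))            # clean patch x_0^(s)
g = rng.standard_normal((3, 32, 32))             # global content feature
dummy = lambda inp, t: np.zeros((3, 32, 32))     # placeholder for eps_theta
loss = patch_loss(x0, g, t=500, alpha_bar_t=0.3, eps_theta=dummy, rng=rng)
```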

4. Modified U-Net Denoiser Architecture

The denoising backbone follows a U-Net architecture reminiscent of DDPM++ (cf. [Ho et al.]), with specific adaptations for patchwise operation:

  • Input features: $(2C, H', W')$, concatenated from the noisy patch and the global content feature.
  • Encoder: successive Conv2d layers, GroupNorm, Swish activations, and downsampling.
  • Bottleneck: self-attention on deep features.
  • Decoder: mirrored upsampling, Conv2d, and normalization.
  • Block-wise conditioning: both $\tau_t$ (256-dimensional sinusoidal time embedding) and $h_s$ (64-dimensional position embedding) are processed via small MLPs and added as biases to the activations at every block.

Channel widths and block counts mirror DDPM++ but operate on reduced spatial resolution corresponding to patch size.
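The block-wise bias conditioning amounts to projecting both embeddings to the block's channel count and broadcasting over the spatial dimensions; in this sketch, single linear layers stand in for the small per-block MLPs:

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8                                     # channel count of this U-Net block
tau_t = rng.standard_normal(256)          # 256-d time embedding
h_s = rng.standard_normal(64)             # 64-d position embedding
W_tau = rng.standard_normal((C, 256))     # single linear layers stand in for
W_h = rng.standard_normal((C, 64))        # the small per-block MLPs

def condition(feat):
    bias = W_tau @ tau_t + W_h @ h_s      # project both embeddings to C channels
    return feat + bias[:, None, None]     # add as per-channel bias over space

out = condition(np.zeros((C, 16, 16)))
print(out.shape)                          # (8, 16, 16)
```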

5. Quantitative and Qualitative Evaluation

The framework demonstrates a moderate trade-off between memory efficiency and sample quality. On CelebA and LSUN Bedroom (both at $128^2$ resolution), Fréchet Inception Distance (FID) and maximum GPU memory usage (GB) for varying $N$ are as follows:

CelebA Results

| Division $N$ | FID ↓ | Max GPU mem (GB) |
|--------------|-------|------------------|
| 1 (full)     | 14.8  | 7.46             |
| 2×2          | 14.0  | 3.25             |
| 4×4          | 13.6  | 2.14             |
| 8×8          | 16.2  | 1.83             |

LSUN Bedroom Results

| Division $N$ | FID ↓ | Max GPU mem (GB) |
|--------------|-------|------------------|
| 1 (full)     | 22.4  | 7.46             |
| 2×2          | 24.1  | 3.25             |
| 4×4          | 28.8  | 2.14             |
| 8×8          | 66.1  | 1.83             |

Qualitatively, patch seams become evident at $N \ge 4$, with complex scenes (LSUN) experiencing degraded coherence. For $N = 2$, visual consistency is retained while maximum memory usage is more than halved. This suggests that coarse partitioning ($N = 2$) affords significant gains at minimal perceptual cost, whereas finer patching amplifies discontinuities.

6. Implementation Considerations

  • Patch extraction uses no overlap.
  • Sinusoidal time embeddings $\tau_t$ are 256-dimensional.
  • Position embeddings $h_s$ are 64-dimensional.
  • The model is trained with batch size 64 and the Adam optimizer (learning rate $1 \times 10^{-4}$) for 800,000 iterations.
  • Diffusion runs $T = 1000$ steps, using a cosine schedule for the noise variance $\beta_t$.
  • GCC features are recalculated at every forward and reverse step.
  • The approach is compatible with DDPM++ configuration, granting easy adaptation to existing pipelines.
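
The cosine variance schedule mentioned above is a standard formulation (due to Nichol & Dhariwal, 2021); a common implementation sketch:

```python
import numpy as np

def cosine_beta_schedule(T, s=0.008):
    """Cosine noise schedule for beta_t, t = 1..T (Nichol & Dhariwal, 2021)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                       # normalized cumulative alpha_bar_t
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, 0.999)          # clip to avoid singularities near t = T

betas = cosine_beta_schedule(1000)
print(betas.shape)                             # (1000,)
```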

7. Significance and Potential Extensions

Next-Patch Diffusion provides an effective mechanism to circumvent the memory-intensive nature of full-image DDPMs, particularly when scaling to high spatial resolutions or deploying on VRAM-constrained hardware. The explicit separation of position and global content signals facilitates synthesis across independently processed patches. A plausible implication is that further improvements may arise from investigating overlap, attention across patches, or dynamic patch sizes, although these modifications are not substantiated by current empirical evidence in (Arakawa et al., 2023). The approach establishes a practical baseline for memory-optimized generative modeling without severe quality penalties, especially for portrait and simple scene datasets.
