Next-Patch Diffusion: Patchwise Generative Modeling
- The paper introduces a patchwise diffusion framework that reduces memory overhead by processing image patches independently while retaining competitive image quality.
- It employs innovative position-wise and global content conditioning to ensure spatial coherence and semantic consistency across patch boundaries.
- Empirical evaluations on CelebA and LSUN demonstrate significant GPU memory savings and acceptable FID trade-offs, highlighting its potential for resource-constrained environments.
Next-Patch Diffusion, as introduced in "Memory Efficient Diffusion Probabilistic Models via Patch-based Generation" (Arakawa et al., 2023), is a framework for generative modeling that reduces memory overhead by restructuring the diffusion process into a patchwise paradigm. The approach partitions high-resolution images into non-overlapping patches, leverages explicit position encoding and global content features, and applies the reverse denoising process on each patch independently during both training and inference. This methodology maintains competitive image fidelity while yielding substantial resource savings, particularly suited for deployment on edge devices.
1. Patchwise Diffusion Process Overview
The core principle of Next-Patch Diffusion is to perform the reverse diffusion process patch-by-patch instead of over the entire image simultaneously. An image of size $H \times W$ is split into $n \times n$ non-overlapping patches, each of size $h \times w$ with $h = H/n$ and $w = W/n$. Patches are indexed by $(i, j)$, $0 \le i, j < n$, with top-left spatial coordinates $(i \cdot h,\ j \cdot w)$. At every reverse diffusion step, each patch is cropped, conditioned, denoised, and then written back to the global image tensor.
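The partitioning above can be sketched in a few lines of NumPy (illustrative code; the function names are not from the paper):

```python
import numpy as np

def split_into_patches(image, n):
    """Split a (C, H, W) array into n*n non-overlapping (C, H//n, W//n) patches."""
    C, H, W = image.shape
    h, w = H // n, W // n
    patches = {}
    for i in range(n):
        for j in range(n):
            # Patch (i, j) covers rows [i*h, (i+1)*h) and cols [j*w, (j+1)*w).
            patches[(i, j)] = image[:, i * h:(i + 1) * h, j * w:(j + 1) * w]
    return patches

def merge_patches(patches, n, C, H, W):
    """Write each patch back to its place in the global tensor."""
    out = np.empty((C, H, W), dtype=next(iter(patches.values())).dtype)
    h, w = H // n, W // n
    for (i, j), p in patches.items():
        out[:, i * h:(i + 1) * h, j * w:(j + 1) * w] = p
    return out
```

Because the patches do not overlap, splitting followed by merging reconstructs the image exactly.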
Crucially, the forward diffusion kernel $q(x_t \mid x_{t-1})$ remains unchanged, and its restriction to a patch amounts to extracting the corresponding subregion of $x_t$. The reverse process, however, must maintain spatial coherence and semantic consistency even though each patch is processed independently, motivating novel conditioning schemes.
2. Conditioning Mechanisms
Next-Patch Diffusion utilizes two explicit conditioning strategies to address information fragmentation:
a) Position-wise Conditioning
Each patch index $(i, j)$ is represented as a one-hot vector $p_{ij} \in \{0, 1\}^{n^2}$ and projected through a learned fully-connected layer to form a position embedding $e_{pos} \in \mathbb{R}^{64}$. This embedding is incorporated (added or concatenated) into every U-Net block alongside the time-step embedding $e_t$, which a small MLP constructs from a sinusoidal encoding of $t$. This gives the denoising network explicit spatial awareness.
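A minimal NumPy sketch of this conditioning, with a random matrix standing in for the learned fully-connected layer (all names here are illustrative assumptions, not the paper's code):

```python
import numpy as np

n = 4                                     # 4x4 division -> n*n = 16 positions
d_pos = 64                                # position-embedding width from the text
rng = np.random.default_rng(0)
W_pos = rng.normal(size=(n * n, d_pos))   # stands in for the learned FC weights

def position_embedding(i, j, n, W_pos):
    """Project the one-hot encoding of patch (i, j) to a d_pos-dim embedding."""
    one_hot = np.zeros(n * n)
    one_hot[i * n + j] = 1.0              # flatten (i, j) to a single index
    return one_hot @ W_pos                # equals selecting row i*n + j of W_pos

e_pos = position_embedding(2, 1, n, W_pos)
```

Note that projecting a one-hot vector is equivalent to a row lookup, so in practice this layer acts as a learned embedding table over the $n^2$ patch positions.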
b) Global Content Conditioning (GCC)
At each reverse diffusion step, the global noisy image $x_t$ is average-pooled with an $n \times n$ kernel and stride $n$ to form a coarse feature map $g_t \in \mathbb{R}^{C \times h \times w}$ at patch resolution, where $C$ is the number of image channels. For each patch, $g_t$ is concatenated along the channel dimension, yielding a $2C$-channel input to the denoiser. This global context feature enforces coarse layout consistency across patch boundaries.
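The pooling-and-concatenation step can be sketched as follows (illustrative NumPy; a reshape-then-mean implements average pooling with kernel $n \times n$ and stride $n$):

```python
import numpy as np

def global_content(x_t, n):
    """Average-pool a (C, H, W) image with an n x n kernel and stride n,
    producing a (C, H//n, W//n) coarse summary at patch resolution."""
    C, H, W = x_t.shape
    return x_t.reshape(C, H // n, n, W // n, n).mean(axis=(2, 4))

C, H, W, n = 3, 8, 8, 2
x_t = np.arange(C * H * W, dtype=float).reshape(C, H, W)
g_t = global_content(x_t, n)                   # shape (3, 4, 4)
patch = x_t[:, :H // n, :W // n]               # patch (0, 0)
denoiser_input = np.concatenate([patch, g_t])  # (2C, H//n, W//n) channels
```

Each pooled pixel summarizes one $n \times n$ neighborhood of the full image, so every patch sees the same coarse layout of the whole scene.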
3. Patch-level Training and Inference
The inference pipeline iterates the reverse diffusion steps $t = T, T-1, \dots, 1$, each involving:
- Extraction of the global context $g_t$ via average pooling.
- Iterative processing of each patch $(i, j)$:
  - Cropping patch $x_t^{(i,j)}$.
  - Computing the positional and temporal embeddings $e_{pos}$ and $e_t$.
  - Concatenating $x_t^{(i,j)}$ and $g_t$ along the channel dimension and feeding the result to the denoiser $\epsilon_\theta$.
  - Sampling the denoised patch using the DDPM update:
    $$x_{t-1}^{(i,j)} = \frac{1}{\sqrt{\alpha_t}}\left(x_t^{(i,j)} - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta\!\left(x_t^{(i,j)}, g_t, t, p_{ij}\right)\right) + \sigma_t z, \quad z \sim \mathcal{N}(0, I).$$
  - Writing the denoised patch back to its place in the spatial grid.
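The per-step loop above can be sketched in NumPy; here a zero-output stub replaces the conditioned U-Net, so this shows control flow only, not a real sampler:

```python
import numpy as np

def denoiser_stub(patch, g_t, t, pos):
    """Stand-in for the conditioned U-Net epsilon-predictor."""
    return np.zeros_like(patch)

def reverse_step(x_t, t, n, alphas, alphas_bar, sigma, rng):
    """One patchwise reverse step: pool global context, then denoise each patch."""
    C, H, W = x_t.shape
    h, w = H // n, W // n
    g_t = x_t.reshape(C, H // n, n, W // n, n).mean(axis=(2, 4))  # global context
    x_prev = np.empty_like(x_t)
    for i in range(n):
        for j in range(n):
            patch = x_t[:, i*h:(i+1)*h, j*w:(j+1)*w]
            eps_hat = denoiser_stub(patch, g_t, t, (i, j))
            z = rng.normal(size=patch.shape) if t > 1 else 0.0
            # DDPM posterior mean for this patch, plus noise
            mean = (patch - (1 - alphas[t]) / np.sqrt(1 - alphas_bar[t]) * eps_hat) \
                   / np.sqrt(alphas[t])
            x_prev[:, i*h:(i+1)*h, j*w:(j+1)*w] = mean + sigma[t] * z
    return x_prev
```

Only one patch-sized activation ever lives in the denoiser at a time, which is the source of the memory savings.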
The forward process on the full image follows the standard DDPM formulation:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)I\right),$$
and the per-patch restriction similarly:
$$q\!\left(x_t^{(i,j)} \mid x_0^{(i,j)}\right) = \mathcal{N}\!\left(x_t^{(i,j)};\ \sqrt{\bar{\alpha}_t}\,x_0^{(i,j)},\ (1-\bar{\alpha}_t)I\right).$$
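Because the forward kernel adds independent Gaussian noise per pixel, noising the full image and then cropping matches applying the same closed form directly to the crop. A quick numerical check (notation assumed from the surrounding text):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8
x0 = rng.normal(size=(3, H, W))
abar_t = 0.5                                   # \bar{\alpha}_t at some step t
eps = rng.normal(size=x0.shape)                # one full-image noise draw

# Forward process on the full image, then restricted to patch (0, 0) of a 2x2 split.
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps
h = w = H // 2
x_t_patch = np.sqrt(abar_t) * x0[:, :h, :w] + np.sqrt(1 - abar_t) * eps[:, :h, :w]
# x_t[:, :h, :w] and x_t_patch are identical: noising commutes with cropping.
```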
Training uses the standard DDPM noise-prediction objective, applied to each patch:
$$L = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(x_t^{(i,j)}, g_t, t, p_{ij}\right)\right\|^2\right],$$
with $x_t^{(i,j)} = \sqrt{\bar{\alpha}_t}\,x_0^{(i,j)} + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$.
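A sketch of this per-patch objective for one image (illustrative NumPy; `epsilon_theta` is a zero-output stand-in for the conditioned U-Net):

```python
import numpy as np

def epsilon_theta(noisy_patch, g_t, t, pos):
    """Stand-in for the conditioned U-Net noise predictor."""
    return np.zeros_like(noisy_patch)

def patch_loss(x0, n, abar, t, rng):
    """Noise-prediction MSE averaged over all patches of one image."""
    C, H, W = x0.shape
    h, w = H // n, W // n
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(abar[t]) * x0 + np.sqrt(1 - abar[t]) * eps      # noised image
    g_t = x_t.reshape(C, H // n, n, W // n, n).mean(axis=(2, 4))  # global context
    losses = []
    for i in range(n):
        for j in range(n):
            sl = np.s_[:, i*h:(i+1)*h, j*w:(j+1)*w]
            eps_hat = epsilon_theta(x_t[sl], g_t, t, (i, j))
            losses.append(np.mean((eps[sl] - eps_hat) ** 2))      # ||eps - eps_hat||^2
    return float(np.mean(losses))
```

In practice one would sample a random $t$ and patch index per training example rather than summing over all patches, but the loss term itself is as above.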
4. Modified U-Net Denoiser Architecture
The denoising backbone follows a U-Net architecture in the style of DDPM++ (Song et al., 2021), with specific adaptations for patchwise operation:
- Input features: $2C$ channels, formed by concatenating the noisy patch with the pooled global content feature $g_t$.
- Encoder: successive Conv2d layers, GroupNorm, Swish activations, and downsampling.
- Bottleneck: self-attention on deep features.
- Decoder: mirrored upsampling, Conv2d, and normalization.
- Block-wise conditioning: both $e_t$ (256-dimensional sinusoidal time embedding) and $e_{pos}$ (64-dimensional position embedding) are processed by small MLPs and added as a bias to the activations at every block.
Channel widths and block counts mirror DDPM++ but operate on reduced spatial resolution corresponding to patch size.
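The block-wise conditioning reduces to a per-channel bias add, sketched here with random projections standing in for the small MLPs (shapes and names are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
C_feat, d_t, d_pos = 128, 256, 64
feat = rng.normal(size=(C_feat, 16, 16))   # activations inside one U-Net block
e_t = rng.normal(size=d_t)                 # 256-dim sinusoidal time embedding
e_pos = rng.normal(size=d_pos)             # 64-dim position embedding
W_t = rng.normal(size=(d_t, C_feat))       # stand-in for the time-embedding MLP
W_p = rng.normal(size=(d_pos, C_feat))     # stand-in for the position-embedding MLP

bias = e_t @ W_t + e_pos @ W_p             # one scalar per feature channel
conditioned = feat + bias[:, None, None]   # broadcast the bias over spatial dims
```

The same bias is applied at every spatial location of the block, so the conditioning modulates channels rather than individual pixels.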
5. Quantitative and Qualitative Evaluation
The framework demonstrates a moderate trade-off between memory efficiency and sample quality. On CelebA and LSUN Bedroom, the Fréchet Inception Distance (FID) and maximum GPU memory usage (GB) for varying patch division $n \times n$ are as follows:
CelebA Results
| Division | FID ↓ | Max GPU mem (GB) |
|---|---|---|
| 1 (full) | 14.8 | 7.46 |
| 2×2 | 14.0 | 3.25 |
| 4×4 | 13.6 | 2.14 |
| 8×8 | 16.2 | 1.83 |
LSUN Bedroom Results
| Division | FID ↓ | Max GPU mem (GB) |
|---|---|---|
| 1 (full) | 22.4 | 7.46 |
| 2×2 | 24.1 | 3.25 |
| 4×4 | 28.8 | 2.14 |
| 8×8 | 66.1 | 1.83 |
Qualitatively, patch seams become evident at the $8 \times 8$ division, with complex scenes (LSUN) suffering the largest loss of coherence. At $2 \times 2$, visual consistency is retained while maximum memory usage drops by more than half. This suggests that coarse partitioning ($2 \times 2$ or $4 \times 4$) affords significant gains at minimal perceptual cost, whereas finer patching amplifies discontinuities.
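The memory side of the trade-off can be read off the tables directly; a quick computation of the implied reduction factors (identical for both datasets, since peak memory depends only on the division):

```python
# Peak GPU memory (GB) per division, from the tables above.
mem_gb = {"1 (full)": 7.46, "2x2": 3.25, "4x4": 2.14, "8x8": 1.83}

# Reduction factor relative to full-image generation.
reduction = {k: round(mem_gb["1 (full)"] / v, 2) for k, v in mem_gb.items()}
# 2x2 already cuts peak memory by more than half; 8x8 gives ~4x but degrades FID.
```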
6. Implementation Considerations
- Patch extraction uses no overlap.
- Sinusoidal time embeddings are 256-dimensional.
- Position embeddings have 64 dimensions.
- The model is trained with batch size 64 using the Adam optimizer for 800,000 iterations.
- Diffusion runs for $T$ steps, using a cosine schedule for the noise variance $\beta_t$.
- GCC features are recalculated at every forward and reverse step.
- The approach is compatible with DDPM++ configuration, granting easy adaptation to existing pipelines.
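Assuming the "cosine schedule" refers to the standard formulation of Nichol & Dhariwal (2021), it can be sketched as follows ($s$ is the usual small offset):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal fraction \bar{\alpha}_t for t = 0..T (cosine schedule)."""
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]                        # normalized so \bar{\alpha}_0 = 1

def betas_from_alpha_bar(abar):
    """Recover per-step variances: beta_t = 1 - \bar{\alpha}_t / \bar{\alpha}_{t-1}."""
    betas = 1 - abar[1:] / abar[:-1]
    return np.clip(betas, 0.0, 0.999)      # clip for numerical stability near t = T

abar = cosine_alpha_bar(1000)
betas = betas_from_alpha_bar(abar)
```

The schedule is monotonically decreasing in $\bar{\alpha}_t$, which keeps the noise growth smooth at both ends of the trajectory.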
7. Significance and Potential Extensions
Next-Patch Diffusion provides an effective mechanism to circumvent the memory-intensive nature of full-image DDPMs, particularly when scaling to high spatial resolutions or deploying on VRAM-constrained hardware. The explicit separation of position and global content signals facilitates synthesis across independently processed patches. A plausible implication is that further improvements may arise from investigating overlap, attention across patches, or dynamic patch sizes, although these modifications are not substantiated by current empirical evidence in (Arakawa et al., 2023). The approach establishes a practical baseline for memory-optimized generative modeling without severe quality penalties, especially for portrait and simple scene datasets.