Pyramidal Patchification Flow (PPFlow)

Updated 6 March 2026

Pyramidal Patchification Flow (PPFlow) is a framework that accelerates image generation by dynamically adjusting patch sizes during the denoising process.
It applies a pyramidal scheduling method with distinct learnable input/output projections at varying patch sizes without altering the core diffusion transformer architecture.
Empirical results show up to 2.0× inference speedup and improved FID scores, demonstrating efficient integration with DDPM/DDIM samplers for high-fidelity outputs.

Pyramidal Patchification Flow (PPFlow) is a framework for accelerating diffusion transformers (DiTs) in visual generation by dynamically adjusting patchification granularity throughout the denoising timeline. Rather than using a fixed patch size for the full reverse diffusion process, PPFlow segments time into pyramidal intervals, applying larger patches (lower token count) at high-noise (early) timesteps and smaller patches (higher fidelity) at low-noise (late) steps. This approach preserves full-resolution latent representations and leverages distinct, learnable input/output projections for each patch size, integrating with standard DDPM/DDIM samplers while requiring no core changes to DiT blocks or auxiliary renoising mechanisms. Empirically, PPFlow delivers up to 2.0× inference speedup with equivalent or improved image generation metrics relative to baseline DiTs (Li et al., 30 Jun 2025).

1. Formal Definition and Patchification Schedule

PPFlow partitions the normalized diffusion timeline $t \in [0,1]$, or discrete timesteps $t = 0, \ldots, T$, into $L$ contiguous intervals $[\tau_0, \tau_1), [\tau_1, \tau_2), \ldots, [\tau_{L-1}, \tau_L]$ with $\tau_0 = 0$ and $\tau_L = 1$. Each interval $\mathcal{I}_l = [\tau_{l-1}, \tau_l)$ is assigned a patch size $p_l$ (in latent pixels). High-noise (early) stages use larger $p_l$ to minimize token count; low-noise (late) stages use smaller $p_l$ to maximize detail.

The patch size as a function of (continuous) time is

$$p(t) = p_l \quad \text{for } t \in [\tau_{l-1}, \tau_l), \quad l = 1, \ldots, L, \quad \text{with } p(1) = p_L.$$

At discrete step $t$, the patch size is $p_t = p(t/T)$. Example schedules include:

2-level: $L = 2$, a single breakpoint $\tau_1$, and patch sizes $(p_1, p_2) = (4, 2)$.
3-level: $L = 3$, breakpoints $\tau_1 < \tau_2$, and patch sizes $(p_1, p_2, p_3) = (4, 2\sqrt{2}, 2)$ (with the non-integer size $2\sqrt{2}$ realized by grouping 2×2 patches).
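The interval lookup behind the schedule can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `patch_size_at` and the breakpoint value 0.5 are assumptions made for the example, and only the 4→2 patch sizes come from the text.

```python
from bisect import bisect_right


def patch_size_at(t, breakpoints, patch_sizes):
    """Return the patch size assigned to normalized time t in [0, 1].

    breakpoints: interior breakpoints [tau_1, ..., tau_{L-1}], ascending.
    patch_sizes: [p_1, ..., p_L], one entry per interval.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if len(patch_sizes) != len(breakpoints) + 1:
        raise ValueError("need exactly one patch size per interval")
    # bisect_right counts the breakpoints <= t, so t == tau_l lands in the
    # next interval, matching the half-open convention [tau_{l-1}, tau_l).
    return patch_sizes[bisect_right(breakpoints, t)]


# Hypothetical 2-level schedule; the breakpoint 0.5 is an assumed value.
schedule = [patch_size_at(k / 4, [0.5], [4, 2]) for k in range(5)]
print(schedule)
```

Using `bisect_right` keeps the lookup logarithmic in the number of stages and makes the boundary convention explicit.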

2. Mathematical Formulation and Operators

Given latent $z \in \mathbb{R}^{H \times W \times C}$ and patch size $p$:

Number of patches: $N = (H/p) \cdot (W/p) = HW/p^2$.
Per-patch dimension: $d_p = p^2 C$.

Patchify Operator

$$\mathrm{Patchify}_p : \mathbb{R}^{H \times W \times C} \to \mathbb{R}^{N \times D}$$

Extract non-overlapping patches $z^{(i)}$, $i = 1, \ldots, N$, each of size $p \times p \times C$.
Flatten $z^{(i)} \mapsto v_i \in \mathbb{R}^{p^2 C}$.
Project: $x_i = W_{\mathrm{in}}^{(p)} v_i$, where $W_{\mathrm{in}}^{(p)} \in \mathbb{R}^{D \times p^2 C}$ and $D$ is the transformer hidden width. Output $X = [x_1, \ldots, x_N]^\top \in \mathbb{R}^{N \times D}$.

Unpatchify Operator

$$\mathrm{Unpatchify}_p : \mathbb{R}^{N \times D} \to \mathbb{R}^{H \times W \times C}$$

For each token $y_i \in \mathbb{R}^{D}$, recover $\hat{v}_i = W_{\mathrm{out}}^{(p)} y_i$ with $W_{\mathrm{out}}^{(p)} \in \mathbb{R}^{p^2 C \times D}$.
Reshape $\hat{v}_i$ to $p \times p \times C$ and re-tile to reconstruct $\hat{z} \in \mathbb{R}^{H \times W \times C}$.
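The two operators can be sketched with NumPy reshapes. This is an illustrative sketch under assumptions: the latent shape 32×32×4, the hidden width 64, and the function names are chosen for the example, and the pseudo-inverse output projection is used only to verify the tiling round trip (in PPFlow both projections are learned).

```python
import numpy as np


def patchify(z, p, w_in):
    """Split z (H, W, C) into non-overlapping p x p patches and project them.

    w_in has shape (p*p*C, D); the result has shape (N, D), N = (H/p)*(W/p).
    """
    H, W, C = z.shape
    assert H % p == 0 and W % p == 0, "patch size must divide H and W"
    gh, gw = H // p, W // p
    # (gh, p, gw, p, C) -> (gh, gw, p, p, C): one p x p x C block per patch.
    patches = z.reshape(gh, p, gw, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, p * p * C) @ w_in


def unpatchify(tokens, p, w_out, shape):
    """Project tokens (N, D) back to patches and re-tile them to (H, W, C)."""
    H, W, C = shape
    gh, gw = H // p, W // p
    flat = tokens @ w_out  # (N, p*p*C)
    patches = flat.reshape(gh, gw, p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(H, W, C)


rng = np.random.default_rng(0)
z = rng.standard_normal((32, 32, 4))  # assumed latent shape
D = 64                                # assumed (small) hidden width
token_counts, round_trip_ok = {}, {}
for p in (4, 2):
    w_in = rng.standard_normal((p * p * 4, D))
    w_out = np.linalg.pinv(w_in)      # only for the round-trip check
    tokens = patchify(z, p, w_in)
    token_counts[p] = tokens.shape[0]
    round_trip_ok[p] = np.allclose(unpatchify(tokens, p, w_out, z.shape), z)
print(token_counts, round_trip_ok)
```

Halving the patch size quadruples the token count, which is where the cost difference between stages comes from.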

Integration with Diffusion Sampling

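Noise prediction $\hat{\epsilon}_\theta(z_t, t) = \mathrm{Unpatchify}_{p_t}\big(\mathrm{DiT}_\theta(\mathrm{Patchify}_{p_t}(z_t), t)\big)$, where $p_t$ is the patch size at step $t$; DDPM update:

$$z_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( z_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}} \, \hat{\epsilon}_\theta(z_t, t) \right) + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

Here $\alpha_t$, $\bar{\alpha}_t$, and $\sigma_t$ are the usual DDPM coefficients. The standard DDIM update applies in the same way, with $\hat{\epsilon}_\theta$ substituted into the corresponding re-parameterization.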
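Because the noise prediction is returned at full latent resolution regardless of the patch size used inside the transformer, the sampler update itself does not need to know about patchification. As a hedged illustration, here is the deterministic DDIM step ($\eta = 0$) written as a plain function; the function name and the cumulative-alpha values are assumptions made for the example.

```python
import numpy as np


def ddim_step(z_t, eps_hat, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev.

    z_t and eps_hat are full-resolution arrays of the same shape; the patch
    size used to compute eps_hat does not appear anywhere in the update.
    """
    z0_hat = (z_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * z0_hat + np.sqrt(1.0 - abar_prev) * eps_hat


# Sanity check with an oracle noise prediction (illustrative values only).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((32, 32, 4))
eps = rng.standard_normal((32, 32, 4))
abar_t, abar_prev = 0.3, 0.6
z_t = np.sqrt(abar_t) * z0 + np.sqrt(1.0 - abar_t) * eps
z_prev = ddim_step(z_t, eps, abar_t, abar_prev)
expected = np.sqrt(abar_prev) * z0 + np.sqrt(1.0 - abar_prev) * eps
print(np.allclose(z_prev, expected))
```

With the true noise supplied as the prediction, the step lands exactly on the less noisy point of the same trajectory, which is a convenient unit test for any sampler wrapper.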

3. Architectural and Implementation Considerations

PPFlow contributes $L$ distinct pairs of projection matrices $(W_{\mathrm{in}}^{(p_l)}, W_{\mathrm{out}}^{(p_l)})$ (one per patch size), added outside the intact DiT/SiT transformer blocks. All core model parameters (self-attention, MLPs) are shared, and the internal sequence length dynamically matches the number of patches determined by $p(t)$. During training, packing techniques such as "patch n' pack" are used to efficiently batch samples with varying token counts.
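To get a feel for how small the added projections are, the following sketch counts their weights. The hidden width 768 and the 4 latent channels are assumed values for illustration (they are not stated in this article), biases are ignored, and `projection_param_count` is a hypothetical helper.

```python
def projection_param_count(patch_sizes, channels, hidden_dim):
    """Weights in the per-patch-size input/output projection pairs.

    Each pair holds an input matrix (p*p*C x D) and an output matrix
    (D x p*p*C); biases are ignored in this rough count.
    """
    counts = {}
    for p in patch_sizes:
        patch_dim = p * p * channels
        counts[p] = 2 * patch_dim * hidden_dim
    return counts


# Assumed configuration: 4 latent channels, hidden width 768, stages 4 -> 2.
counts = projection_param_count([4, 2], channels=4, hidden_dim=768)
print(counts, sum(counts.values()))
```

Under these assumptions the two pairs together hold roughly 0.12M weights, which is negligible next to the shared transformer blocks.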

Positional embeddings are recalculated per sequence length (i.e., per-timestep patch size), using either 2D sinusoidal or learned positional codes. Optionally, a stage-dependent patch-level embedding $e_l$ is injected, which improves FID. No masking or specialized scheduling (such as renoising tricks) is required.
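One common way to build 2D sinusoidal codes for a given token grid is sketched below, so that each patch size gets an embedding table of matching length. This is an assumption-laden illustration rather than the paper's exact construction: the function names, the frequency base 10000, the latent size 32, and the width 768 are all chosen for the example.

```python
import numpy as np


def sincos_1d(positions, dim):
    """1D sinusoidal embedding of shape (len(positions), dim); dim must be even."""
    assert dim % 2 == 0
    freqs = 1.0 / (10000.0 ** (np.arange(dim // 2) / (dim // 2)))
    angles = np.outer(positions, freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def sincos_2d(grid_h, grid_w, dim):
    """2D embedding: half of the channels encode the row, half the column."""
    assert dim % 4 == 0
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    emb_rows = sincos_1d(rows.reshape(-1), dim // 2)
    emb_cols = sincos_1d(cols.reshape(-1), dim // 2)
    return np.concatenate([emb_rows, emb_cols], axis=1)  # (grid_h * grid_w, dim)


# One table per patch size, precomputed once (assumed latent size and width).
latent_size, dim = 32, 768
pos_embed = {p: sincos_2d(latent_size // p, latent_size // p, dim) for p in (4, 2)}
print({p: e.shape for p, e in pos_embed.items()})
```

Since the tables depend only on the grid size, they can be cached per stage instead of being rebuilt at every timestep.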

4. Training Protocols and Empirical Performance

Two principal regimes are established:

Training from Scratch: A SiT-B/2 backbone is used on 256×256 ImageNet z-latents. PPF-B-2 (4→2 patch size) and PPF-B-3 (4→2√2→2) are trained for 7M steps; PPF-B-2 uses 62.5% of the SiT-B/2 training FLOPs and PPF-B-3 uses 50.0%. Final FID-50K (250 steps): SiT-B/2: 4.46; PPF-B-2: 4.12 (62.5% FLOPs), improving to 3.83 (98.2% FLOPs, 11M steps); PPF-B-3: 4.71 (50.0% FLOPs), improving to 4.43 (78.5% FLOPs, 11M steps). Inference speedup (A100): 1.61× for PPF-B-2, 2.04× for PPF-B-3.
Finetuning from Pretrained DiT: PPF-B-2 and PPF-B-3 are finetuned from SiT-B/2, and PPF-XL-2 and PPF-XL-3 from SiT-XL/2, with only 1M additional steps (8–9% FLOPs). Inference FLOPs: PPF-B-2 at 62.0% (1.6×), PPF-B-3 at 49.1% (2.0×), PPF-XL-2 at 62.6% (2.02×), PPF-XL-3 at 49.4%. Best FID: PPF-XL-2 1.99 vs SiT-XL/2 2.15. Tables 1–3 and Figures 1, 4, 6 present detailed benchmarking (Li et al., 30 Jun 2025).
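A back-of-envelope estimate helps explain where a figure like 62.5% can come from. The sketch below assumes that per-step cost is proportional to token count (ignoring the quadratic attention term), that steps are spread uniformly over $t \in [0, 1]$, and that the 2-level breakpoint sits at 0.5; none of these assumptions is confirmed by the text, and `relative_cost` is a hypothetical helper.

```python
def relative_cost(breakpoints, patch_sizes, base_patch=2):
    """Average per-step cost relative to using base_patch at every step.

    Assumes cost scales with token count, i.e. with (base_patch / p) ** 2,
    and that steps are distributed uniformly over t in [0, 1].
    """
    edges = [0.0, *breakpoints, 1.0]
    return sum(
        (hi - lo) * (base_patch / p) ** 2
        for lo, hi, p in zip(edges[:-1], edges[1:], patch_sizes)
    )


# Assumed 2-level schedule: patch size 4 on [0, 0.5), patch size 2 on [0.5, 1].
cost_2_level = relative_cost([0.5], [4, 2])
print(cost_2_level)
```

Under these assumptions, half of the steps run at a quarter of the tokens, giving 0.5 · 0.25 + 0.5 · 1 = 0.625, which is in line with the reported 62.5% for the 2-level model.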

5. Sampling Algorithm

The PPFlow sampling process is as follows:

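Input: trained PPFlow model, patch-size schedule $p(t)$, number of sampling steps.
Initialize the full-resolution latent with Gaussian noise.
For each sampling step: look up the patch size for the current timestep; patchify the latent with the input projection for that patch size; run the shared DiT blocks on the resulting tokens; unpatchify with the matching output projection to obtain a full-resolution prediction; apply the DDPM/DDIM update.
Return the final latent.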
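A runnable toy version of such a loop is sketched below. It is not the authors' implementation: the transformer is replaced by a stub that returns zeros, the projections are random, the update is a deterministic DDIM-style step, the breakpoint 0.5 and the noise schedule are assumed values, and normalized time is taken to run from pure noise (0) to data (1).

```python
import numpy as np
from bisect import bisect_right


def to_patches(z, p):
    H, W, C = z.shape
    gh, gw = H // p, W // p
    return z.reshape(gh, p, gw, p, C).transpose(0, 2, 1, 3, 4).reshape(gh * gw, p * p * C)


def from_patches(flat, p, shape):
    H, W, C = shape
    gh, gw = H // p, W // p
    return flat.reshape(gh, gw, p, p, C).transpose(0, 2, 1, 3, 4).reshape(H, W, C)


def sample(model, w_in, w_out, breakpoints, patch_sizes, abar, shape, rng):
    """DDIM-style (eta = 0) loop with a stage-dependent patch size.

    abar[t] is the cumulative alpha at step t: abar[0] = 1 (clean data),
    abar[T] close to 0 (almost pure noise).
    """
    T = len(abar) - 1
    z = rng.standard_normal(shape)          # full-resolution latent
    trace = []
    for t in range(T, 0, -1):
        progress = (T - t) / T              # 0 at pure noise, grows toward 1
        p = patch_sizes[bisect_right(breakpoints, progress)]
        tokens = to_patches(z, p) @ w_in[p]             # patchify + project
        out = model(tokens, t)                          # shared blocks (stub)
        eps_hat = from_patches(out @ w_out[p], p, shape)
        z0_hat = (z - np.sqrt(1.0 - abar[t]) * eps_hat) / np.sqrt(abar[t])
        z = np.sqrt(abar[t - 1]) * z0_hat + np.sqrt(1.0 - abar[t - 1]) * eps_hat
        trace.append((p, tokens.shape[0]))
    return z, trace


def stub_model(tokens, t):
    """Placeholder for the shared transformer; always predicts zeros."""
    return np.zeros_like(tokens)


rng = np.random.default_rng(0)
shape, D = (32, 32, 4), 64                  # assumed latent shape and width
w_in = {p: 0.02 * rng.standard_normal((p * p * 4, D)) for p in (4, 2)}
w_out = {p: 0.02 * rng.standard_normal((D, p * p * 4)) for p in (4, 2)}
abar = np.linspace(1.0, 0.01, 9)            # assumed schedule, T = 8
z, trace = sample(stub_model, w_in, w_out, [0.5], [4, 2], abar, shape, rng)
print(trace)
```

The recorded trace shows the sequence length switching from 64 to 256 tokens at the stage boundary while the latent `z` keeps its full shape throughout, which is the property that removes the need for renoising between stages.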

6. Ablation Findings and Comparative Analyses

Experiments demonstrate:

The number of patchification levels ($L$) trades speed for fidelity: 2-level yields 1.6× speedup (FID ≈ 3.8–4.1); 3-level yields 2.0× (FID ≈ 4.4–4.7).
Adding patch-level embeddings improves FID by 0.8 (22.56→21.73 at 400k steps).
Applying stage-wise classifier-free guidance (CFG) further reduces FID (21.73→13.13 at 400k steps).
Compared to pyramid-representation flow [23], PPFlow achieves lower FID and crisper images without the need for renoising scheduling (Table 3, Figure 1).

7. Advantages, Constraints, and Prospects

PPFlow yields up to 2.0× inference acceleration while matching or improving metrics such as FID, IS, sFID, and precision/recall against conventional DiTs. Because it maintains full-resolution latents and modifies only the input/output projections, DiT internals remain unaltered, facilitating application in both from-scratch and finetuning workflows. No jump points or auxiliary renoising are necessary.

Current limitations include restriction to class-conditional ImageNet; extension to text-to-image or other modalities has not been investigated. Patchification schedules are fixed; learned or adaptive scheduling may offer further benefits. Additional stages ($L > 3$) or non-uniform grouping could push speedups further but introduce challenges in token sequence management.

Potential extensions include combining PPFlow with step-reduction procedures (DDIM, distillation), end-to-end learned stage breakpoints, or extending to video and 3D diffusion transformers where token explosion is prevalent (Li et al., 30 Jun 2025).

References (1)

Pyramidal Patchification Flow for Visual Generation (2025)
