Pyramid Patch Transformer (PPT)

Updated 9 June 2026

Pyramid Patch Transformer is a fully Transformer-based architecture that fuses fine-grained pixel details with global context through a multi-scale pyramid design.
It leverages a modular approach by applying a lightweight patch-based Transformer and a pyramid aggregation mechanism to capture both local textures and large-scale structures.
PPT achieves state-of-the-art performance in image fusion tasks across various benchmarks without requiring retraining for different fusion applications.

The Pyramid Patch Transformer (PPT) is a fully Transformer-based encoder–decoder architecture designed for low-level vision tasks, with specific emphasis on image fusion. PPT was introduced to address the challenge of both preserving fine-grained pixel-level details and capturing long-range, non-local context, two properties traditionally divided between convolutional neural networks (CNNs) and Vision Transformers (ViT). Crucially, PPT achieves this synthesis with an entirely convolution-free design that is trainable on moderate-scale datasets and operates “off the shelf” for diverse image fusion tasks without retraining or redesign (Fu et al., 2021).

1. Architectural Motivation and Overview

Traditional CNNs use local filters to extract texture and edge structure but require deep stacks or large kernels to model global dependencies. ViT architectures, though proficient at global context via patch-level self-attention, typically discard detail within patches and demand pretraining on massive corpora (e.g., JFT-300M) for good generalization. PPT resolves these issues by (a) applying a lightweight Transformer inside every moderately sized image patch for full pixel-to-pixel relation modeling, and (b) aggregating patch-level features across multiple scales to capture both local and global information.

PPT comprises three sequential modules:

Patch Transformer: Splits the input into $p \times p$ patches, applies an L-layer Transformer in each, then reassembles the outputs.
Pyramid Transformer: Builds a multi-scale image pyramid, applies the Patch Transformer to each scale, upsamples, and concatenates features.
Decoder: A lightweight MLP reconstructs the image from concatenated multi-scale features, trained with mean squared error (MSE) loss (Fu et al., 2021).

2. Patch Transformer Structure

Let $X \in \mathbb{R}^{H \times W \times 1}$ denote a single-channel input. The image is split into $h \times w$ non-overlapping patches ($h=H/p$, $w=W/p$, typically $p=32$). Patches are reshaped and embedded:

Tile: $Y_0 = T2P(X) \in \mathbb{R}^{h \times w \times p \times p}$
Flatten: $Y_1 = \text{reshape}(Y_0, [h, w, p^2])$
Embed: $Y = \text{MLP}(Y_1) \in \mathbb{R}^{h \times w \times p^2 \times C}$
Positional encoding: add $Pos \in \mathbb{R}^{h \times w \times p^2 \times C}$
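
The tiling, flattening, and embedding steps above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function names `tile_to_patches` and `embed_patches` are hypothetical, and a single per-pixel linear layer stands in for the embedding MLP, whose exact form is an assumption here.

```python
import numpy as np

def tile_to_patches(x: np.ndarray, p: int) -> np.ndarray:
    """Split an (H, W) image into non-overlapping p x p patches -> (h, w, p, p)."""
    H, W = x.shape
    if H % p or W % p:
        raise ValueError("H and W must be multiples of p")
    h, w = H // p, W // p
    # (H, W) -> (h, p, w, p) -> (h, w, p, p)
    return x.reshape(h, p, w, p).transpose(0, 2, 1, 3)

def embed_patches(x, p, w_embed, b_embed, pos):
    """Tile, flatten, embed each pixel to C channels, and add positions.

    w_embed, b_embed: shape (C,), a single linear layer (assumed stand-in
    for the embedding MLP). pos: shape (h, w, p*p, C).
    """
    y0 = tile_to_patches(x, p)              # (h, w, p, p)
    h, w = y0.shape[:2]
    y1 = y0.reshape(h, w, p * p)            # (h, w, p^2)
    y = y1[..., None] * w_embed + b_embed   # (h, w, p^2, C)
    return y + pos

rng = np.random.default_rng(0)
H, W, p, C = 64, 64, 32, 8                  # illustrative sizes only
x = rng.standard_normal((H, W))
pos = 0.02 * rng.standard_normal((H // p, W // p, p * p, C))
y = embed_patches(x, p, rng.standard_normal(C), np.zeros(C), pos)
print(y.shape)  # (2, 2, 1024, 8)
```

Each patch becomes a sequence of $p^2$ tokens, one per pixel, which is what lets the per-patch Transformer relate every pixel to every other pixel in the patch.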

A small Transformer (typically 2–4 layers) is applied independently to each patch $Y_{i,j} \in \mathbb{R}^{p^2 \times C}$:

Multi-head self-attention and MLP, with weights shared across all patches.
After the Transformer blocks, the per-patch outputs are reassembled into a full-resolution feature map via a learned linear pooling or reshaping.
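
As a rough sketch of how attention can run independently inside each patch with shared weights, the snippet below applies single-head self-attention over the token axis of a `(h, w, p*p, C)` tensor and then reassembles the patches by reshaping. It is simplified on purpose: multi-head splitting, layer normalization, and the MLP sub-layer are omitted, and the names `patch_self_attention` and `patches_to_image` are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def patch_self_attention(y, Wq, Wk, Wv):
    """Single-head self-attention within each patch, plus a residual.

    y: (h, w, N, C) with N = p*p tokens per patch.
    Wq, Wk, Wv: (C, C) projections shared by every patch.
    """
    q, k, v = y @ Wq, y @ Wk, y @ Wv                        # each (h, w, N, C)
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(y.shape[-1])  # (h, w, N, N)
    return y + softmax(scores, axis=-1) @ v

def patches_to_image(z, p):
    """Reassemble (h, w, p*p, C) patch tokens into an (H, W, C) map."""
    h, w, _, C = z.shape
    return z.reshape(h, w, p, p, C).transpose(0, 2, 1, 3, 4).reshape(h * p, w * p, C)

rng = np.random.default_rng(0)
y = rng.standard_normal((2, 2, 16, 3))        # 2x2 patches, p=4, C=3
Wq, Wk, Wv = (rng.standard_normal((3, 3)) for _ in range(3))
out = patches_to_image(patch_self_attention(y, Wq, Wk, Wv), p=4)
print(out.shape)  # (8, 8, 3)
```

Because the attention matrix is computed per patch, tokens in one patch never attend to tokens in another; cross-patch context has to come from the pyramid stage.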

Per-patch attention is tractable because each attention matrix spans only the $p^2$ tokens of one patch; total memory therefore scales linearly with the number of patches.
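
To make the scaling concrete, the short calculation below counts attention-matrix entries for per-patch attention versus a hypothetical global attention over every pixel. The $256 \times 256$ image size is an assumption chosen only for illustration; $p=32$ is the typical patch size mentioned earlier.

```python
def patch_attention_entries(H: int, W: int, p: int) -> int:
    """Attention entries when each p x p patch attends only within itself."""
    n_patches = (H // p) * (W // p)
    return n_patches * (p * p) ** 2

def global_attention_entries(H: int, W: int) -> int:
    """Attention entries if every pixel attended to every other pixel."""
    return (H * W) ** 2

local = patch_attention_entries(256, 256, 32)
full = global_attention_entries(256, 256)
print(local)          # 67108864
print(full)           # 4294967296
print(full // local)  # 64
```

The ratio equals the number of patches, $HW/p^2$, so doubling the image area doubles the per-patch attention cost rather than quadrupling it.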

3. Pyramid Transformer and Multi-Scale Feature Synthesis

To capture global context, a multi-scale pyramid is constructed:

The image is downsampled to produce one version per pyramid level $l$ (levels 0 to 3 in the reported configuration).
At each level, the Patch Transformer is applied, yielding a feature map $F_l$, which is then upsampled to the original $H \times W$ size.
Concatenation along the channel axis provides a multi-scale representation:

$F = \text{Concat}\big(\text{Up}(F_0), \text{Up}(F_1), \text{Up}(F_2), \text{Up}(F_3)\big)$

Receptive fields grow with pyramid level: the deepest provides near-global context, while shallow levels focus on local detail. This aggregation ensures that both fine structures and large-scale objects are well represented.
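
The pyramid aggregation can be sketched as follows. This is an illustrative outline under stated assumptions: the downsampling factor of $2^l$ per level is assumed rather than taken from the source, nearest-neighbour resampling is used for both directions, and `encode` is a hypothetical stand-in for the Patch Transformer that maps an `(H, W)` image to an `(H, W, C)` feature map.

```python
import numpy as np

def downsample_nearest(x, factor):
    """Nearest-neighbour downsampling of an (H, W) image."""
    return x[::factor, ::factor]

def upsample_nearest(f, factor):
    """Nearest-neighbour upsampling of an (H, W, C) feature map."""
    return f.repeat(factor, axis=0).repeat(factor, axis=1)

def pyramid_features(x, encode, levels=4):
    """Encode each pyramid level, upsample, and concatenate channels."""
    H, W = x.shape
    if H % 2 ** (levels - 1) or W % 2 ** (levels - 1):
        raise ValueError("H and W must be divisible by 2**(levels - 1)")
    feats = []
    for level in range(levels):
        factor = 2 ** level                  # assumed factor per level
        f = encode(downsample_nearest(x, factor))
        feats.append(upsample_nearest(f, factor))
    return np.concatenate(feats, axis=-1)    # (H, W, levels * C)

# Toy encoder with C = 2 channels, used only to exercise the shapes.
toy_encode = lambda im: np.stack([im, im ** 2], axis=-1)
x = np.arange(256, dtype=float).reshape(16, 16)
print(pyramid_features(x, toy_encode).shape)  # (16, 16, 8)
```

With a fixed patch size, a patch at a coarser level covers a larger region of the original image, which is how the deepest level approaches global context.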

4. Decoder and Training Methodology

The decoder is a lightweight pixel-wise MLP:

Two fully connected layers per pixel, with GELU nonlinearity and final Tanh activation.
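Equation: $\hat{X} = \text{Tanh}\big(\text{FC}_2(\text{GELU}(\text{FC}_1(F)))\big)$, where $F$ is the concatenated multi-scale feature map and $\hat{X}$ is the reconstructed image (Equation 10 in Fu et al., 2021)
End-to-end training uses a mean squared error (MSE) reconstruction loss over generic images.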
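
A pixel-wise decoder of this shape, together with the MSE objective, might look like the sketch below. The weight shapes, the hidden width, and the function names are assumptions for illustration, and GELU is computed with the common tanh approximation rather than the exact erf form.

```python
import numpy as np

def gelu(x):
    """GELU via the tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def decode(F, W1, b1, W2, b2):
    """Two fully connected layers applied independently at every pixel.

    F: (H, W, K) multi-scale features; W1: (K, D); W2: (D, 1).
    """
    hidden = gelu(F @ W1 + b1)          # (H, W, D)
    return np.tanh(hidden @ W2 + b2)    # (H, W, 1), values in [-1, 1]

def mse_loss(x_hat, x):
    """Mean squared reconstruction error."""
    return float(np.mean((x_hat - x) ** 2))

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 4, 6))      # toy feature map, K = 6
W1, b1 = rng.standard_normal((6, 5)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 1)), np.zeros(1)
x_hat = decode(F, W1, b1, W2, b2)
target = rng.uniform(-1.0, 1.0, size=(4, 4, 1))
print(x_hat.shape, mse_loss(x_hat, target) >= 0.0)
```

Because the final Tanh bounds outputs to $[-1, 1]$, training targets would need to be scaled to that range; that normalization is an assumption of this sketch.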

PPT is trained on MS-COCO (82k images) and ImageNet-1K using the Adam optimizer (batch size 1, 50 epochs, single 8 GB GPU). The patch size is typically $p=32$ and the per-patch Transformer has 2–4 layers; the learning rate, input size, embedding dimension, and decoder hidden dimension follow Fu et al. (2021).

5. Application to Image Fusion

Once trained, PPT weights are frozen for all downstream fusion tasks. For 2-source fusion (e.g., IR + visible), each input is encoded independently:

Encoding: $F_A = \text{Enc}(X_A)$, $F_B = \text{Enc}(X_B)$
Feature maps are fused with simple pixel-level operations:
- Average: $F_f = (F_A + F_B)/2$
- Maximum: $F_f = \max(F_A, F_B)$ (elementwise)
- Softmax fusion: $F_f = w_A F_A + w_B F_B$, where $w_A, w_B$ are softmax-normalized per-element
Decoding: $\hat{X} = \text{Dec}(F_f)$
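
The three fusion rules listed above are simple enough to sketch directly. In this illustration the softmax weights are computed from the feature values themselves, which is an assumption (the source only states that the weights are softmax-normalized per element), and the function names are hypothetical.

```python
import numpy as np

def fuse_average(fa, fb):
    """Elementwise mean of two feature maps."""
    return 0.5 * (fa + fb)

def fuse_max(fa, fb):
    """Elementwise maximum of two feature maps."""
    return np.maximum(fa, fb)

def fuse_softmax(fa, fb):
    """Weighted sum with per-element softmax weights.

    Assumption: the features themselves serve as the softmax logits.
    """
    m = np.maximum(fa, fb)                    # shift for numerical stability
    ea, eb = np.exp(fa - m), np.exp(fb - m)
    wa = ea / (ea + eb)
    return wa * fa + (1.0 - wa) * fb

fa = np.array([1.0, -2.0, 0.5])
fb = np.array([3.0, -2.0, 0.0])
print(fuse_average(fa, fb))   # [ 2.   -2.    0.25]
print(fuse_max(fa, fb))       # [ 3.  -2.   0.5]
print(fuse_softmax(fa, fb))
```

Under this choice of logits, softmax fusion always lies between the average and the maximum, since the larger feature receives a weight of at least one half.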

No fine-tuning or retraining is required. PPT fusion has been evaluated and achieved top performance across a range of benchmarks:

IR+VIS fusion (TNO): SCD↑=1.6261 (1st), SSIM↑=0.7568 (2nd), Nabf↓=0.0005 (1st), Q_S↑=0.7945
Multi-focus fusion (Lytro): SD↑=60.64 (1st), EN↑=7.525 (2nd), SSIM↑=0.9182, CC↑=0.9871 (1st)
Medical (Harvard MRI+PET): SCD↑=0.9661, VIF↑=0.3236 (1st), CC↑=0.8875
Multi-exposure HDR: Q_MI↑=1.0360, Q_NCIE↑=0.8290, CC↑=0.7190

Across all tasks, PPT outperforms or matches the best results on key metrics (SCD, SSIM, FMI, Nabf, CC, EN, VIF, MI) compared to classical and recent CNN/GAN-based fusion networks (Fu et al., 2021).

Fusion Task	Top PPT Metric(s)	Rank (Among Methods)
IR+VIS (TNO)	SCD (1.6261), Nabf (0.0005)	1st
Multi-focus (Lytro)	SD (60.64), CC (0.9871)	1st
Medical (Harvard MRI+PET)	VIF (0.3236)	1st
Multi-exposure HDR	Q_MI (1.0360)	1st

6. Hyperparameters and Implementation Considerations

Input size: fixed during training, as specified in Fu et al. (2021) (the model can be applied patch-wise to larger images)
Patch size: $p \times p$, typically $p=32$ (implying $(H/p) \times (W/p)$ patches for an $H \times W$ input)
Number of pyramid levels: 4 (i.e., levels 0, 1, 2, 3)
Embedding dimension: $C$, with the values used given in Fu et al. (2021)
Transformer depth per patch: typically 2–4 layers
Fully learnable positional embeddings at the patch-token level
Down/Up sampling: Bilinear or nearest-neighbor
Activation functions: GELU in MLPs; Tanh at decoder output

Training is performed on general datasets with no fusion pairs, and the model is convolution-free, requiring no patch-wise convolutional priors or architectural amendments for new tasks.

7. Characteristics Enabling State-of-the-Art Performance

Several design properties underlie PPT's performance on image fusion tasks:

Preservation of Local Detail: Each patch-wise Transformer directly models every pixel–pixel correlation within its region, supporting edge, texture, and contour retention.
Global Contextual Reasoning: The pyramid structure allows features at progressively broader receptive fields, up to the entire image.
Data Efficiency: No ultra-large pretraining corpora are needed—MS-COCO and ImageNet-1K suffice for strong generalization.
Modularity: The encoder can be reapplied in a Siamese manner to arbitrary multi-source inputs, and features are aligned for simple, direct fusion.
Convolution-Free Paradigm: The design avoids the locality biases and depth constraints of CNNs, utilizing only Transformer mechanisms for the entire signal path.

PPT reports state-of-the-art results across the image fusion tasks evaluated, ranking first or second on metrics such as SCD, SSIM, Nabf, CC, VIF, and Q_MI on the respective benchmarks, all using a single, frozen network (Fu et al., 2021).

References (1)

Fu et al. (2021). PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion.
