Pyramid Patch Transformer (PPT)

Updated 9 June 2026
  • Pyramid Patch Transformer is a fully Transformer-based architecture that fuses fine-grained pixel details with global context through a multi-scale pyramid design.
  • It leverages a modular approach by applying a lightweight patch-based Transformer and a pyramid aggregation mechanism to capture both local textures and large-scale structures.
  • PPT achieves state-of-the-art performance in image fusion tasks across various benchmarks without requiring retraining for different fusion applications.

The Pyramid Patch Transformer (PPT) is a fully Transformer-based encoder–decoder architecture designed for low-level vision tasks, with specific emphasis on image fusion. PPT was introduced to address the challenge of both preserving fine-grained pixel-level details and capturing long-range, non-local context, two properties traditionally divided between convolutional neural networks (CNNs) and Vision Transformers (ViTs). PPT achieves this with an entirely convolution-free design that is trainable on moderate-scale datasets and operates “off the shelf” for diverse image fusion tasks without retraining or redesign (Fu et al., 2021).

1. Architectural Motivation and Overview

Traditional CNNs use local filters to extract texture and edge structure but need deep stacks or large kernels to model global dependencies. ViT architectures, though proficient at global context via patch-level self-attention, typically discard detail within patches and demand pretraining on massive corpora (e.g., JFT-300M) to generalize well. PPT addresses both issues by (a) applying a lightweight Transformer inside every moderately sized image patch to model full pixel-to-pixel relations, and (b) aggregating patch-level features across multiple scales to capture both local and global information.

PPT comprises three sequential modules:

  • Patch Transformer: Splits the input into $p \times p$ patches, applies an L-layer Transformer in each, then reassembles the outputs.
  • Pyramid Transformer: Builds a multi-scale image pyramid, applies the Patch Transformer to each scale, upsamples, and concatenates features.
  • Decoder: A lightweight MLP reconstructs the image from concatenated multi-scale features, trained with mean squared error (MSE) loss (Fu et al., 2021).

2. Patch Transformer Structure

Let $X \in \mathbb{R}^{H \times W \times 1}$ denote a single-channel input. The image is split into $h \times w$ non-overlapping patches ($h = H/p$, $w = W/p$, typically $p = 32$). Patches are reshaped and embedded:

  • Tile: $Y_0 = T2P(X) \in \mathbb{R}^{h \times w \times p \times p}$
  • Flatten: $Y_1 = \text{reshape}(Y_0, [h, w, p^2])$
  • Embed: $Y = \text{MLP}(Y_1) \in \mathbb{R}^{h \times w \times p^2 \times C}$
  • Positional encoding: add $Pos \in \mathbb{R}^{h \times w \times p^2 \times C}$
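
The tile and flatten steps can be sketched in plain Python. This is only an illustration of the index arithmetic, not the authors' code: the function names `tile_to_patches` and `patches_to_tile` are hypothetical, the image is a nested list rather than a tensor, and the embedding MLP and positional encoding are left out.

```python
def tile_to_patches(image, p):
    """Split an H x W image (list of rows) into an h x w grid of flattened patches."""
    H, W = len(image), len(image[0])
    if H % p or W % p:
        raise ValueError("H and W must be divisible by the patch size p")
    h, w = H // p, W // p
    grid = []
    for i in range(h):
        row = []
        for j in range(w):
            # Flatten patch (i, j) in row-major order into p * p pixel values.
            patch = [image[i * p + r][j * p + c] for r in range(p) for c in range(p)]
            row.append(patch)
        grid.append(row)
    return grid


def patches_to_tile(grid, p):
    """Inverse of tile_to_patches: reassemble the h x w grid into an H x W image."""
    h, w = len(grid), len(grid[0])
    image = [[0] * (w * p) for _ in range(h * p)]
    for i in range(h):
        for j in range(w):
            for k, value in enumerate(grid[i][j]):
                image[i * p + k // p][j * p + k % p] = value
    return image


# Toy 4 x 4 image with p = 2, so h = w = 2 and each patch holds p^2 = 4 pixels.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = tile_to_patches(image, 2)
restored = patches_to_tile(patches, 2)
```

Because the patches do not overlap, tiling is lossless: reassembling the grid returns the original image.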

A small Transformer (typically 2–4 layers) is applied independently to each patch, that is, to each sequence of $p^2$ tokens of dimension $C$:

  • Each layer consists of multi-head self-attention followed by an MLP, with weights shared across all patches.
  • After the Transformer blocks, the per-patch outputs are reassembled into a single feature map via a learned linear pooling or reshaping.
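
A minimal sketch of weight sharing across patches is given below. It makes several simplifying assumptions that are not in the source: a single attention head instead of multi-head attention, no residual connections, no normalization, and no MLP sub-layer. The toy weight matrices and the function names are hypothetical.

```python
import math


def project(matrix, token):
    """Multiply a weight matrix (list of rows) by one token vector."""
    return [sum(w * x for w, x in zip(row, token)) for row in matrix]


def self_attention(tokens, wq, wk, wv):
    """Single-head scaled dot-product self-attention over one patch's tokens."""
    queries = [project(wq, t) for t in tokens]
    keys = [project(wk, t) for t in tokens]
    values = [project(wv, t) for t in tokens]
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)  # subtract the max before exp for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def patch_attention(patches, wq, wk, wv):
    """Apply the same attention weights to every patch independently."""
    return [self_attention(tokens, wq, wk, wv) for tokens in patches]


# Hypothetical toy setup: 2 patches, 3 tokens per patch, embedding size C = 2.
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[0.5, 0.5], [-0.5, 0.5]]
wv = [[1.0, 1.0], [0.0, 2.0]]
patch_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
patch_b = [[2.0, -1.0], [0.5, 0.5], [0.0, 0.0]]
out = patch_attention([patch_a, patch_b], wq, wk, wv)
out_same = patch_attention([patch_a, patch_a], wq, wk, wv)
```

Since tokens attend only within their own patch, the output for `patch_a` does not depend on what the other patch contains.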

Per-patch attention stays tractable because each attention matrix covers only the $p^2$ tokens of one patch; memory therefore scales linearly with the number of patches rather than quadratically with the number of pixels.
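
The scaling argument can be checked with a short calculation. The sketch below counts attention score-matrix entries only (per head, per layer) and assumes a hypothetical 256 x 256 input together with the patch size of 32 mentioned above; the function names are illustrative.

```python
def attention_entries(num_tokens):
    """Number of entries in one self-attention score matrix (tokens x tokens)."""
    return num_tokens * num_tokens


def per_patch_cost(H, W, p):
    """Total score entries when attention runs separately inside each p x p patch."""
    num_patches = (H // p) * (W // p)
    return num_patches * attention_entries(p * p)


def global_pixel_cost(H, W):
    """Score entries if every pixel attended to every other pixel in the image."""
    return attention_entries(H * W)


# ASSUMPTION: a 256 x 256 input; p = 32 is the typical patch size from the text.
H = W = 256
p = 32
patch_total = per_patch_cost(H, W, p)
global_total = global_pixel_cost(H, W)
ratio = global_total // patch_total

# Doubling each image side quadruples the patch count and, with it, the cost.
larger_total = per_patch_cost(2 * H, 2 * W, p)
```

Under these assumptions the saving over pixel-level global attention equals the number of patches (64 here), and the per-patch total grows in direct proportion to the patch count.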

3. Pyramid Transformer and Multi-Scale Feature Synthesis

To capture global context, a multi-scale pyramid is constructed:

  • The image is downsampled by a level-dependent factor to form the pyramid levels (levels 0, 1, 2, 3 in the configuration of Section 6).
  • At each level, the Patch Transformer is applied, and the resulting features are upsampled to the original size.
  • The upsampled features from all levels are concatenated along the channel axis to form the multi-scale representation.

Receptive fields grow with pyramid level: the deepest provides near-global context, while shallow levels focus on local detail. This aggregation ensures that both fine structures and large-scale objects are well represented.
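
The growth of the receptive field, and the upsample-then-concatenate step, can be illustrated as follows. The sketch assumes that level k is downsampled by a factor of 2 to the power k, which is a common pyramid convention but is not confirmed by the text here; nearest-neighbor upsampling is one of the two options listed in Section 6. All function names are hypothetical.

```python
def level_coverage(p, num_levels, factor=2):
    """Side length, in original pixels, that one p x p patch spans at each level.

    ASSUMPTION: level k is downsampled by factor ** k.
    """
    return [p * factor ** k for k in range(num_levels)]


def upsample_nearest(feature, scale):
    """Nearest-neighbor upsampling of a 2-D list by an integer factor."""
    rows, cols = len(feature) * scale, len(feature[0]) * scale
    return [[feature[r // scale][c // scale] for c in range(cols)] for r in range(rows)]


def concat_channels(maps):
    """Stack same-sized 2-D maps into one map of per-pixel channel lists."""
    H, W = len(maps[0]), len(maps[0][0])
    return [[[m[r][c] for m in maps] for c in range(W)] for r in range(H)]


# Patch size 32 and four levels follow the text; the factor of 2 is assumed.
coverage = level_coverage(32, 4)

# Toy single-channel features: a 2 x 2 map at level 0 and a 1 x 1 map at level 1.
level0 = [[1, 2], [3, 4]]
level1 = [[9]]
multi_scale = concat_channels([level0, upsample_nearest(level1, 2)])
```

With these assumed numbers, a single patch at level 3 spans 256 original pixels per side, so it would cover a 256 x 256 image entirely, while a level-0 patch still sees only a 32 x 32 region.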

4. Decoder and Training Methodology

The decoder is a lightweight pixel-wise MLP:

  • Two fully connected layers per pixel, with GELU nonlinearity and final Tanh activation.
  • Composition: Tanh(FC(GELU(FC(·)))), applied to the concatenated multi-scale features at each pixel (Equation 10 in Fu et al., 2021).
  • End-to-end training uses a mean squared error (MSE) reconstruction loss over generic images.
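
A per-pixel decoder of this shape can be sketched with the standard library alone. The layer sizes and weights below are hypothetical toy values chosen for readability, and the exact (erf-based) GELU is assumed; the source does not say which GELU variant is used.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(vec, weights, bias):
    """Fully connected layer; weights is a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def decode_pixel(features, w1, b1, w2, b2):
    """Two fully connected layers with GELU in between and Tanh on the output."""
    hidden = [gelu(x) for x in linear(features, w1, b1)]
    return [math.tanh(x) for x in linear(hidden, w2, b2)]


def mse(pred, target):
    """Mean squared error between two equal-length lists."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


# Hypothetical toy sizes: 3 feature channels -> 2 hidden units -> 1 output value.
w1 = [[0.5, -0.25, 0.0], [0.0, 0.5, 0.5]]
b1 = [0.0, 0.0]
w2 = [[1.0, -1.0]]
b2 = [0.0]
pixel_features = [1.0, 2.0, -1.0]
output = decode_pixel(pixel_features, w1, b1, w2, b2)
loss = mse(output, [0.0])
```

The final Tanh bounds every reconstructed value to the open interval (-1, 1), which suggests inputs normalized to that range, although the source does not state the normalization.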

PPT is trained on MS-COCO (82k images) and ImageNet-1K with the Adam optimizer (batch size 1, 50 epochs, single 8 GB GPU). Training also fixes the input size, patch size (typically $p = 32$), embedding dimension $C$, Transformer depth (typically 2–4 layers), and decoder hidden dimension.

5. Application to Image Fusion

Once trained, PPT weights are frozen for all downstream fusion tasks. For 2-source fusion (e.g., IR + visible), each input is encoded independently:

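  • Encoding: each source image is passed through the same frozen encoder, producing one feature map per source.
  • Feature maps fused with simple pixel-level operations:
    • Average: the elementwise mean of the two feature maps
    • Maximum: the elementwise maximum of the two feature maps
    • Softmax fusion: a weighted sum of the two feature maps, where the weights are softmax-normalized per element
  • Decoding: the fused feature map is passed through the decoder to produce the fused image.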
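
The three fusion rules can be sketched on flattened feature maps as shown below. One detail is an assumption: the softmax weights are computed here from the two feature values themselves at each element, since the source does not say what the weights are derived from. The function names and toy values are hypothetical.

```python
import math


def fuse_average(f1, f2):
    """Elementwise mean of two feature maps."""
    return [(a + b) / 2.0 for a, b in zip(f1, f2)]


def fuse_maximum(f1, f2):
    """Elementwise maximum of two feature maps."""
    return [max(a, b) for a, b in zip(f1, f2)]


def fuse_softmax(f1, f2):
    """Weighted sum with per-element softmax weights.

    ASSUMPTION: the weights are a softmax over the two feature values.
    """
    fused = []
    for a, b in zip(f1, f2):
        top = max(a, b)  # subtract the max before exp for numerical stability
        ea, eb = math.exp(a - top), math.exp(b - top)
        wa, wb = ea / (ea + eb), eb / (ea + eb)
        fused.append(wa * a + wb * b)
    return fused


# Toy flattened features from two sources, e.g. infrared and visible.
f_ir = [0.2, 0.8, -0.4]
f_vis = [0.6, 0.8, 0.4]
avg = fuse_average(f_ir, f_vis)
mx = fuse_maximum(f_ir, f_vis)
soft = fuse_softmax(f_ir, f_vis)
```

Under this assumed weighting, softmax fusion lies between the average and the maximum at every element, and all three rules agree wherever the two sources have equal features.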

No fine-tuning or retraining is required. Reported results for PPT fusion, with the rank among compared methods where given:

  • IR+VIS fusion (TNO): SCD↑=1.6261 (1st), SSIM↑=0.7568 (2nd), Nabf↓=0.0005 (1st), Q_S↑=0.7945
  • Multi-focus fusion (Lytro): SD↑=60.64 (1st), EN↑=7.525 (2nd), SSIM↑=0.9182, CC↑=0.9871 (1st)
  • Medical (Harvard MRI+PET): SCD↑=0.9661, VIF↑=0.3236 (1st), CC↑=0.8875
  • Multi-exposure HDR: Q_MI↑=1.0360, Q_NCIE↑=0.8290, CC↑=0.7190

Across all tasks, PPT outperforms or matches the best results on key metrics (SCD, SSIM, FMI, Nabf, CC, EN, VIF, MI) compared to classical and recent CNN/GAN-based fusion networks (Fu et al., 2021).

Fusion Task               | Top PPT Metric(s)           | Rank (Among Methods)
IR+VIS (TNO)              | SCD (1.6261), Nabf (0.0005) | 1st
Multi-focus (Lytro)       | SD (60.64), CC (0.9871)     | 1st
Medical (Harvard MRI+PET) | VIF (0.3236)                | 1st
Multi-exposure HDR        | Q_MI (1.0360)               | 1st

6. Hyperparameters and Implementation Considerations

  • Input size: a fixed training resolution (can be applied patch-wise to larger images)
  • Patch size: typically $p = 32$, giving $(H/p) \times (W/p)$ patches at the full-resolution scale
  • Number of pyramid levels: four (levels 0, 1, 2, 3)
  • Embedding dimension: $C$ (two settings are reported)
  • Transformer depth per patch: typically 2–4 layers
  • Fully learnable positional embeddings at the patch-token level
  • Down/Up sampling: Bilinear or nearest-neighbor
  • Activation functions: GELU in MLPs; Tanh at decoder output

Training is performed on general datasets with no fusion pairs, and the model is convolution-free, requiring no patch-wise convolutional priors or architectural amendments for new tasks.

7. Characteristics Enabling State-of-the-Art Performance

Several mechanisms underlie PPT's effectiveness on image fusion tasks:

  • Preservation of Local Detail: Each patch-wise Transformer directly models every pixel–pixel correlation within its region, supporting edge, texture, and contour retention.
  • Global Contextual Reasoning: The pyramid structure yields features with progressively broader receptive fields, up to the entire image.
  • Data Efficiency: No ultra-large pretraining corpora are needed; MS-COCO and ImageNet-1K suffice for strong generalization.
  • Modularity: The encoder can be reapplied in a Siamese manner to arbitrary multi-source inputs, and features are aligned for simple, direct fusion.
  • Convolution-Free Paradigm: The design avoids the locality biases and depth constraints of CNNs, utilizing only Transformer mechanisms for the entire signal path.

According to Fu et al. (2021), PPT sets new state-of-the-art results in five image fusion tasks, ranking at or near the top on structural similarity, mutual information, correlation, and artificial-noise metrics (see the per-benchmark ranks in Section 5), all with a single frozen network.
