Pyramid Patch Transformer (PPT)
- Pyramid Patch Transformer is a fully Transformer-based architecture that fuses fine-grained pixel details with global context through a multi-scale pyramid design.
- It leverages a modular approach by applying a lightweight patch-based Transformer and a pyramid aggregation mechanism to capture both local textures and large-scale structures.
- PPT achieves state-of-the-art performance in image fusion tasks across various benchmarks without requiring retraining for different fusion applications.
The Pyramid Patch Transformer (PPT) is a fully Transformer-based encoder–decoder architecture designed for low-level vision tasks, with specific emphasis on image fusion. PPT was introduced to address the challenge of both preserving fine-grained pixel-level detail and capturing long-range, non-local context, two properties traditionally divided between convolutional neural networks (CNNs) and Vision Transformers (ViT). Crucially, PPT achieves this synthesis with an entirely convolution-free design that is trainable on moderate-scale datasets and operates “off the shelf” for diverse image fusion tasks without retraining or redesign (Fu et al., 2021).
1. Architectural Motivation and Overview
Traditional CNNs use local filters to extract texture and edge structure but require deep stacks or large kernels to model global dependencies. ViT architectures, though proficient at global context via patch-level self-attention, typically discard detail within patches and demand pretraining on massive corpora (e.g., JFT-300M) for good generalization. PPT addresses both issues by (a) applying a lightweight Transformer inside every moderately sized image patch for full pixel-to-pixel relation modeling, and (b) aggregating patch-level features across multiple scales to capture both local and global information.
PPT comprises three sequential modules:
- Patch Transformer: Splits the input into patches, applies an L-layer Transformer in each, then reassembles the outputs.
- Pyramid Transformer: Builds a multi-scale image pyramid, applies the Patch Transformer to each scale, upsamples, and concatenates features.
- Decoder: A lightweight MLP reconstructs the image from concatenated multi-scale features, trained with mean squared error (MSE) loss (Fu et al., 2021).
2. Patch Transformer Structure
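Let $I \in \mathbb{R}^{H \times W}$ denote a single-channel input. The image is split into $N$ non-overlapping patches of size $P \times P$, with $N = HW / P^2$. Patches are reshaped and embedded:
- Tile: partition $I$ into patches $x_1, \dots, x_N$, each in $\mathbb{R}^{P \times P}$.
- Flatten: reshape each patch into a sequence of $P^2$ pixel tokens.
- Embed: map each token to a $D$-dimensional embedding with a learned projection.
- Positional encoding: add a learnable positional embedding to every token of the sequence.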
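The tiling, flattening, and embedding steps can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' code: the helper names `tile_patches` and `embed_tokens`, the toy sizes, and the use of a single linear projection per scalar pixel are all assumptions made for the example.

```python
import numpy as np

def tile_patches(image, patch):
    """Split an (H, W) image into non-overlapping patch x patch tiles.

    Returns an array of shape (N, patch*patch): one row of pixel tokens per tile,
    with tiles ordered row-major over the patch grid.
    """
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    tiles = image.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(gh * gw, patch * patch)

def embed_tokens(tokens, weight, bias, pos):
    """Project each scalar pixel token to D dimensions and add positions.

    tokens: (N, T); weight, bias: (D,); pos: (T, D). Returns (N, T, D).
    The same projection and positional table are reused for every patch.
    """
    return tokens[..., None] * weight + bias + pos

rng = np.random.default_rng(0)
H, W, P, D = 8, 8, 4, 6                      # toy sizes chosen for the example
image = rng.random((H, W))
tokens = tile_patches(image, P)              # shape (4, 16)
weight, bias = rng.normal(size=D), np.zeros(D)
pos = rng.normal(size=(P * P, D)) * 0.02     # stands in for a learned table
embedded = embed_tokens(tokens, weight, bias, pos)   # shape (4, 16, 6)
```

Keeping the patch index as the leading axis means every later per-patch operation can be written as a batched operation over that axis.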
A small Transformer (typically 2–4 layers) is applied independently to each patch's token sequence:
- Multi-head self-attention and MLP, with weights shared across all patches.
- After the Transformer blocks, the per-patch outputs are reassembled into a single feature map over the whole input via a learned linear pooling or reshaping.
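A rough sketch of the per-patch attention and reassembly steps is given below. It is deliberately simplified and should not be read as the paper's block: it uses a single attention head, omits layer normalization and the MLP sub-layer, and reassembles by plain reshaping. The names `patch_self_attention` and `untile` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_self_attention(x, wq, wk, wv):
    """Single-head self-attention applied to every patch with shared weights.

    x: (N, T, D) holding N patches of T pixel tokens. Attention is computed
    inside each patch only, so patches never exchange information here.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])   # (N, T, T)
    return x + softmax(scores) @ v                              # residual add

def untile(tokens, grid_h, grid_w, patch):
    """Reassemble (N, patch*patch, D) tokens into an (H, W, D) feature map."""
    d = tokens.shape[-1]
    t = tokens.reshape(grid_h, grid_w, patch, patch, d).transpose(0, 2, 1, 3, 4)
    return t.reshape(grid_h * patch, grid_w * patch, d)

rng = np.random.default_rng(0)
N, T, D, P = 4, 16, 6, 4                     # toy sizes: a 2 x 2 grid of 4 x 4 patches
x = rng.normal(size=(N, T, D))
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))
y = patch_self_attention(x, wq, wk, wv)      # (4, 16, 6)
features = untile(y, 2, 2, P)                # (8, 8, 6)
```

Because the weights carry no patch index, the parameter count is independent of image size; only the batch dimension `N` grows with the input.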
Per-patch attention stays tractable because each sequence is short: a $P \times P$ patch holds only $P^2$ tokens, so the attention cost per patch is fixed and total memory scales linearly with the number of patches.
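To make the scaling concrete, the snippet below counts attention-matrix entries for pixel-level attention over a whole image versus attention restricted to patches. The 256 × 256 image and 16 × 16 patch are illustrative values chosen for the arithmetic, not settings taken from the paper, and `attention_entries` is a hypothetical helper.

```python
def attention_entries(height, width, patch=None):
    """Count entries in the attention matrices for pixel-level tokens.

    patch=None: one global attention matrix over all height*width pixels.
    patch=p:    one (p*p) x (p*p) matrix per non-overlapping p x p patch.
    """
    if patch is None:
        return (height * width) ** 2
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    n_patches = (height // patch) * (width // patch)
    return n_patches * (patch * patch) ** 2

full = attention_entries(256, 256)              # 65536**2 = 4294967296
tiled = attention_entries(256, 256, patch=16)   # 256 * 256**2 = 16777216
ratio = full // tiled                           # 256, the number of patches
```

In general the saving factor equals the number of patches, which is why the per-patch cost stays constant as the image grows.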
3. Pyramid Transformer and Multi-Scale Feature Synthesis
To capture global context, a multi-scale pyramid is constructed:
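- The image is downsampled by a level-dependent factor for pyramid levels $s = 0, 1, \dots, S$, with $S = 3$ (levels 0, 1, 2, 3) in the reported configuration.
- At each level, the Patch Transformer is applied, yielding features $F_s$, which are then upsampled to the original size.
- Concatenation along the channel axis provides a multi-scale representation:
$$F = \mathrm{Concat}\big(\mathrm{Up}(F_0), \mathrm{Up}(F_1), \dots, \mathrm{Up}(F_S)\big)$$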
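The pyramid construction can be illustrated with a small NumPy sketch. Several choices here are assumptions rather than details confirmed by the source: the downsampling factor is taken to be a power of two per level, average pooling stands in for bilinear resizing, and `toy_encoder` is a hypothetical placeholder for the Patch Transformer.

```python
import numpy as np

def downsample(image, factor):
    """Average-pool an (H, W) image by an integer factor."""
    h, w = image.shape
    return image.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(features, factor):
    """Nearest-neighbour upsampling of an (h, w, C) feature map."""
    return features.repeat(factor, axis=0).repeat(factor, axis=1)

def pyramid_features(image, encoder, levels):
    """Encode the image at several scales and concatenate along channels."""
    maps = []
    for s in range(levels):
        factor = 2 ** s                       # assumed power-of-two pyramid
        feat = encoder(downsample(image, factor))
        maps.append(upsample(feat, factor))   # back to the original size
    return np.concatenate(maps, axis=-1)

def toy_encoder(image):
    """Hypothetical stand-in for the Patch Transformer: (h, w) -> (h, w, 2)."""
    return np.stack([image, image ** 2], axis=-1)

image = np.arange(64, dtype=float).reshape(8, 8) / 63.0
multi_scale = pyramid_features(image, toy_encoder, levels=4)   # shape (8, 8, 8)
```

At the coarsest level of this toy example the image collapses to a single value, so the corresponding channels are constant across the map; that is the sense in which deep levels carry near-global context.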
Receptive fields grow with pyramid level: the deepest provides near-global context, while shallow levels focus on local detail. This aggregation ensures that both fine structures and large-scale objects are well represented.
4. Decoder and Training Methodology
The decoder is a lightweight pixel-wise MLP:
- Two fully connected layers per pixel, with GELU nonlinearity and final Tanh activation.
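- Equation: $\hat{I} = \tanh\big(W_2\,\mathrm{GELU}(W_1 F + b_1) + b_2\big)$, applied at each pixel to the concatenated multi-scale feature $F$ (Equation 10 in Fu et al., 2021)
- End-to-end training uses an $\ell_2$ (MSE) reconstruction loss over generic images.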
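A pixel-wise decoder of this shape, together with the reconstruction loss, might look like the following NumPy sketch. The layer widths, random weights, and function names are placeholders for illustration, and the tanh approximation of GELU is used for brevity.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU nonlinearity."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def decode(features, w1, b1, w2, b2):
    """Two fully connected layers applied independently at every pixel.

    features: (H, W, C) -> returns an (H, W) image with values in [-1, 1].
    """
    hidden = gelu(features @ w1 + b1)
    return np.tanh(hidden @ w2 + b2)[..., 0]

def mse(pred, target):
    """Mean squared error reconstruction loss."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
C, hidden_dim = 8, 16                        # placeholder widths
features = rng.normal(size=(8, 8, C))
w1, b1 = rng.normal(size=(C, hidden_dim)) * 0.1, np.zeros(hidden_dim)
w2, b2 = rng.normal(size=(hidden_dim, 1)) * 0.1, np.zeros(1)
recon = decode(features, w1, b1, w2, b2)     # shape (8, 8)
loss = mse(recon, rng.uniform(-1, 1, size=(8, 8)))
```

Since the decoder sees one pixel's feature vector at a time, all spatial reasoning has to come from the encoder features it is given.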
PPT is trained on MS-COCO (82k images) and ImageNet-1K with the Adam optimizer (batch size 1, 50 epochs, a single 8 GB GPU). The remaining settings (learning rate, input size, patch size, embedding dimension, Transformer depth, and decoder hidden dimension) follow Fu et al. (2021).
5. Application to Image Fusion
Once trained, PPT weights are frozen for all downstream fusion tasks. For 2-source fusion (e.g., IR + visible), each input is encoded independently:
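- Encoding: $F_A = \mathrm{Enc}(I_A)$, $F_B = \mathrm{Enc}(I_B)$ for source images $I_A$ and $I_B$
- Feature maps are fused with simple pixel-level operations:
  - Average: $F_f = (F_A + F_B) / 2$
  - Maximum: $F_f = \max(F_A, F_B)$ (elementwise)
  - Softmax fusion: $F_f = w_A \odot F_A + w_B \odot F_B$, where $w_A, w_B$ are softmax-normalized per-element
- Decoding: $I_f = \mathrm{Dec}(F_f)$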
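The three fusion rules are simple enough to write out directly. The sketch below operates on two feature maps of equal shape. For the softmax rule, the source only states that the weights are softmax-normalized per element; using the feature values themselves as the softmax inputs is an assumption made here, and the function names are hypothetical.

```python
import numpy as np

def fuse_average(fa, fb):
    """Elementwise mean of the two feature maps."""
    return 0.5 * (fa + fb)

def fuse_max(fa, fb):
    """Elementwise maximum of the two feature maps."""
    return np.maximum(fa, fb)

def fuse_softmax(fa, fb):
    """Weighted sum with per-element softmax weights over the two sources.

    Assumption: the softmax inputs are the feature values themselves.
    """
    stacked = np.stack([fa, fb])                         # (2, ...)
    weights = np.exp(stacked - stacked.max(axis=0))      # stable exponent
    weights /= weights.sum(axis=0)
    return (weights * stacked).sum(axis=0)

fa = np.array([[0.0, 2.0], [1.0, -1.0]])
fb = np.array([[0.0, 0.0], [3.0, -1.0]])
avg = fuse_average(fa, fb)      # [[0, 1], [2, -1]]
mx = fuse_max(fa, fb)           # [[0, 2], [3, -1]]
sm = fuse_softmax(fa, fb)       # lies between the elementwise min and max
```

All three rules are parameter-free, which is what allows the encoder and decoder to stay frozen across fusion tasks.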
No fine-tuning or retraining is required. PPT fusion has been evaluated on a range of benchmarks, where it ranks at or near the top on several metrics:
- IR+VIS fusion (TNO): SCD↑=1.6261 (1st), SSIM↑=0.7568 (2nd), Nabf↓=0.0005 (1st), Q_S↑=0.7945
- Multi-focus fusion (Lytro): SD↑=60.64 (1st), EN↑=7.525 (2nd), SSIM↑=0.9182, CC↑=0.9871 (1st)
- Medical (Harvard MRI+PET): SCD↑=0.9661, VIF↑=0.3236 (1st), CC↑=0.8875
- Multi-exposure HDR: Q_MI↑=1.0360, Q_NCIE↑=0.8290, CC↑=0.7190
Across all tasks, PPT outperforms or matches the best results on key metrics (SCD, SSIM, FMI, Nabf, CC, EN, VIF, MI) compared to classical and recent CNN/GAN-based fusion networks (Fu et al., 2021).
| Fusion Task | Top PPT Metric(s) | Rank (Among Methods) |
|---|---|---|
| IR+VIS (TNO) | SCD (1.6261), Nabf (0.0005) | 1st |
| Multi-focus (Lytro) | SD (60.64), CC (0.9871) | 1st |
| Medical (Harvard MRI+PET) | VIF (0.3236) | 1st |
| Multi-exposure HDR | Q_MI (1.0360) | 1st |
6. Hyperparameters and Implementation Considerations
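- Input size: fixed $H \times W$ training resolution (can be applied patch-wise to larger images)
- Patch size: $P \times P$ (implying $HW / P^2$ patches per scale at an $H \times W$ input)
- Number of pyramid levels: 4 (i.e., levels 0, 1, 2, 3)
- Embedding dimension: $D$ (two settings are used)
- Transformer depth per patch: $L$ (typically 2–4 layers)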
- Fully learnable positional embeddings at the patch-token level
- Down/Up sampling: Bilinear or nearest-neighbor
- Activation functions: GELU in MLPs; Tanh at decoder output
Training is performed on general datasets with no fusion pairs, and the model is convolution-free, requiring no patch-wise convolutional priors or architectural amendments for new tasks.
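One way to keep these hyperparameters together and derive dependent quantities is a small configuration object, sketched below. Every default value is a hypothetical placeholder rather than a setting from the paper, except the four pyramid levels and a depth within the 2–4 layer range mentioned above. The power-of-two pyramid and the assumption that each level contributes `embed_dim` channels to the decoder input are also illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PPTConfig:
    input_size: int = 256    # hypothetical square input side
    patch_size: int = 16     # hypothetical patch side
    levels: int = 4          # pyramid levels 0, 1, 2, 3
    embed_dim: int = 32      # hypothetical embedding dimension
    depth: int = 2           # Transformer layers per patch (2-4 range)

    def level_size(self, level):
        """Image side at a pyramid level, assuming a factor of 2 per level."""
        return self.input_size // (2 ** level)

    def patches_at(self, level):
        """Number of non-overlapping patches at a pyramid level."""
        side = self.level_size(level)
        if side % self.patch_size:
            raise ValueError("level size must be divisible by the patch size")
        return (side // self.patch_size) ** 2

    @property
    def decoder_in_channels(self):
        """Channels after concatenation, assuming embed_dim per level."""
        return self.levels * self.embed_dim

cfg = PPTConfig()
counts = [cfg.patches_at(s) for s in range(cfg.levels)]   # [256, 64, 16, 4]
```

The divisibility check makes explicit one practical constraint: the coarsest pyramid level must still be at least one patch wide.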
7. Characteristics Enabling State-of-the-Art Performance
PPT achieves efficacy on image fusion tasks through several mechanisms:
- Preservation of Local Detail: Each patch-wise Transformer directly models every pixel–pixel correlation within its region, supporting edge, texture, and contour retention.
- Global Contextual Reasoning: The pyramid structure allows features at progressively broader receptive fields, up to the entire image.
- Data Efficiency: No ultra-large pretraining corpora are needed—MS-COCO and ImageNet-1K suffice for strong generalization.
- Modularity: The encoder can be reapplied in a Siamese manner to arbitrary multi-source inputs, and features are aligned for simple, direct fusion.
- Convolution-Free Paradigm: The design avoids the locality biases and depth constraints of CNNs, utilizing only Transformer mechanisms for the entire signal path.
PPT reports state-of-the-art results across the fusion tasks evaluated, ranking first or second on metrics covering structural similarity, mutual information, correlation, and artificial noise, all using a single, frozen network (Fu et al., 2021).