- The paper presents IPT, which leverages a Transformer architecture to tackle low-level vision tasks such as denoising, super-resolution, and deraining.
- It combines task-specific multi-head feature extraction, a shared Transformer encoder-decoder body, and contrastive learning for robust image reconstruction.
- Extensive benchmarks demonstrate that IPT outperforms state-of-the-art methods on datasets like Urban100 for super-resolution and BSD68 for denoising.
Overview of "Pre-Trained Image Processing Transformer"
The paper "Pre-Trained Image Processing Transformer" by Hanting Chen et al. presents a novel pre-trained model named Image Processing Transformer (IPT), targeted at low-level computer vision tasks such as denoising, super-resolution, and deraining. Leveraging the Transformer architecture, which has already shown substantial success in NLP, the authors introduce a model that performs strongly across multiple image processing domains. IPT is trained on a large-scale synthesized dataset derived from ImageNet in order to exploit the full potential of the Transformer for diverse image processing tasks.
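A core idea in applying Transformers to images is to split a feature map into small patches and flatten each one into a token, the image-processing analogue of a "word" in NLP. The helper below is a minimal, hypothetical sketch of that tokenization step (the function name and patch size are illustrative, not the authors' code):

```python
import numpy as np

def to_patch_tokens(feat, patch=4):
    """Split a CxHxW feature map into a sequence of flattened PxP patches,
    so each patch becomes one token for the Transformer (illustrative helper)."""
    c, h, w = feat.shape
    assert h % patch == 0 and w % patch == 0, "spatial dims must divide evenly"
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            # Flatten the C x P x P patch into a single token vector.
            tokens.append(feat[:, i:i + patch, j:j + patch].reshape(-1))
    return np.stack(tokens)  # shape: (num_patches, C*P*P)

feat = np.random.rand(64, 48, 48).astype(np.float32)
tokens = to_patch_tokens(feat, patch=4)
print(tokens.shape)  # (144, 1024): a 12x12 grid of patches, each 64*4*4 values
```

The sequence of patch tokens can then be fed to a standard Transformer, just as a sequence of word embeddings would be.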
Model Architecture
The IPT model comprises three key components:
- Multi-Headed Feature Extraction: This component employs a separate head for each image processing task to extract features from the input image.
- Transformer Encoder-Decoder: Built on self-attention, this shared core comprises a stack of encoder and decoder layers. The encoders capture global contextual information, while the decoders, guided by task-specific embeddings, adapt it into task-specific representations.
- Task-Specific Output Layers: Multiple specialized 'tail' modules, one per task, map the decoded features back into image form.
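The head-body-tail layout above can be sketched in a few lines. In this toy sketch, plain linear maps stand in for the real convolutional heads/tails and the Transformer body; all names and dimensions are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class IPTSketch:
    """Toy head/body/tail routing: per-task heads and tails around a
    shared body (linear maps stand in for the real modules)."""
    def __init__(self, tasks, dim=16):
        self.heads = {t: rng.normal(size=(dim, dim)) for t in tasks}
        self.body = rng.normal(size=(dim, dim))       # shared encoder-decoder stand-in
        self.tails = {t: rng.normal(size=(dim, dim)) for t in tasks}

    def forward(self, x, task):
        h = x @ self.heads[task]     # task-specific feature extraction
        z = h @ self.body            # shared body processes all tasks
        return z @ self.tails[task]  # task-specific reconstruction

model = IPTSketch(["denoise", "sr_x2", "derain"])
x = rng.normal(size=(10, 16))        # 10 tokens of dimension 16
out = model.forward(x, "denoise")
print(out.shape)  # (10, 16)
```

The key design point is that only the heads and tails are task-specific; the expensive body is shared, which is what lets a single pre-trained model serve several tasks.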
Training Strategy
To train IPT, the authors utilize a combination of supervised and self-supervised learning. The process involves the following steps:
- Dataset Generation: High-diversity images from ImageNet are corrupted using different degradation models for super-resolution, denoising, and deraining tasks, resulting in over 10 million training samples.
- Supervised Learning: Corrupted images serve as input while their original counterparts act as ground truth. IPT is trained to reconstruct the original image, using an L1 loss function for pixel-wise accuracy.
- Contrastive Learning: To generalize better across varied image processing tasks, contrastive learning is applied. This method ensures that patches (akin to "words" in NLP) within the same image are closer in feature space, while those from different images are farther apart.
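The contrastive objective described above can be illustrated with a simplified InfoNCE-style loss over patch features, where patches sharing an image id are positives and all others are negatives. This is an illustrative sketch in the spirit of the paper, not the exact loss the authors use:

```python
import numpy as np

def patch_contrastive_loss(feats, image_ids, temp=0.1):
    """InfoNCE-style loss: pull together patch features from the same image,
    push apart those from different images (simplified sketch)."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)  # cosine features
    sim = f @ f.T / temp                                      # scaled similarities
    n = len(feats)
    loss, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and image_ids[j] == image_ids[i]]
        if not positives:
            continue
        logits = np.delete(sim[i], i)              # drop self-similarity
        log_denom = np.log(np.exp(logits).sum())
        for j in positives:
            idx = j if j < i else j - 1            # index after deleting entry i
            loss += log_denom - logits[idx]        # -log softmax of the positive
            count += 1
    return loss / count

feats = np.random.rand(8, 32)          # 8 patch features of dimension 32
ids = [0, 0, 0, 0, 1, 1, 1, 1]         # patches 0-3 from image 0, 4-7 from image 1
print(patch_contrastive_loss(feats, ids))
```

Minimizing this loss drives same-image patches together in feature space, which is the generalization signal the authors add on top of the supervised L1 objective.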
Performance Evaluation
The authors conduct extensive experiments to validate the efficiency and effectiveness of their IPT model. The following are some highlights:
- Super-Resolution: Across multiple benchmark datasets (Set5, Set14, B100, Urban100), the IPT model consistently outperforms existing state-of-the-art methods for 2×, 3×, and 4× super-resolution. Notably, IPT shows a significant PSNR gain on the Urban100 dataset, indicating superior detail preservation and reconstruction.
- Denoising: For color image denoising tasks, IPT sets new performance records on the BSD68 and Urban100 datasets at different noise levels (σ = 30, 50). The model's ability to retain details, particularly in regions with substantial noise, is demonstrably higher than prior methods.
- Deraining: On the Rain100L dataset, IPT achieves the best PSNR results, again surpassing previous state-of-the-art techniques and showcasing the versatility of pre-trained Transformers in removing synthetic rain streaks effectively.
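All three benchmarks above report PSNR, which is computed from the mean squared error between the restored image and the ground truth. The standard definition can be sketched as follows (a small worked example, independent of the paper's code):

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10*log10(MAX^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

clean = np.full((16, 16), 128.0)
noisy = clean + 5.0  # uniform error of 5 intensity levels -> MSE = 25
print(round(psnr(clean, noisy), 2))  # 34.15
```

Because PSNR is logarithmic, even a gain of a few tenths of a dB on benchmarks like Urban100 corresponds to a meaningful reduction in reconstruction error.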
Implications and Future Work
The research by Chen et al. sets a clear precedent for leveraging Transformers in general image processing tasks. Training on a large-scale corrupted dataset addresses the scarcity of task-specific data, a common limitation in image processing. Furthermore, the integration of contrastive learning suggests an interesting avenue for future exploration of self-supervised learning paradigms within low-level vision tasks.
The success of IPT invites further development in several directions:
- Extended Applicability: Exploring additional image processing tasks such as inpainting, dehazing, and deblurring.
- Scaling and Optimization: Investigating methods to reduce the model complexity and computational requirements without compromising performance, making IPT more feasible for resource-constrained environments.
- Hybrid Architectures: Combining the strengths of CNNs and Transformers to exploit local and global structural information synergistically.
In conclusion, the paper presents a compelling case for adopting Transformer-based models in image processing, showing that pre-training not only boosts performance across diverse tasks but also yields a more general and robust feature representation. This work lays a substantial foundation for future enhancements and for broadening the scope of AI-driven image enhancement technologies.