Decoder Denoising Pretraining for Semantic Segmentation (2205.11423v1)

Published 23 May 2022 in cs.CV

Abstract: Semantic segmentation labels are expensive and time consuming to acquire. Hence, pretraining is commonly used to improve the label-efficiency of segmentation models. Typically, the encoder of a segmentation model is pretrained as a classifier and the decoder is randomly initialized. Here, we argue that random initialization of the decoder can be suboptimal, especially when few labeled examples are available. We propose a decoder pretraining approach based on denoising, which can be combined with supervised pretraining of the encoder. We find that decoder denoising pretraining on the ImageNet dataset strongly outperforms encoder-only supervised pretraining. Despite its simplicity, decoder denoising pretraining achieves state-of-the-art results on label-efficient semantic segmentation and offers considerable gains on the Cityscapes, Pascal Context, and ADE20K datasets.

Citations (23)

Summary

Decoder Denoising Pretraining for Semantic Segmentation

Semantic segmentation tasks in computer vision rely on dense pixel-level predictions, which require substantial annotation effort. To reduce the labeling burden, pretraining strategies are commonly employed to improve the label efficiency of segmentation models. Typically, the encoder is pretrained as a classifier while the decoder is initialized randomly. This paper challenges that convention by proposing a decoder pretraining approach based on denoising, aimed specifically at improving performance when little labeled data is available.

Methodology Overview

The central argument of this paper is that random initialization of the decoder is suboptimal. The authors introduce Decoder Denoising Pretraining (DDeP), a method that pretrains the decoder with a denoising objective while the encoder benefits from supervised pretraining. The approach draws inspiration from denoising autoencoders and recent advances in Denoising Diffusion Probabilistic Models (DDPMs), which have shown remarkable results in generative tasks.
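As a rough formalization of the idea (the notation is ours, and the exact scaling used in the paper may differ), the decoder is trained on a denoising objective of the following form:

```latex
% Hedged formalization of the denoising objective (notation ours).
% f_theta denotes the encoder-decoder network being pretrained.

% Standard denoising: corrupt the clean image x with Gaussian noise and
% reconstruct it.
\tilde{x} = x + \sigma \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
\qquad L_{\mathrm{image}} = \left\| f_\theta(\tilde{x}) - x \right\|_2^2

% DDeP instead predicts the added noise, which the summary reports to work
% better empirically.
L_{\mathrm{noise}} = \left\| f_\theta(\tilde{x}) - \epsilon \right\|_2^2

% Diffusion-style variant: relatively scale image and noise before mixing,
% with gamma in (0, 1) controlling the trade-off.
\tilde{x}_\gamma = \sqrt{\gamma}\, x + \sqrt{1 - \gamma}\, \sigma \epsilon
```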

Implementation Details

The encoder-decoder architecture used in this paper employs a TransUNet model with Hybrid-ViT as the backbone. The encoder is pretrained on the ImageNet dataset using supervised classification objectives. The novel element of this work is the pretraining of the decoder using denoising techniques:

  1. Denoising Target: Gaussian noise is added to the input images, and the decoder is trained to undo this corruption. Rather than reconstructing the clean image directly, DDeP trains the decoder to predict the added noise, which empirically yields better performance (a code sketch follows this list).
  2. Noise Scaling: Rather than simply adding noise, the clean image and the noise component are relatively scaled before being combined, akin to the forward process in diffusion models, with the scaling treated as a tunable part of the objective.
  3. Pretraining Dataset: Decoder denoising pretraining is performed on the same ImageNet-21K images used for supervised encoder pretraining (the labels are not needed for denoising), and the benefits carry over when the model is transferred to segmentation datasets such as Cityscapes, Pascal Context, and ADE20K.
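Below is a minimal PyTorch-style sketch of a single decoder denoising pretraining step consistent with the description above. The function name, the `sigma` and `gamma` values, and the assumption that the network directly outputs a noise estimate of the same shape as the input are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def ddep_pretraining_step(model, images, optimizer, sigma=0.2, gamma=0.8):
    """One decoder denoising pretraining step (illustrative sketch).

    `model` is an encoder-decoder network (e.g. a TransUNet-style model)
    whose encoder is initialized from supervised ImageNet pretraining.
    `sigma` and `gamma` are assumed hyperparameters controlling the noise
    magnitude and the relative scaling of image vs. noise.
    """
    noise = torch.randn_like(images)
    # Diffusion-style corruption: relatively scale the image and noise components.
    noisy = gamma ** 0.5 * images + (1.0 - gamma) ** 0.5 * sigma * noise

    # The network predicts the added noise rather than the clean image.
    pred_noise = model(noisy)
    loss = F.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After pretraining, the decoder weights (together with the supervised encoder weights) initialize the segmentation model, which is then fine-tuned on the labeled segmentation masks.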

Results and Evaluation

Empirical evaluation across several benchmark datasets demonstrates that Decoder Denoising Pretraining significantly enhances semantic segmentation performance, particularly in low-label scenarios. Key findings include:

  • Pascal Context Dataset: DDeP improves mean IoU considerably in low-label settings, outperforming encoder-only supervised pretraining by a wide margin.
  • Cityscapes: In both full-label and partial-label settings, DDeP consistently achieves higher segmentation accuracy than models with randomly initialized decoders, including those whose encoders are supervised-pretrained.
  • Cross-Dataset Generalization: Despite potential distribution shifts, pretraining the decoder on a generic image dataset (ImageNet-21K) enhances segmentation results across disparate target datasets.

Implications and Future Directions

Decoder Denoising Pretraining presents a compelling direction for enhancing semantic segmentation models, particularly when annotated data is scarce. Its promise lies in making decoder capacity worthwhile: larger and more advanced decoder architectures can be used without the performance ceiling imposed by random initialization. Interesting avenues remain in exploring the synergy between denoising-based pretraining and other label-efficient methodologies such as self-supervised learning and related data-efficient training paradigms.

The paper opens pathways for further research into applying denoising principles to other dense prediction tasks, encouraging exploration into the theoretical underpinnings of denoising autoencoders and diffusion models in learning transferable image representations.
