- The paper introduces DUpsampling, a novel data-dependent upsampling method that reconstructs high-resolution predictions using label space redundancy.
- It demonstrates superior performance, achieving 88.1% mIOU on PASCAL VOC while using only 30% of the computation required by the previous best model.
- By decoupling the decoder from feature-map resolution, the approach enables flexible feature aggregation (with training eased by an adaptive-temperature softmax), paving the way for efficient, resource-optimized segmentation architectures.
Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation
In recent advancements in semantic segmentation, encoder-decoder architectures built on fully convolutional networks (FCNs) have been pivotal. Within the decoder, bilinear upsampling is the conventional technique for recovering pixel-wise predictions from low-resolution outputs. The paper under discussion, however, argues that bilinear upsampling is too simple and entirely data-independent: to compensate, the encoder must emit unnecessarily high-resolution feature maps, which constrains the overall architecture and creates a performance bottleneck.
The principal contribution of this research is a novel data-dependent upsampling method termed DUpsampling. Unlike bilinear upsampling, DUpsampling exploits the redundancy inherent in the segmentation label space to reconstruct high-resolution predictions from considerably lower-resolution CNN outputs (downsampled by a factor of 16 or 32). This substantially reduces computational complexity, reaching state-of-the-art segmentation accuracy at roughly 30% of the computational cost of the previous best model.
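A rough sketch of the label-compression idea may help. The paper learns a linear code for r x r patches of one-hot ground-truth labels by minimizing the patch reconstruction error, which (under an orthogonality constraint) admits a PCA-style closed-form solution. The NumPy snippet below is a minimal illustration under those assumptions; the function name `learn_label_compression` and its interface are our own, not the authors' released code.

```python
import numpy as np

def learn_label_compression(onehot_labels, r, c_tilde):
    """Learn a linear code for r x r one-hot label patches (a sketch).

    Minimizes sum ||v - W P v||^2 over flattened patches v, which the
    top right singular vectors of the patch matrix solve in closed form.
    """
    # onehot_labels: (num_images, H, W, C), with H and W divisible by r.
    n, h, w, c = onehot_labels.shape
    # Gather every r x r patch and flatten it to a vector of length r*r*C.
    patches = (onehot_labels
               .reshape(n, h // r, r, w // r, r, c)
               .transpose(0, 1, 3, 2, 4, 5)
               .reshape(-1, r * r * c))
    # Top-c_tilde right singular vectors give the optimal linear code.
    _, _, vt = np.linalg.svd(patches, full_matrices=False)
    P = vt[:c_tilde]       # compression:    x = P v,   P in R^{c_tilde x N}
    W = P.T                # reconstruction: v ~= W x,  W in R^{N x c_tilde}
    return P, W
```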
Key experimental results showcase DUpsampling's efficacy on two benchmark datasets: achieving a mean Intersection over Union (mIOU) of 88.1% on PASCAL VOC utilizing only 30% of the computation required by prior leading models, and 52.5% on PASCAL Context, indicating marked improvements in computational efficiency while maintaining competitive performance metrics.
The paper details the empirical superiority of DUpsampling over bilinear upsampling, emphasizing its higher reconstruction capacity and reduced reliance on high-resolution feature maps. Moreover, DUpsampling integrates seamlessly into the decoder: it is implemented as a standard 1x1 convolution followed by a reshape, requiring no complex integration effort.
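In that spirit, the module can be sketched in PyTorch as a 1x1 convolution that expands the channel dimension to num_classes x r^2, followed by a depth-to-space reshape (`F.pixel_shuffle`) that rearranges channels into the full-resolution prediction. The class name, channel counts, and the choice to train the projection end-to-end (rather than initializing it from the pre-learned reconstruction matrix W) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DUpsampling(nn.Module):
    """Data-dependent upsampling as a 1x1 conv + depth-to-space reshape.

    A minimal sketch: each low-resolution pixel predicts an r x r block
    of class scores, which pixel_shuffle scatters back into space.
    """
    def __init__(self, in_channels, num_classes, scale):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv2d(in_channels, num_classes * scale * scale,
                              kernel_size=1)

    def forward(self, x):
        x = self.proj(x)                       # (B, C*r*r, H/r, W/r)
        return F.pixel_shuffle(x, self.scale)  # (B, C, H, W)

# Hypothetical usage: lift 1/16-resolution features to full resolution.
dup = DUpsampling(in_channels=256, num_classes=21, scale=16)
logits = dup(torch.randn(1, 256, 32, 32))      # -> (1, 21, 512, 512)
```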
The research also proposes an adaptive-temperature softmax to improve the training dynamics of the segmentation network. Because DUpsampling's reconstruction targets demand very sharp, near one-hot activations that a vanilla softmax struggles to produce, learning the temperature jointly with the network eases optimization.
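Concretely, the adaptive-temperature softmax simply divides the logits by a temperature T that is learned by backpropagation along with the rest of the network. The minimal sketch below assumes a single scalar parameter; the module name and initial value are our own choices.

```python
import torch
import torch.nn as nn

class AdaptiveTemperatureSoftmax(nn.Module):
    """Softmax with a learnable scalar temperature T (a minimal sketch)."""
    def __init__(self, init_t=1.0):
        super().__init__()
        self.t = nn.Parameter(torch.tensor(init_t))

    def forward(self, logits):
        # softmax(z / T): T < 1 sharpens the distribution, T > 1 softens it,
        # letting the network find the sharpness its targets require.
        return torch.softmax(logits / self.t, dim=1)
```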
Importantly, the flexibility afforded by DUpsampling permits arbitrary feature aggregation strategies within the decoder. By decoupling the decoder from feature-map resolution, it allows low-level features to be downsampled simply and efficiently before merging, which broadens the design space for feature integration: fusion is no longer constrained to happen at high resolution. A sketch of this downsample-then-fuse step follows.
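The sketch below inverts the conventional order: the low-level feature map is downsampled (e.g., bilinearly) to the size of the final, coarsest feature map before concatenation, so every fused convolution runs at the cheapest spatial size and DUpsampling restores full resolution afterwards. The helper name `fuse_features` and the `fuse_conv` argument are hypothetical placeholders for whatever fusion block a designer prefers.

```python
import torch
import torch.nn.functional as F

def fuse_features(f_low, f_last, fuse_conv):
    """Downsample-then-fuse aggregation (a sketch).

    Brings the low-level map down to the coarsest resolution instead of
    upsampling the coarse map, then fuses with any conv block.
    """
    f_low = F.interpolate(f_low, size=f_last.shape[-2:],
                          mode='bilinear', align_corners=False)
    return fuse_conv(torch.cat([f_low, f_last], dim=1))

# Hypothetical usage with a 3x3 fusion convolution:
fuse = torch.nn.Conv2d(48 + 256, 256, kernel_size=3, padding=1)
out = fuse_features(torch.randn(1, 48, 128, 128),
                    torch.randn(1, 256, 32, 32), fuse)  # -> (1, 256, 32, 32)
```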
This work contributes significantly to reducing computational resource demands in semantic segmentation architectures, presenting a scalable and efficient alternative for real-time applications. Additionally, it sets a precedent for future research to explore further compression strategies in label spaces, potentially leading to more innovations in resource-optimized segmentation models.
The theoretical implications suggest a shift in how feature aggregation and segmentation prediction are approached, emphasizing data-dependent methods over traditional data-independent ones. Practically, the flexibility and reduced computational overhead make encoder-decoder architectures better suited to deployment in real-world, resource-constrained environments, such as mobile applications and autonomous systems.
In conclusion, this paper demonstrates that addressing the limitations of existing decoder architectures, particularly through the flexible feature aggregation that DUpsampling enables, yields substantial improvements in semantic segmentation. The findings advocate continued exploration of data-dependent methodologies, with likely future gains in efficiency that do not compromise accuracy.