Exploring Cross-Image Pixel Contrast for Semantic Segmentation (2101.11939v4)

Published 28 Jan 2021 in cs.CV

Abstract: Current semantic segmentation methods focus only on mining "local" context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting. The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes. It raises a pixel-wise metric learning paradigm for semantic segmentation, by explicitly exploring the structures of labeled pixels, which were rarely explored before. Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing. We experimentally show that, with famous segmentation models (i.e., DeepLabV3, HRNet, OCR) and backbones (i.e., ResNet, HR-Net), our method brings consistent performance improvements across diverse datasets (i.e., Cityscapes, PASCAL-Context, COCO-Stuff, CamVid). We expect this work will encourage our community to rethink the current de facto training paradigm in fully supervised semantic segmentation.

Citations (444)

Summary

  • The paper introduces a pixel-wise contrastive learning method that integrates global context across images to improve semantic segmentation.
  • It employs a combined loss function merging cross-entropy with a contrastive loss, emphasizing both pixel-to-pixel and pixel-to-region relationships.
  • Experimental evaluations on Cityscapes, PASCAL-Context, and COCO-Stuff demonstrate enhanced class discrimination and segmentation accuracy.

Exploring Cross-Image Pixel Contrast for Semantic Segmentation

The paper "Exploring Cross-Image Pixel Contrast for Semantic Segmentation" presents a novel approach to enhancing semantic segmentation by leveraging global context across training data. The authors propose a pixel-wise contrastive algorithm, inspired by unsupervised contrastive learning, capable of integrating into existing segmentation frameworks to improve the representation learning without additional computational burden during testing.

Summary of Contributions

This research addresses a fundamental challenge in semantic segmentation: current methods focus on local context within individual images and largely neglect the global semantic relations between pixels across different images. The authors introduce a supervised pixel-wise contrastive learning methodology that pulls embeddings of pixels from the same semantic class closer together than embeddings of pixels from different classes, establishing a structured metric learning paradigm for segmentation.

Methodological Approach

The core innovation is a contrastive learning formulation extended to the pixel-wise segmentation setting. The proposed loss function combines the traditional pixel-wise cross-entropy with a contrastive term computed over both pixel-to-pixel and pixel-to-region relationships. The method uses a memory bank to store embeddings efficiently, which strengthens the contrastive learning process by providing a large, diverse set of positive and negative samples at each training step.
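
The following PyTorch sketch illustrates the shape of such a combined objective: per-pixel cross-entropy plus a supervised, InfoNCE-style contrastive term over sampled pixel embeddings. The function names, tensor shapes, temperature, and weighting factor are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def pixel_contrastive_loss(anchor_emb, anchor_lbl, bank_emb, bank_lbl, temperature=0.1):
    """Supervised contrastive loss over pixel embeddings (illustrative sketch).
    anchor_emb: (N, D) L2-normalized embeddings of sampled anchor pixels
    anchor_lbl: (N,)   class labels of the anchors
    bank_emb:   (M, D) L2-normalized embeddings drawn from the memory bank
    bank_lbl:   (M,)   class labels of the bank entries
    """
    logits = anchor_emb @ bank_emb.t() / temperature                # (N, M) similarities
    pos_mask = (anchor_lbl[:, None] == bank_lbl[None, :]).float()   # same-class pairs are positives

    # log-softmax over all bank entries, averaged over each anchor's positives
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    mean_log_prob_pos = (pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return -mean_log_prob_pos.mean()

def total_loss(seg_logits, labels, anchor_emb, anchor_lbl, bank_emb, bank_lbl, alpha=0.1):
    # Standard per-pixel cross-entropy on the segmentation head ...
    ce = F.cross_entropy(seg_logits, labels, ignore_index=255)
    # ... plus the contrastive term on the projection-head embeddings.
    nce = pixel_contrastive_loss(anchor_emb, anchor_lbl, bank_emb, bank_lbl)
    return ce + alpha * nce
```

Because the bank entries come from other images as well as the current one, each anchor is contrasted against cross-image positives and negatives, which is what supplies the "global" signal the paper argues for.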

  1. Pixel Contrast: The method incorporates inter-image pixel contrast to consider global context, offering significant improvements over intra-image contrast approaches.
  2. Memory Design: A sophisticated memory bank design stores pixel and region embeddings, capturing global context without excessive memory consumption (a structural sketch follows this list).
  3. Hard Example Sampling: The authors develop advanced sampling strategies to prioritize training on harder examples, ensuring robust feature representation learning.
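
To make the memory design concrete, here is a hypothetical sketch of a per-class memory holding a FIFO queue of pixel embeddings plus a momentum-averaged region embedding. The queue size, the number of pixels enqueued per step, and the momentum value are assumptions chosen for exposition; the paper's actual memory keeps region embeddings per image and is paired with the hard example sampling strategies described above.

```python
import torch

class SegMemoryBank:
    def __init__(self, num_classes, dim, pixels_per_class=1024):
        # Per-class FIFO queue of pixel embeddings (the "pixel memory").
        self.pixel_queue = torch.zeros(num_classes, pixels_per_class, dim)
        self.queue_ptr = torch.zeros(num_classes, dtype=torch.long)
        # One slowly updated embedding per class (a simplified "region memory").
        self.region_mem = torch.zeros(num_classes, dim)

    @torch.no_grad()
    def update(self, emb, lbl, momentum=0.99):
        """emb: (N, D) normalized pixel embeddings; lbl: (N,) their class labels."""
        for c in lbl.unique():
            feats = emb[lbl == c]
            # Region memory: momentum-averaged mean embedding of class c.
            self.region_mem[c] = momentum * self.region_mem[c] + (1 - momentum) * feats.mean(0)
            # Pixel memory: enqueue a few randomly chosen pixels of class c.
            k = min(feats.shape[0], 16)
            idx = torch.randperm(feats.shape[0])[:k]
            ptr = self.queue_ptr[c].item()
            end = min(ptr + k, self.pixel_queue.shape[1])
            self.pixel_queue[c, ptr:end] = feats[idx][: end - ptr]
            self.queue_ptr[c] = end % self.pixel_queue.shape[1]
```

At each step, anchors from the current batch would be contrasted against entries drawn from this bank, for example favoring the most similar cross-class entries as hard negatives, in line with the sampling strategies above.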

Experimental Validation

The proposed method was evaluated using state-of-the-art segmentation models such as DeepLabV3, HRNet, and OCR across several challenging datasets, including Cityscapes, PASCAL-Context, and COCO-Stuff. The results show consistent performance improvements, validating the efficacy of integrating global context via pixel-wise contrast.

  1. Cityscapes Dataset: The approach yielded substantial performance gains, surpassing many recent methods.
  2. PASCAL-Context and COCO-Stuff: Demonstrated improved segmentation accuracy by enhancing inter-class discrimination and intra-class compactness in learned embeddings.

Implications and Future Directions

The findings underscore the potential of employing global context to enhance semantic segmentation through advanced metric learning frameworks. This approach not only improves segmentation accuracy but also opens avenues for future research in dense image prediction tasks such as pose estimation and medical imaging.

Further exploration could address intelligent data sampling mechanisms, the development of new loss functions that simultaneously consider higher-order and global contexts, class-balancing strategies during training, and extended applications across other vision tasks.

In conclusion, this work marks a significant step towards understanding and leveraging pixel relationships on a global scale, enabling more robust and accurate semantic segmentation solutions that could have far-reaching impact in computer vision.