- The paper introduces a weakly-supervised RGB-D salient object detection method using scribble annotations and inter-modal mutual information regularization to reduce labeling costs.
- The framework utilizes an asymmetric feature extractor and a multimodal variational autoencoder for enhanced multimodal learning and prediction refinement.
- Experiments show competitive performance on benchmarks, often matching or exceeding fully-supervised methods, validating its effectiveness for cost-sensitive applications.
The paper explores weakly-supervised RGB-D salient object detection guided by scribble annotations. Training salient object detectors from RGB-D data under limited supervision directly targets labeling cost and efficiency: conventional models depend on dense pixel-level annotations over large datasets, whereas this work replaces them with far cheaper scribble annotations, reducing both annotation time and resource expenditure.
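As a concrete illustration of how scribble supervision typically enters training, the sketch below computes a partial binary cross-entropy that only counts scribble-annotated pixels and ignores the rest. This is a common convention for scribble-based saliency learning; the tensor names and the absence of auxiliary terms (e.g., smoothness losses) are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def partial_bce_loss(pred, scribble, mask):
    """Binary cross-entropy restricted to scribble-annotated pixels.

    pred:     (B, 1, H, W) saliency logits
    scribble: (B, 1, H, W) float scribble labels (1 = foreground, 0 = background)
    mask:     (B, 1, H, W) float mask, 1 where a pixel carries a scribble, 0 elsewhere
    """
    loss = F.binary_cross_entropy_with_logits(pred, scribble, reduction="none")
    # Unlabeled pixels contribute nothing to the loss or its gradient.
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```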
Model Architecture and Methodology
The paper introduces a weakly-supervised salient object detection framework built around multimodal learning with mutual information regularization. The model fuses RGB images and depth data to improve saliency prediction. A key contribution is an inter-modal mutual information regularization term, inspired by disentangled representation learning, that minimizes the mutual information between the RGB and depth representations so that each modality contributes complementary rather than redundant information.
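One way such a penalty can be realized is with a CLUB-style variational upper bound on mutual information. The sketch below is an illustrative implementation under that assumption, not the paper's exact estimator; the pooled feature vectors `z_rgb` and `z_depth` are hypothetical names.

```python
import torch
import torch.nn as nn

class CLUBEstimator(nn.Module):
    """Variational upper bound on I(z_rgb; z_depth).

    q(z_depth | z_rgb) is modelled as a diagonal Gaussian. Minimizing the
    returned estimate with respect to the encoders pushes the two modal
    representations towards independence.
    """

    def __init__(self, dim):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.logvar = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def log_likelihood(self, z_rgb, z_depth):
        # Gaussian log-likelihood of matched pairs (up to constants);
        # the estimator's own parameters are trained by maximizing this.
        mu, logvar = self.mu(z_rgb), self.logvar(z_rgb)
        return (-((z_depth - mu) ** 2) / logvar.exp() - logvar).sum(dim=1).mean()

    def mi_upper_bound(self, z_rgb, z_depth):
        # Matched (positive) pairs minus all cross (negative) pairs.
        mu, logvar = self.mu(z_rgb), self.logvar(z_rgb)
        positive = -((z_depth - mu) ** 2) / (2.0 * logvar.exp())
        negative = -((z_depth.unsqueeze(0) - mu.unsqueeze(1)) ** 2) / (2.0 * logvar.exp().unsqueeze(1))
        return (positive.sum(dim=1) - negative.sum(dim=2).mean(dim=1)).mean()
```

In this scheme the estimator is fit by maximizing `log_likelihood` on matched RGB-depth pairs, while the saliency encoders add `mi_upper_bound` to their training loss as the regularization term.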
The framework also employs an asymmetric feature extractor, departing from the symmetric convolutional backbones used in typical saliency detection models: RGB and depth are encoded by different backbones, exploiting their differing encoding capacities to extract features from each modality more effectively. In addition, a multimodal variational autoencoder performs stochastic prediction refinement, sampling latent codes to refine pseudo-labels and produce more coherent saliency predictions.
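A minimal sketch of such an asymmetric two-stream extractor appears below, assuming a heavier ResNet-50 for RGB and a lighter ResNet-18 for single-channel depth; the backbone choices, projection width, and depth-channel handling are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class AsymmetricExtractor(nn.Module):
    """Illustrative asymmetric two-stream extractor: a heavier backbone for
    RGB and a lighter one for depth, each projected to a shared channel width."""

    def __init__(self, out_dim=256):
        super().__init__()
        rgb_net = models.resnet50(weights=None)
        depth_net = models.resnet18(weights=None)
        # Drop the average-pool and fully-connected heads, keep spatial maps.
        self.rgb_backbone = nn.Sequential(*list(rgb_net.children())[:-2])     # (B, 2048, h, w)
        self.depth_backbone = nn.Sequential(*list(depth_net.children())[:-2]) # (B, 512, h, w)
        self.rgb_proj = nn.Conv2d(2048, out_dim, kernel_size=1)
        self.depth_proj = nn.Conv2d(512, out_dim, kernel_size=1)

    def forward(self, rgb, depth):
        # depth is (B, 1, H, W); replicate it to three channels for the ImageNet stem.
        f_rgb = self.rgb_proj(self.rgb_backbone(rgb))
        f_depth = self.depth_proj(self.depth_backbone(depth.expand(-1, 3, -1, -1)))
        return f_rgb, f_depth
```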
Experimental Validation
Comprehensive experiments evaluate the model on several RGB-D salient object detection benchmarks. The results are competitive, generally matching or exceeding state-of-the-art fully-supervised counterparts, which demonstrates the efficacy of the proposed weakly-supervised method. Standard metrics, namely mean absolute error, F-measure, E-measure, and S-measure, confirm robust performance under weak supervision and underscore the practical value of the approach in domains where full supervision is impractical or too costly.
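For reference, the sketch below shows how two of these metrics are typically computed for a single prediction. The adaptive threshold used for the F-measure is a common benchmarking convention rather than a detail taken from the paper; S-measure and E-measure are omitted because they require structure- and alignment-aware definitions of their own.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and ground truth, both in [0, 1]."""
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

def f_measure(pred, gt, beta2=0.3):
    """F-measure with the adaptive threshold (twice the mean saliency)
    commonly used in salient object detection benchmarks."""
    thresh = min(2.0 * pred.mean(), 1.0)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max((gt > 0.5).sum(), 1)
    if precision + recall == 0:
        return 0.0
    return (1 + beta2) * precision * recall / (beta2 * precision + recall)
```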
Implications and Future Work
The findings have notable implications for computer vision. By reducing dependence on fully annotated datasets, the model offers a cost-effective alternative when ground truth is expensive to procure. The paper also adds insight into multimodal learning, in particular how weak supervision can be strengthened with techniques such as mutual information regularization.
Moving forward, research can focus on making these methods more robust to noisy or imprecise annotations, and on applying mutual information regularization to other domains that require modality fusion under weak supervision. As deep learning methods continue to evolve, combining variational autoencoder techniques with multi-task learning under weak supervision could yield further advances in the field.