- The paper presents the JL-DCF framework that leverages a Siamese network to jointly learn RGB and depth features for salient object detection.
- The method integrates Joint Learning with Densely Cooperative Fusion to capture cross-modal commonalities and effectively fuse multi-scale features.
- Empirical results show an approximately 2% improvement in maximum F-measure across several benchmarks, and the framework generalizes to RGB-T and video SOD tasks.
Siamese Network for RGB-D Salient Object Detection and Beyond
The paper "Siamese Network for RGB-D Salient Object Detection and Beyond" proposes an innovative architecture called Joint Learning and Densely Cooperative Fusion (JL-DCF) that addresses the task of RGB-D salient object detection (SOD) by leveraging the commonalities and complementarities between RGB and depth modalities. The authors propose a novel application of the Siamese network architecture to simultaneously process both RGB and depth data through a shared network backbone, effectively capturing the cross-modal commonalities for identifying salient objects in scenes. This work distinguishes itself by not relying on separate feature extraction processes for RGB and depth, thereby aiming to avoid the limitations posed by smaller amounts of training data or overly elaborate training processes.
Core Components
- Joint Learning (JL): This component uses a Siamese network with shared weights to learn features jointly from the RGB and depth inputs. Weight sharing ensures that comparable salient features are extracted from both modalities while removing the need for a dedicated network per modality (see the backbone sketch above). Deep supervision is applied to the jointly learned features to encourage robust learning.
- Densely Cooperative Fusion (DCF): The DCF module complements JL by fusing cross-modal features densely across multiple scales. Its distinctive element is the cross-modal fusion (CM) module, which integrates RGB and depth features through explicit element-wise operations (addition and multiplication), strengthening the learned saliency representations; a minimal sketch of this operation follows the list.
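The CM operation is straightforward to express in code. Below is a minimal PyTorch sketch, assuming same-shaped RGB and depth feature maps at a given scale; the 1x1 compression layer and channel sizes are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CMFusion(nn.Module):
    """Illustrative cross-modal fusion: combine RGB and depth features
    with explicit element-wise addition and multiplication, then
    compress the concatenated result back to the input width."""
    def __init__(self, channels):
        super().__init__()
        self.compress = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_rgb, f_depth):
        added = f_rgb + f_depth        # preserves shared (common) activations
        multiplied = f_rgb * f_depth   # emphasizes cross-modal agreement
        return self.compress(torch.cat([added, multiplied], dim=1))

# Example: fuse two 256-channel feature maps at one decoder scale.
fused = CMFusion(256)(torch.randn(1, 256, 40, 40), torch.randn(1, 256, 40, 40))
print(fused.shape)  # torch.Size([1, 256, 40, 40])
```

In the full DCF decoder, one such fusion is applied per scale and the fused maps are aggregated through dense connections, which is what makes the fusion "densely cooperative".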
Empirical Results
The paper reports significant improvements over state-of-the-art methods on several benchmark datasets (NJU2K, NLPR, STERE, RGBD135, LFSD, SIP, and DUT-RGBD), with gains of approximately 2% in maximum F-measure across multiple datasets. The framework is also shown to transfer to related tasks such as RGB-Thermal SOD and video SOD, where it performs competitively with or better than current state-of-the-art approaches.
Theoretical and Practical Implications
The proposed framework moves beyond traditional methods by treating RGB and depth as inherently similar inputs in the saliency-detection context: both modalities carry cues about which objects stand out. This has practical implications for the accuracy and training efficiency of models that handle multimodal inputs, particularly where depth information complements RGB data or vice versa.
The capability of JL-DCF to generalize to other modalities highlights its potential in various fields such as autonomous driving, robotics, and surveillance, where multimodal inputs are the norm. The authors demonstrate the versatility of the approach by applying it to RGB-Thermal and Video SOD tasks, showing that the framework can serve as a generalized solution to multimodal detection problems.
Future Directions
Future work could further optimize the JL-DCF framework by exploring alternative backbone architectures or additional feature-fusion strategies that strengthen cross-modal learning. Another promising avenue is adaptive mechanisms for the CM modules, tailoring the network's integration function to different multimodal datasets; one hypothetical form is sketched below.
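As a purely hypothetical illustration of such an adaptive mechanism (not something proposed in the paper), a small learned gate could arbitrate per channel between the additive and multiplicative fusion paths:

```python
import torch
import torch.nn as nn

class AdaptiveCMFusion(nn.Module):
    """Hypothetical adaptive variant of the CM module: a per-channel
    gate, predicted from both modalities, blends the addition and
    multiplication fusion paths instead of weighting them equally."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global context
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),                                  # gate in [0, 1]
        )

    def forward(self, f_rgb, f_depth):
        added, multiplied = f_rgb + f_depth, f_rgb * f_depth
        g = self.gate(torch.cat([f_rgb, f_depth], dim=1))  # (B, C, 1, 1)
        return g * added + (1.0 - g) * multiplied
```

Whether such gating actually helps would need to be validated empirically per dataset, which is precisely the kind of study the future-directions discussion calls for.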
This paper extends the scope of Siamese networks beyond distance learning and matching tasks, revealing their utility in multimodal integration scenarios. The insights derived from this research may inspire further developments in multimodal neural networks, encouraging new approaches to efficiently and effectively process diverse data types.