
Specificity-preserving RGB-D Saliency Detection (2108.08162v2)

Published 18 Aug 2021 in cs.CV

Abstract: Salient object detection (SOD) on RGB and depth images has attracted increasing research interest, due to its effectiveness and the fact that depth cues can now be conveniently captured. Existing RGB-D SOD models usually adopt different fusion strategies to learn a shared representation from the two modalities (i.e., RGB and depth), while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, termed SPNet (Specificity-preserving network), which benefits SOD performance by exploring both the shared information and modality-specific properties (e.g., specificity). Specifically, we propose to adopt two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps, respectively. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and then propagate the fused feature to the next layer for integrating cross-level information. Moreover, to capture rich complementary multi-modal information for boosting the SOD performance, we propose a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using a skip connection, the hierarchical features between the encoder and decoder layers can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD and three camouflaged object detection benchmarks. The project is publicly available at: https://github.com/taozh2017/SPNet.

Authors (5)
  1. Tao Zhou (398 papers)
  2. Deng-Ping Fan (88 papers)
  3. Geng Chen (115 papers)
  4. Yi Zhou (438 papers)
  5. Huazhu Fu (185 papers)
Citations (139)

Summary

Specificity-preserving RGB-D Saliency Detection

The paper "Specificity-preserving RGB-D Saliency Detection" introduces a novel framework termed SP-Net, designed to enhance RGB-D saliency detection by preserving modality-specific characteristics while also leveraging shared information between the RGB and depth modalities. This approach addresses a prevalent issue in existing models, which often focus solely on learning shared representations from RGB and depth data, potentially neglecting the unique properties intrinsic to each modality.

SPNet's architecture is distinctive in its dual treatment of color (RGB) and depth data. Two modality-specific networks capture the features unique to each modality, while a shared learning network fuses the two streams through a Cross-Enhanced Integration Module (CIM) that enables cross-modal feature enhancement. The CIM is pivotal to the model's ability to integrate RGB and depth information effectively, and the fused features are propagated layer by layer to integrate cross-level information.
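
The summary does not reproduce the CIM's exact layer configuration, but the general idea of cross-modal enhancement followed by fusion can be illustrated with a minimal PyTorch sketch. Everything below (the gating convolutions, the residual-style enhancement, and the name `CrossEnhancedFusion`) is an illustrative assumption rather than the authors' implementation, which is available in the linked repository.

```python
import torch
import torch.nn as nn

class CrossEnhancedFusion(nn.Module):
    """Illustrative sketch of a CIM-style block: each modality's feature is
    enhanced by a gated version of the other before the two are fused.
    This is an assumption for exposition, not SPNet's exact CIM."""

    def __init__(self, channels: int):
        super().__init__()
        self.rgb_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        # Cross-enhancement: each stream is modulated by an attention map
        # computed from the other stream, with a residual connection.
        rgb_enh = f_rgb + f_rgb * self.depth_gate(f_depth)
        depth_enh = f_depth + f_depth * self.rgb_gate(f_rgb)
        # Fuse the two enhanced streams into a single shared feature map.
        return self.fuse(torch.cat([rgb_enh, depth_enh], dim=1))
```

The residual form (a feature plus its cross-modally gated version) is a common design choice because the original modality signal is preserved even when the gate from the other stream is uninformative.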

Additionally, the framework incorporates a Multi-modal Feature Aggregation (MFA) module that injects modality-specific features from each individual decoder into the shared decoder. This integration is key to better saliency prediction, as it lets the shared decoder draw on complementary cues from both data streams. Skip connections between encoder and decoder layers further combine hierarchical features, enriching the feature representation.
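
As with the CIM, the MFA module's precise design is not spelled out here; the sketch below shows one plausible aggregation scheme, in which hypothetical channel-attention gates select modality-specific cues before a merge convolution folds them into the shared decoder feature.

```python
import torch
import torch.nn as nn

class MultiModalAggregation(nn.Module):
    """Illustrative MFA-style block: modality-specific decoder features are
    re-weighted by channel attention and merged into the shared decoder
    feature. Layer choices are assumptions, not the paper's exact design."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn_rgb = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.attn_depth = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.merge = nn.Conv2d(3 * channels, channels, 3, padding=1)

    def forward(self, f_shared, f_rgb_dec, f_depth_dec):
        # Channel attention decides which modality-specific cues to inject.
        f_rgb_dec = f_rgb_dec * self.attn_rgb(f_rgb_dec)
        f_depth_dec = f_depth_dec * self.attn_depth(f_depth_dec)
        # Concatenate all three streams and merge into the shared decoder.
        return self.merge(torch.cat([f_shared, f_rgb_dec, f_depth_dec], dim=1))
```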

The paper's claims are supported by extensive experiments across six popular RGB-D saliency benchmarks and three camouflaged object detection datasets. These experiments demonstrate SPNet's superiority over existing methods, with the gains likely attributable to its specificity-preserving and feature-integration strategies. In particular, SPNet outperforms other methods on the standard metrics: structure measure ($S_{\alpha}$), mean absolute error ($\mathcal{M}$), enhanced-alignment measure ($E_{\phi}$), and F-measure ($F_{\beta}$).
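
Two of these metrics are simple enough to state directly. The NumPy sketch below computes MAE and the F-measure using conventions common in SOD evaluation (a $\beta^2 = 0.3$ weighting and an adaptive threshold of twice the mean saliency value); these conventions are assumptions here, since the summary does not detail the exact evaluation protocol. The structure and enhanced-alignment measures involve region-level comparisons and are omitted for brevity.

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error M between a saliency map and the binary ground
    truth, both assumed to take values in [0, 1]."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """F-beta measure with the beta^2 = 0.3 convention common in SOD work.
    The saliency map is binarized with an adaptive threshold of twice its
    mean value (capped at 1.0), another common convention."""
    thresh = min(2.0 * float(pred.mean()), 1.0)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return float((1 + beta2) * precision * recall
                 / (beta2 * precision + recall + 1e-8))
```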

The practical implications of SPNet are far-reaching for applications requiring precise salient object detection, especially in complex environments with challenging visual conditions. Theoretically, SPNet's methodological approach opens avenues for further exploration of multi-modal data processing, highlighting the importance of specificity-preserving strategies in fusion networks.

For future work, the paper suggests investigating lightweight network designs to reduce inference time and model size, which remain limitations despite SPNet's strong detection performance. The framework could also extend beyond traditional saliency tasks to more complex scenes and additional modalities.

The paper contributes significantly to the field by convincingly demonstrating that both specificity preservation and shared learning are necessary for improving RGB-D saliency detection. Its extension to camouflaged object detection further underlines the model's adaptability and robustness. The work also sets a precedent for multi-modal strategies that explore cross-modal interactions at a finer granularity, enriching computer vision with more sophisticated data-integration techniques.
