A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection (2007.06811v2)

Published 14 Jul 2020 in cs.CV

Abstract: Existing RGB-D salient object detection (SOD) approaches concentrate on the cross-modal fusion between the RGB stream and the depth stream. They do not deeply explore the effect of the depth map itself. In this work, we design a single stream network to directly use the depth map to guide early fusion and middle fusion between RGB and depth, which saves the feature encoder of the depth stream and achieves a lightweight and real-time model. We tactfully utilize depth information from two perspectives: (1) Overcoming the incompatibility problem caused by the great difference between modalities, we build a single stream encoder to achieve the early fusion, which can take full advantage of ImageNet pre-trained backbone model to extract rich and discriminative features. (2) We design a novel depth-enhanced dual attention module (DEDA) to efficiently provide the fore-/back-ground branches with the spatially filtered features, which enables the decoder to optimally perform the middle fusion. Besides, we put forward a pyramidally attended feature extraction module (PAFE) to accurately localize the objects of different scales. Extensive experiments demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics. Furthermore, this model is 55.5\% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image.

A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection

The paper addresses the complexities surrounding RGB-D Salient Object Detection (SOD), a task critical for discerning and segmenting visually significant regions within an image using both RGB and depth information. Traditionally, RGB-D SOD methods have employed dual-stream architectures, leveraging separate RGB and depth streams for feature extraction and subsequent fusion. However, this approach often results in increased model parameters and suboptimal use of the depth modality due to the inherent differences between the two data types.

In response, the authors propose a novel single stream network that integrates depth information directly into the RGB stream to guide both early and middle fusion. This design removes the need for a separate depth feature encoder, substantially reducing model size and computational overhead and allowing real-time inference at 32 frames per second (FPS) on $384 \times 384$ inputs.

Key Contributions and Methodology

  1. Single Stream Encoder: The architecture uses a single encoder that takes a 4-channel input formed by concatenating the RGB image with its depth map. Because the encoder builds on an ImageNet pre-trained backbone, it extracts rich, discriminative features for SOD without the separate depth encoder that dual-stream designs must train from scratch; this early fusion is argued to exploit the network's discriminative capacity more effectively (see the sketch after this list).
  2. Depth-Enhanced Dual Attention Module (DEDA): DEDA exploits depth cues to sharpen spatial discrimination between foreground and background. Its dual attention mechanism filters noise and supplies the fore-/back-ground decoder branches with spatially filtered features, enabling the decoder to perform the middle fusion effectively (a simplified sketch of this gating idea also follows the list).
  3. Pyramidally Attended Feature Extraction (PAFE): To tackle the challenge of detecting objects of varying scales, PAFE is employed to attend to feature maps at different scales. This module, inspired by non-local means attention mechanisms, enhances the representation of multi-scale contextual information, improving the localization of objects within the scene.
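To make the early fusion in item 1 concrete, the following PyTorch sketch shows one common way to feed a 4-channel RGB-D tensor into an ImageNet pre-trained backbone: the first convolution is widened to four input channels and the extra depth channel is initialized from the mean of the pre-trained RGB filters. The backbone choice (ResNet-18) and the initialization scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

def make_rgbd_backbone():
    # Load an ImageNet pre-trained encoder (illustrative choice, not the paper's).
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    old_conv = backbone.conv1  # pre-trained 3-channel stem: 64 x 3 x 7 x 7
    new_conv = nn.Conv2d(4, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding,
                         bias=old_conv.bias is not None)

    with torch.no_grad():
        # Reuse the pre-trained RGB filters and seed the depth channel
        # with their mean so activation scales stay comparable.
        new_conv.weight[:, :3] = old_conv.weight
        new_conv.weight[:, 3:] = old_conv.weight.mean(dim=1, keepdim=True)

    backbone.conv1 = new_conv
    return backbone

# Early fusion: concatenate RGB and depth along the channel axis.
rgb = torch.randn(1, 3, 384, 384)
depth = torch.randn(1, 1, 384, 384)
out = make_rgbd_backbone()(torch.cat([rgb, depth], dim=1))  # logits here;
# an SOD decoder would instead tap the intermediate feature stages.
```

Initializing the depth channel from the pre-trained filters keeps the input statistics close to what the backbone expects, which is the practical reason early fusion can reuse ImageNet weights that a from-scratch depth encoder cannot.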
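Item 2's middle fusion can be sketched in a similarly hedged way: a depth-derived spatial gate re-weights decoder features separately for the foreground and background branches. The module below is a hypothetical simplification under that assumption; the paper's actual DEDA combines additional attention cues.

```python
import torch
import torch.nn as nn

class DepthGatedBranches(nn.Module):
    """Hypothetical simplification of depth-enhanced dual attention:
    a depth-derived gate splits emphasis between foreground and
    background branches. Not the paper's exact DEDA design."""

    def __init__(self, channels: int):
        super().__init__()
        self.depth_gate = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),  # per-pixel attention weights in [0, 1]
        )
        self.fg_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.bg_branch = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats, depth):
        gate = self.depth_gate(depth)              # spatial filter from depth
        fg = self.fg_branch(feats * gate)          # emphasize gated regions
        bg = self.bg_branch(feats * (1.0 - gate))  # complementary emphasis
        return fg, bg

# Usage with assumed shapes: 64-channel decoder features, 1-channel depth map.
module = DepthGatedBranches(64)
fg, bg = module(torch.randn(1, 64, 96, 96), torch.randn(1, 1, 96, 96))
```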

Experimental Results

The proposed model is evaluated on six benchmark datasets: NJUD, RGBD135, NLPR, SSD, DUTLF-D, and SIP. It performs favorably against existing state-of-the-art methods on metrics such as F-measure, S-measure, and E-measure, and remains competitive on datasets characterized by complex scenes and challenging background interference, underscoring its robustness.
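For context, the F-measure used throughout the SOD literature is the weighted harmonic mean of precision and recall, $F_\beta = \frac{(1+\beta^2)\,\text{Precision} \times \text{Recall}}{\beta^2\,\text{Precision} + \text{Recall}}$, with $\beta^2$ conventionally set to 0.3 to emphasize precision; this specific $\beta$ value is the field's usual choice rather than a detail quoted from this paper.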

Model Efficiency

A standout result is the reduction in model size: the proposed architecture is 55.5% lighter than DMRA, the lightest comparable prior model, which underscores its efficiency. The smaller footprint translates into faster inference, crucial for applications requiring real-time processing such as robotics and autonomous systems.

Implications and Future Directions

The integration of depth data into a single stream for SOD not only optimizes existing architectures but also opens avenues for further research into hybrid modalities beyond RGB-D. Future work could explore extensions to this methodology by incorporating additional sensory inputs or improving adaptive mechanisms for environments where depth quality is inconsistent.

In conclusion, the paper presents a resource-efficient, high-performing model for RGB-D SOD, emphasizing the strategic integration of depth data through an innovative architecture. It sets a new benchmark in terms of both model efficiency and detection performance, with potential implications for advancing real-time visual processing systems.

Authors (5)
  1. Xiaoqi Zhao (25 papers)
  2. Lihe Zhang (40 papers)
  3. Youwei Pang (25 papers)
  4. Huchuan Lu (199 papers)
  5. Lei Zhang (1689 papers)
Citations (175)