- The paper introduces a depth quality-inspired feature manipulation (DQFM) process that selectively filters and fuses depth features for enhanced RGB-D salient object detection.
- It proposes a lightweight DFM-Net model that achieves state-of-the-art accuracy, operating at 140ms per frame on a CPU with only 8.5Mb model size.
- The study demonstrates that explicit depth quality assessment and boundary alignment improve cross-modal fusion, balancing computational efficiency with improved accuracy.
Depth Quality-Inspired Feature Manipulation for Efficient RGB-D Salient Object Detection
The paper "Depth Quality-Inspired Feature Manipulation for Efficient RGB-D Salient Object Detection" proposes a novel approach to enhance the efficiency and accuracy of salient object detection tasks when using RGB-D inputs. It addresses the commonly observed challenge where model accuracy often degrades when simplifying models to achieve higher efficiency, specifically in the context of RGB-D Salient Object Detection (SOD).
RGB-D SOD has gained significant research attention due to the additional spatial information provided by depth maps, which complement the visual cues from RGB images. Nevertheless, existing models face limitations in both computational efficiency and accuracy, making them less viable for real-world applications, particularly on mobile platforms where resources are constrained.
The research introduces a depth quality-inspired feature manipulation (DQFM) process as a solution. DQFM filters depth features according to their estimated quality before fusing them with RGB features, thereby exerting finer control over cross-modal fusion. Rather than relying on a costly model structure, DQFM gates the contribution of depth features through a lightweight weighting mechanism, inspired in part by the boundary alignment (BA) between RGB and depth information. This keeps the model efficient while improving accuracy.
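To make the gating idea concrete, below is a minimal PyTorch sketch of a quality-gated cross-modal fusion step. It is an illustration only: the class name DepthQualityGate, the pooled-descriptor MLP, and the single scalar gate are assumptions made for clarity, not the paper's exact DQFM modules.

```python
# Minimal sketch of a depth-quality gate for cross-modal fusion (illustrative; the class
# name and the pooled-descriptor MLP are assumptions, not the paper's exact DQFM design).
import torch
import torch.nn as nn

class DepthQualityGate(nn.Module):
    """Estimates a scalar quality weight from pooled RGB and depth descriptors,
    then gates the depth features before they are fused with the RGB features."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, max(channels // 4, 4)),
            nn.ReLU(inplace=True),
            nn.Linear(max(channels // 4, 4), 1),
            nn.Sigmoid(),                                  # gate value in (0, 1)
        )

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = rgb_feat.shape
        desc = torch.cat([self.pool(rgb_feat).view(b, c),
                          self.pool(depth_feat).view(b, c)], dim=1)
        alpha = self.mlp(desc).view(b, 1, 1, 1)            # estimated depth-quality weight
        return rgb_feat + alpha * depth_feat               # low-quality depth contributes less

# Usage sketch: fused = DepthQualityGate(64)(rgb_feat, depth_feat)
```

Because the gate reduces to a single scalar per level, it adds almost no parameters or latency, which is in line with the paper's emphasis on efficiency.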
A significant contribution of the paper is embedding DQFM in a framework that delivers state-of-the-art performance while being more computationally efficient than existing models. The resulting lightweight model, termed DFM-Net, combines a tailored depth backbone with a two-stage decoder, both designed to further improve efficiency without degrading performance.
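For orientation, the following is a highly simplified sketch of how such a two-stream design could be wired together, reusing the DepthQualityGate sketch above. The class TwoStreamSOD, the placeholder conv_block stages, the stage widths, and the top-down decoder are illustrative assumptions; DFM-Net's actual backbones and two-stage decoder differ in detail.

```python
# Illustrative two-stream skeleton in the spirit of DFM-Net (stage widths, placeholder
# conv blocks, and the top-down decoder are assumptions, not the authors' exact design).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout, stride=1):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride, 1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class TwoStreamSOD(nn.Module):
    def __init__(self):
        super().__init__()
        # Placeholder RGB stream (an efficient backbone such as MobileNetV2 would go here).
        self.rgb_stages = nn.ModuleList([conv_block(3, 16, 2), conv_block(16, 32, 2), conv_block(32, 64, 2)])
        # Deliberately tiny depth stream, echoing the idea of a tailored depth backbone.
        self.dep_stages = nn.ModuleList([conv_block(1, 16, 2), conv_block(16, 32, 2), conv_block(32, 64, 2)])
        self.gates = nn.ModuleList([DepthQualityGate(c) for c in (16, 32, 64)])
        # Two decoding steps here: (1) compress fused features, (2) top-down aggregation.
        self.compress = nn.ModuleList([nn.Conv2d(c, 16, 1) for c in (16, 32, 64)])
        self.predict = nn.Conv2d(16, 1, 3, padding=1)

    def forward(self, rgb, depth):
        fused, x, d = [], rgb, depth
        for rgb_stage, dep_stage, gate in zip(self.rgb_stages, self.dep_stages, self.gates):
            x, d = rgb_stage(x), dep_stage(d)
            fused.append(gate(x, d))                      # quality-gated fusion at each level
        feats = [c(f) for c, f in zip(self.compress, fused)]
        out = feats[-1]
        for f in reversed(feats[:-1]):                    # coarse-to-fine aggregation
            out = f + F.interpolate(out, size=f.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(self.predict(out))
```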
The experimental results presented are noteworthy: DFM-Net runs at 140ms per frame on a CPU with a model size of only 8.5Mb, making it 2.2 times faster than the previously fastest efficient model at just 14.9% of its model size. Moreover, the model achieves state-of-the-art accuracy, surpassing not only other efficient models but also many non-efficient models.
Additionally, the paper examines how unstable depth quality affects detection algorithms and emphasizes how explicit quality assessment across scales can mitigate these challenges. DQFM's reliance on holistic attention and alignment-based feature gating is pivotal to producing accurate, depth-informed predictions without heavy computational overhead, a property that anticipates broader applications of RGB-D SOD in practical scenarios.
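As a rough illustration of the boundary-alignment intuition, the snippet below scores how well depth edges overlap RGB edges, which could serve as a crude depth-quality proxy. The function boundary_alignment_score and its Sobel-based formulation are assumptions for exposition, not the paper's BA term.

```python
# Hedged sketch: a Sobel-based proxy for RGB-depth boundary alignment (illustrative,
# not the paper's exact BA formulation).
import torch
import torch.nn.functional as F

def boundary_alignment_score(rgb: torch.Tensor, depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rgb: (B, 3, H, W), depth: (B, 1, H, W), both scaled to [0, 1].
    Returns one alignment score per image; low scores suggest poor depth quality."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=rgb.device, dtype=rgb.dtype).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)

    def edge_magnitude(x):
        x = x.mean(dim=1, keepdim=True)                    # collapse channels to a single map
        gx = F.conv2d(x, kx, padding=1)
        gy = F.conv2d(x, ky, padding=1)
        return torch.sqrt(gx ** 2 + gy ** 2 + eps)

    er, ed = edge_magnitude(rgb), edge_magnitude(depth)
    # Cosine-style overlap of edge energy; values lie in [0, 1] since edges are non-negative.
    inter = (er * ed).flatten(1).sum(dim=1)
    norm = er.flatten(1).norm(dim=1) * ed.flatten(1).norm(dim=1)
    return inter / (norm + eps)
```

Such a score could, in principle, be folded into the gate's input, so that poorly aligned depth maps are down-weighted explicitly rather than learned implicitly.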
In applied contexts, leveraging such an efficient model could have significant implications for mobile device integration, autonomous vehicles, and augmented reality systems where depth sensing is an integral part. Theoretically, this approach reaffirms the value of cross-modal interaction strategies and spatial reasoning in enhancing neural network performance.
Looking ahead, the efficacy of DQFM and its foundational design suggests opportunities to extend the approach to other domains that rely on richer sensory inputs, possibly influencing broader computer vision applications where the trade-off between efficiency and accuracy is critical.
In summary, the paper makes a tangible contribution to RGB-D salient object detection through the DQFM process, showcasing potential pathways for future research focused on enhancing cross-modal fusion techniques while maintaining computational parsimony.