
Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks (1907.06781v2)

Published 15 Jul 2019 in cs.CV

Abstract: The use of RGB-D information for salient object detection has been extensively explored in recent years. However, relatively few efforts have been put towards modeling salient object detection in real-world human activity scenes with RGBD. In this work, we fill the gap by making the following contributions to RGB-D salient object detection. (1) We carefully collect a new SIP (salient person) dataset, which consists of ~1K high-resolution images that cover diverse real-world scenes from various viewpoints, poses, occlusions, illuminations, and backgrounds. (2) We conduct a large-scale (and, so far, the most comprehensive) benchmark comparing contemporary methods, which has long been missing in the field and can serve as a baseline for future research. We systematically summarize 32 popular models and evaluate 18 parts of 32 models on seven datasets containing a total of about 97K images. (3) We propose a simple general architecture, called Deep Depth-Depurator Network (D3Net). It consists of a depth depurator unit (DDU) and a three-stream feature learning module (FLM), which performs low-quality depth map filtering and cross-modal feature learning respectively. These components form a nested structure and are elaborately designed to be learned jointly. D3Net exceeds the performance of any prior contenders across all five metrics under consideration, thus serving as a strong model to advance research in this field. We also demonstrate that D3Net can be used to efficiently extract salient object masks from real scenes, enabling effective background changing application with a speed of 65fps on a single GPU. All the saliency maps, our new SIP dataset, the D3Net model, and the evaluation tools are publicly available at https://github.com/DengPingFan/D3NetBenchmark.

Citations (477)

Summary

  • The paper introduces the SIP dataset and D³Net model, offering a novel framework for robust RGB-D salient object detection.
  • It conducts an extensive benchmark that summarizes 32 models and evaluates 18 of them on seven datasets comprising roughly 97,000 images, establishing a strong performance baseline.
  • D³Net achieves superior results at 65 fps through advanced cross-modal fusion, demonstrating its potential for real-time applications.

Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks

The paper "Rethinking RGB-D Salient Object Detection: Models, Data Sets, and Large-Scale Benchmarks" explores the domain of RGB-D data for salient object detection (SOD), primarily focusing on real-world human activity scenes. The authors offer a comprehensive investigation involving the introduction of a novel dataset, a new model, and an extensive benchmarking exercise.

Contributions

  1. SIP Dataset: The authors introduce the SIP (Salient Person) dataset consisting of approximately 1,000 high-resolution images. This dataset captures diverse real-world scenarios with variations in viewpoints, poses, occlusions, and lighting conditions. It is designed specifically to reflect mobile photography environments, which prominently feature human subjects.
  2. Benchmarking and Evaluation: A large-scale benchmark systematically summarizes 32 popular models and evaluates 18 of them on seven datasets, collectively comprising around 97,000 images. This benchmark provides a solid baseline for future research in the field.
  3. D³Net Architecture: The paper presents the Deep Depth-Depurator Network (D³Net), which integrates a depth depurator unit (DDU) and a three-stream feature learning module (FLM) to filter low-quality depth maps and facilitate cross-modal feature learning. The authors report that D³Net outperforms previous models across several key metrics.
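At a high level, the depurate-then-fuse idea can be sketched as follows. The quality heuristic and threshold below are illustrative assumptions, not the paper's learned DDU; the sketch only mirrors the nested structure in which an unreliable depth map causes a fallback from the RGB-D stream to the RGB-only stream.

```python
import numpy as np

def depth_quality(depth: np.ndarray) -> float:
    """Crude quality score in [0, 1] for a depth map normalized to [0, 1].

    Low-quality depth maps are often dominated by holes (zero values)
    or are near-uniform. This proxy is an illustrative assumption,
    not the paper's learned depth depurator unit (DDU).
    """
    hole_ratio = float(np.mean(depth == 0))      # fraction of missing measurements
    contrast = float(depth.std())                # near-flat maps score low
    return (1.0 - hole_ratio) * min(1.0, contrast / 0.1)

def depurated_fusion(pred_rgb: np.ndarray,
                     pred_rgbd: np.ndarray,
                     depth: np.ndarray,
                     tau: float = 0.5) -> np.ndarray:
    """Gate between the RGB-only and RGB-D prediction streams:
    if the depth map looks unreliable, fall back to the RGB stream."""
    if depth_quality(depth) >= tau:
        return pred_rgbd
    return pred_rgb
```

In the actual D³Net, the three streams (RGB, depth, and RGB-D) are learned jointly and the depurator is part of the network rather than a hand-set heuristic; the gating above is only a conceptual stand-in.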

Strong Results

D³Net demonstrates superior performance across five critical metrics, achieving a processing speed of 65 frames per second on a single GPU. These results underscore its potential practicality for real-time applications.
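Of the metrics typically used in RGB-D SOD benchmarks such as this one (S-measure, E-measure, F-measure, and mean absolute error), MAE is the simplest: the average pixel-wise difference between the predicted saliency map and the binary ground-truth mask. A minimal implementation:

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between a saliency map and its ground truth,
    both assumed normalized to [0, 1]. Lower is better."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    return float(np.mean(np.abs(pred - gt)))
```

For example, a prediction that is off by 0.5 on one of four pixels and by 1.0 on another scores (0.5 + 1.0) / 4 = 0.375.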

Implications and Future Directions

The contributions of this paper have significant practical implications, notably enhancing capabilities in intelligent photography and visual detection tasks in mobile devices, autonomous vehicles, and industrial robotics.

The SIP dataset, given its focus on realistic scenarios, also paves the way for more accurate and relevant modeling in the context of mobile imaging applications. Future research may explore advanced architectures and training strategies within the D³Net framework to further elevate performance benchmarks.

Limitations and Future Work

The SIP dataset, while carefully curated, is modest in size compared to large-scale RGB datasets, indicating a need for ongoing expansion. Additionally, while D³Net's three-subnetwork structure offers robust feature extraction, its memory footprint could be reduced for lightweight devices through techniques such as dimension reduction or more efficient network architectures.

Conclusion

This work significantly enhances our understanding and capability in RGB-D SOD, setting a new standard for evaluation and model performance. By bridging the gap between traditional approaches and the complex requirements of real-world scenarios, the authors provide a foundation for future developments in AI-driven image processing systems. The paper's comprehensive benchmarking initiative further catalyzes progress by offering the community a robust platform for evaluating novel models and methodologies.