- The paper presents DSCLRCN, which integrates CNN-based local feature extraction with a deep spatial LSTM (DSLSTM) to capture both local and global spatial context for saliency detection.
- It adds scene modulation via a DSCLSTM architecture that refines pixel-level saliency using high-level semantic cues.
- Experiments on the SALICON and MIT300 benchmarks demonstrate superior performance on the NSS and CC metrics.
Deep Spatial Contextual Long-term Recurrent Convolutional Networks for Saliency Detection
The paper "A Deep Spatial Contextual Long-term Recurrent Convolutional Network for Saliency Detection" introduces an advanced computational model known as DSCLRCN, aimed to enhance the prediction of human visual attention in natural scenes. The proposed DSCLRCN leverages the integration of traditional convolutional neural networks (CNNs) with deep spatial long short-term memory (DSLSTM) to model both local and global contexts.
Methodology and Model Design
The authors develop a novel end-to-end model composed of three primary components (a minimal code sketch follows the list):
- Local Feature Extraction: A pre-trained CNN extracts local image features such as color, shape, and object cues, which typically mark attention hotspots.
- Global Contextual Integration: A DSLSTM enriches the spatial context by mimicking the cortical lateral inhibition mechanism of human vision, propagating information across spatial locations in the image. Stacking spatial LSTMs (SLSTMs) forms the DSLSTM, allowing the model to capture long-range dependencies within the scene.
- Scene Modulation: The DSCLSTM architecture extends the DSLSTM with a scene feature vector that supplies high-level semantic context to the saliency inference, refining the per-pixel attention distribution.
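To make the pipeline concrete, here is a minimal PyTorch sketch of how the three components could fit together. It is an illustrative reconstruction, not the authors' released code: the backbone choice (VGG-16), layer sizes, the bidirectional row/column LSTM sweeps, and the pooled-feature stand-in for the paper's scene descriptor are all assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class DSCLRCNSketch(nn.Module):
    """Minimal sketch of the three DSCLRCN components described above.

    Hyperparameters and the scene-descriptor choice are illustrative
    assumptions, not the paper's exact configuration.
    """
    def __init__(self, feat_dim=512, hidden=128):
        super().__init__()
        # 1) Local feature extraction: a pre-trained CNN backbone
        #    (VGG-16 here; the final max-pool is dropped).
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(vgg.features.children())[:-1])
        # 2) Global context: spatial LSTMs sweep the feature map row-wise
        #    and column-wise; stacking such sweeps approximates a DSLSTM.
        self.row_lstm = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.col_lstm = nn.LSTM(2 * hidden, hidden, bidirectional=True, batch_first=True)
        # 3) Scene modulation: a global scene vector gates the contextual features.
        self.scene_fc = nn.Linear(feat_dim, 2 * hidden)
        self.readout = nn.Conv2d(2 * hidden, 1, kernel_size=1)

    def forward(self, x):
        f = self.backbone(x)                      # (B, C, H, W) local features
        b, c, h, w = f.shape
        # Row-wise sweep: treat each row as a sequence of W steps.
        rows = f.permute(0, 2, 3, 1).reshape(b * h, w, c)
        rows, _ = self.row_lstm(rows)
        d = rows.size(-1)
        # Column-wise sweep over the row-contextualized features.
        cols = rows.reshape(b, h, w, d).permute(0, 2, 1, 3).reshape(b * w, h, d)
        cols, _ = self.col_lstm(cols)
        ctx = cols.reshape(b, w, h, -1).permute(0, 3, 2, 1)  # (B, 2*hidden, H, W)
        # Scene feature: a pooled global descriptor stands in for the
        # paper's dedicated scene representation (an assumption here).
        scene = torch.sigmoid(self.scene_fc(f.mean(dim=(2, 3))))
        ctx = ctx * scene[:, :, None, None]       # multiplicative scene gating
        return torch.sigmoid(self.readout(ctx))   # (B, 1, H, W) saliency map
```

The multiplicative gating by the scene vector mirrors the summary's description of scene modulation: a single global descriptor rescales the contextual features before the per-pixel saliency readout.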
Experimental Evaluation
DSCLRCN was evaluated against contemporary saliency detection methods on the SALICON and MIT300 eye-fixation benchmarks. Performance was assessed with several metrics, including Normalized Scanpath Saliency (NSS) and the linear Correlation Coefficient (CC), which quantify how well predicted saliency maps agree with actual human fixations. DSCLRCN achieved superior scores across the key measures, reinforcing the model's efficacy in leveraging hierarchical and contextual information.
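NSS and CC have standard, well-established definitions: NSS averages the standardized saliency values at recorded fixation points, while CC is the Pearson correlation between the predicted map and the ground-truth fixation density map. The NumPy snippet below implements these textbook definitions; the toy inputs at the end are placeholders, not data from the paper.

```python
import numpy as np

def nss(sal_map, fixations):
    """Normalized Scanpath Saliency: mean of the standardized saliency
    map at human fixation locations (fixations is a binary map)."""
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return s[fixations.astype(bool)].mean()

def cc(sal_map, fix_density):
    """Linear Correlation Coefficient between the predicted saliency
    map and the ground-truth fixation density map."""
    a = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    b = (fix_density - fix_density.mean()) / (fix_density.std() + 1e-8)
    return (a * b).mean()

# Toy usage with a random prediction and a single fixation point.
pred = np.random.rand(60, 80)
fix = np.zeros((60, 80))
fix[30, 40] = 1
print(nss(pred, fix), cc(pred, np.random.rand(60, 80)))
```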
Contributions and Implications
The paper's notable contribution is demonstrating the benefit of combining local features with globally modulated spatial context, moving beyond the limitations of purely local feature-based techniques. The DSCLSTM's scene-modulation component highlights the role of high-level scene semantics, a relatively unexplored factor in earlier saliency models, and opens avenues for future research on contextual modulation in visual attention systems.
Future Outlook
The paper offers valuable insight into applying advanced recurrent structures such as the DSLSTM to vision tasks. The performance gain from scene context suggests that further refinements could exploit additional forms of context, such as temporal sequencing in video data. Moreover, exploring parallel or augmented approaches that combine vision with natural-language context could foster richer interactive systems.
In conclusion, this work positions DSCLRCN as a strong model that narrows the gap between computational predictions and human visual attention mechanisms, providing a useful tool for both theoretical investigation and practical applications in computer vision and artificial intelligence.