Unified Image and Video Saliency Modeling
The paper "Unified Image and Video Saliency Modeling" explores a novel approach to integrate image and video saliency prediction into a single framework. Traditional methods in visual saliency modeling often treat these tasks independently, despite their shared aim of predicting human visual attention. This paper identifies and addresses the domain shifts that hinder joint modeling and proposes a unified model, UNISAL, that achieves state-of-the-art performance on multiple datasets. This essay provides an overview of the methods and findings, and considers the implications for future research in visual saliency prediction.
Core Challenges and Methodology
The authors recognize a key challenge in unified saliency modeling: the domain shift between image and video data and among the various video datasets. These differences complicate the sharing of model parameters across domains, as features learned from one domain do not necessarily generalize well to another. To overcome this, the paper introduces several domain adaptation strategies: Domain-Adaptive Priors, Domain-Adaptive Fusion, Domain-Adaptive Smoothing, and Domain-Adaptive Batch Normalization. The model combines a MobileNet V2 encoder, a recurrent (Bypass-RNN) module whose recurrence can be skipped for static images, and a decoder that merges domain-specific and shared features.
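To make the domain-adaptive idea concrete, the following is a minimal PyTorch sketch of per-domain batch normalization: each dataset keeps its own normalization statistics while the surrounding convolutional weights stay shared. The class and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DomainAdaptiveBatchNorm(nn.Module):
    """Holds one BatchNorm2d per source domain so each dataset's feature
    statistics are normalized independently. Illustrative sketch only."""

    def __init__(self, num_features: int, num_domains: int):
        super().__init__()
        self.bns = nn.ModuleList(
            [nn.BatchNorm2d(num_features) for _ in range(num_domains)]
        )

    def forward(self, x: torch.Tensor, domain: int) -> torch.Tensor:
        # Route the shared features through the active domain's statistics.
        return self.bns[domain](x)


# Example: the same shared features normalized with two different domains' statistics.
dabn = DomainAdaptiveBatchNorm(num_features=64, num_domains=4)
features = torch.randn(2, 64, 24, 32)
out_video = dabn(features, domain=0)  # e.g., a video-dataset batch
out_image = dabn(features, domain=3)  # e.g., an image-dataset batch
```

The same routing-by-domain pattern applies, conceptually, to the priors, fusion, and smoothing components: a small domain-specific module is selected at each step while the bulk of the network remains shared.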
The proposed model is evaluated on several popular benchmarks: the video saliency datasets DHF1K, Hollywood-2, and UCF-Sports, as well as the image saliency datasets SALICON and MIT300. On the image tasks, the model performs on par with, or close to, the state of the art, demonstrating its applicability to static scenes.
Notable Results
UNISAL achieves state-of-the-art performance on the video saliency datasets using a single set of parameters. The evaluation highlights several advantages:
- Model Efficiency: Compared to existing methods, UNISAL maintains competitive accuracy while reducing model size by at least fivefold, which translates to faster runtime.
- Cross-Domain Training: Training UNISAL simultaneously on multiple datasets improves its performance on each individual dataset, illustrating the effectiveness of the domain adaptation techniques (see the training sketch after this list).
- Performance Metrics: UNISAL outperforms previous methods on several standard saliency metrics, such as AUC-Judd and NSS, indicating that the shared representation captures saliency cues reliably across domains.
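As a rough illustration of how such joint training can be organized, the sketch below interleaves batches from several saliency datasets and passes a domain index that selects the domain-adaptive modules. The loader setup, model signature, and loss are assumptions for illustration; the paper's actual training schedule may differ.

```python
import itertools
import torch

def train_epoch(model, loaders, optimizer, criterion):
    """One epoch of joint training over multiple saliency datasets.

    `loaders` maps a dataset name to its DataLoader; the enumeration order
    defines the domain index that routes each batch through the matching
    domain-adaptive modules. Illustrative sketch, not the authors' code.
    """
    model.train()
    # Cycle shorter loaders so every dataset contributes throughout the epoch.
    steps = max(len(dl) for dl in loaders.values())
    iters = {name: itertools.cycle(dl) for name, dl in loaders.items()}
    for _ in range(steps):
        for domain, (name, it) in enumerate(iters.items()):
            clips, fixation_maps = next(it)      # inputs and ground-truth saliency
            preds = model(clips, domain=domain)  # domain selects priors/BN/fusion
            loss = criterion(preds, fixation_maps)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

Alternating domains within each epoch, rather than training on one dataset at a time, keeps the shared parameters balanced across domains while the small domain-specific modules absorb dataset-specific differences.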
Implications and Future Directions
The findings suggest promising implications for computational efficiency and unified modeling of visual attention across modalities. The successful amalgamation of image and video saliency tasks into a single framework could inspire similar integrative approaches in other domains of computer vision, where data modalities have historically been considered separately.
Looking ahead, future investigations could explore extending this unified approach to more complex scenes and dynamic environments, potentially incorporating other sensory inputs. More sophisticated domain adaptation layers could further improve generalization, making this a compelling frontier for research toward more human-like visual understanding in machine learning.
This work, through its careful examination of domain shifts and successful integration of adaptive techniques, marks a significant step toward the comprehensive modeling of visual attention. Building on these findings may pave the way for more efficient real-time applications and advance the field of visual saliency prediction.