- The paper introduces a novel Ranking Attention Module (RAM) that selects and ranks similarity maps to boost segmentation performance.
- It uses a Siamese network encoder and end-to-end training to combine matching and propagation strategies for efficient real-time processing.
- Experimental results on the DAVIS datasets demonstrate strong accuracy, achieving an 85.5% J&F score without online learning and 87.1% with it.
Overview of RANet: Ranking Attention Network for Fast Video Object Segmentation
The paper presents an innovative approach to video object segmentation (VOS) with the Ranking Attention Network (RANet). This work addresses a significant limitation of existing state-of-the-art methods, which often require extensive computation due to online learning (OL). By combining aspects of both matching-based and propagation-based strategies, RANet offers a fast and efficient solution that excels in both speed and accuracy.
Technical Contributions
RANet leverages an encoder-decoder framework that learns pixel-level matching and segmentation directly. A key innovation in this architecture is the Ranking Attention Module (RAM), which ranks and selects similarity maps to enhance segmentation performance. This design mitigates the drawbacks of previous matching and propagation approaches, which commonly suffer from mismatching and drift, respectively. The pixel-level matching step is sketched below.
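To make the matching step concrete, the minimal PyTorch sketch below correlates every pixel of the annotated first frame with every pixel of the current frame, yielding one similarity map per template pixel. This is an illustrative simplification, not the paper's exact implementation; the function name and tensor shapes are assumptions for exposition.

```python
import torch.nn.functional as F

def pixel_similarity(template_feat, current_feat):
    """Correlate every template-pixel feature with the current frame's
    feature map, producing one similarity map per template pixel.

    template_feat: (C, H0, W0) encoder features of the annotated first frame
    current_feat:  (C, H,  W)  encoder features of the current frame
    returns:       (H0*W0, H, W) stack of similarity maps
    """
    C, H0, W0 = template_feat.shape
    # L2-normalize channel vectors so correlation equals cosine similarity
    template = F.normalize(template_feat.reshape(C, -1), dim=0)  # (C, H0*W0)
    current = F.normalize(current_feat, dim=0)                   # (C, H, W)
    # Treat each template pixel's feature vector as a 1x1 convolution kernel
    kernels = template.t().reshape(H0 * W0, C, 1, 1)
    sim = F.conv2d(current.unsqueeze(0), kernels)                # (1, H0*W0, H, W)
    return sim.squeeze(0)
```

The number of similarity maps grows with the template size, which is precisely the problem RAM addresses by ranking and selecting among them.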
Key features include:
- Siamese Network Encoder: Uses a Siamese network for efficient feature extraction from video frames, enabling pixel-level similarity computation between the annotated first frame and the current frame.
- Novel RAM Module: RAM scores, ranks, and selects the most informative similarity maps so that the decoder receives a fixed-size, ordered input regardless of the target object's size (see the sketch after this list).
- End-to-End Training: Unlike methods that rely heavily on online fine-tuning during inference, RANet is trained offline end-to-end, first on static images and then on video sequences, for semi-supervised VOS.
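The sketch below gives a concrete feel for the rank-and-select idea behind RAM, under stated assumptions: each similarity map receives an importance score (learned by a small network in the paper), the maps are sorted by score, and the top-k are kept so the decoder always sees a fixed-size tensor. Names and the padding strategy are illustrative.

```python
import torch

def ram_rank_and_select(sim_maps, scores, k=256):
    """Rank similarity maps by importance and keep the top-k in ranked
    order, giving the decoder a fixed-size, ordered input.

    sim_maps: (N, H, W) similarity maps; N varies with the template size
    scores:   (N,) importance score per map (learned in the paper)
    k:        fixed channel count expected by the decoder
    """
    N, H, W = sim_maps.shape
    order = torch.argsort(scores, descending=True)
    ranked = sim_maps[order]
    if N >= k:
        return ranked[:k]
    # Zero-pad when fewer than k maps are available
    pad = sim_maps.new_zeros(k - N, H, W)
    return torch.cat([ranked, pad], dim=0)
```

A crude stand-in for the scores, purely for experimentation, could be `scores = sim_maps.flatten(1).max(dim=1).values`; the paper instead learns the scoring so that the ranking itself is trained end-to-end.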
Experimental Results
The performance of RANet is validated on the DAVIS16 and DAVIS17 datasets. Empirical results show that RANet is highly efficient, processing frames at 30 FPS. It strikes a strong balance between processing speed and precision, outperforming several benchmarks on the J metric.
Quantitatively, RANet achieves a J&F score of 85.5% on DAVIS16 without OL, surpassing many existing models that rely heavily on OL for their accuracy. With OL, RANet improves further to 87.1%, showing its potential to exceed state-of-the-art OL-based VOS methods.
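For reference, the J metric quoted above is the region similarity (Jaccard index, i.e., intersection-over-union) between predicted and ground-truth masks, and J&F averages it with the boundary F-measure. A minimal sketch of J follows; the boundary term is more involved and omitted here.

```python
import torch

def jaccard(pred, gt):
    """Region similarity J: intersection-over-union of two binary masks."""
    pred, gt = pred.bool(), gt.bool()
    inter = (pred & gt).sum().item()
    union = (pred | gt).sum().item()
    return inter / union if union > 0 else 1.0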
Practical and Theoretical Implications
Practically, RANet’s ability to perform video object segmentation at real-time speed opens possibilities for applications in video editing, augmented reality, and autonomous systems that require quick processing times. The underlying RAM approach could be repurposed in a range of computer vision tasks where ranking attention within feature maps is essential.
Theoretically, RANet’s framework provides a blueprint for integrating temporal and similarity data in video streams efficiently, offering a potential roadmap for future research aimed at reducing computational overheads while maintaining accuracy.
Future Directions
Looking ahead, improving the interplay between matching and propagation methods could enhance the model's predictive capabilities in dynamic scenes. Additionally, applying the RAM strategy to other domains, such as stereo vision or real-time object tracking, could further validate its utility and inspire new neural network architectures.
The fast, precise video segmentation laid out in this paper represents a significant step forward, prompting continued exploration of neural networks optimized for high-speed applications without a trade-off in performance.