Reviving Iterative Training with Mask Guidance for Interactive Segmentation
Overview
In the paper "Reviving Iterative Training with Mask Guidance for Interactive Segmentation," Sofiiuk, Petrov, and Konushin revisit click-based interactive segmentation and develop a model characterized by its simplicity and efficiency. It relies on a single feedforward pass per click rather than computationally intensive inference-time optimization schemes, demonstrating that state-of-the-art performance can be achieved without that additional complexity. Because the segmentation mask from the previous step is fed back to the network as an extra input, the approach supports both segmenting an object from scratch and refining an externally provided mask, which is advantageous for practical applications such as mobile deployment and photo editing.
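As a rough illustration of this mask-guided, feedforward setup, the sketch below packs the RGB image, the encoded clicks, and the previous mask into a single network input; the channel layout, tensor names, and the commented usage are assumptions for illustration, not the authors' code.

```python
import torch


def build_network_input(image, pos_clicks, neg_clicks, prev_mask):
    """Assemble the input for one feedforward pass (illustrative sketch).

    image:      (B, 3, H, W) normalized RGB image
    pos_clicks: (B, 1, H, W) map of positive (foreground) clicks
    neg_clicks: (B, 1, H, W) map of negative (background) clicks
    prev_mask:  (B, 1, H, W) probability mask from the previous step;
                all zeros on the first interaction, or an external mask
                when one is supplied for refinement
    """
    return torch.cat([image, pos_clicks, neg_clicks, prev_mask], dim=1)


# Usage (hypothetical): each new click triggers a single forward pass;
# there is no inference-time optimization of inputs or weights.
# x = build_network_input(image, pos, neg, prev_mask)
# logits = model(x)                   # `model` is any segmentation network
# prev_mask = torch.sigmoid(logits)   # fed back at the next click
```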
Methodology and Key Findings
The researchers emphasize the importance of training-data selection, demonstrating that a combination of the COCO and LVIS datasets, which together provide diverse object categories and high-quality annotations, yields superior models. A further contribution is an iterative training procedure that feeds the segmentation mask from the previous step back into the network, improving both the stability and the accuracy of the model across challenging benchmark datasets.
Architectural Choices
The choice of backbone architecture is also examined. The authors compare DeepLabV3+ with HRNet+OCR and favor the latter because it maintains high-resolution feature representations throughout the network. To feed click information into the model, they introduce the Conv1S scheme, in which the encoded clicks and the previous mask are processed by a small separate convolutional branch and fused with the output of the backbone's first block; this outperforms the alternative DMF and Conv1E fusion schemes. The study further confirms that encoding clicks as fixed-radius disks is preferable to distance-transform encoding and contributes to more stable training.
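The following sketch illustrates these two points, disk-based click encoding and a Conv1S-style fusion branch. The disk radius, channel counts, strides, and layer composition are assumptions chosen for illustration rather than the paper's exact configuration.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F


def encode_clicks_as_disks(clicks, height, width, radius=5):
    """Rasterize (y, x) click coordinates into a binary map of fixed-radius
    disks. The radius here is an illustrative default."""
    yy, xx = np.mgrid[:height, :width]
    disk_map = np.zeros((height, width), dtype=np.float32)
    for cy, cx in clicks:
        disk_map[(yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2] = 1.0
    return disk_map


class Conv1SFusion(nn.Module):
    """Conv1S-style fusion (sketch): the auxiliary input (click maps plus the
    previous mask) passes through its own small convolutional branch, and the
    result is added element-wise to the output of the backbone's first block.
    Channel counts and strides are assumptions."""

    def __init__(self, aux_channels=3, backbone_channels=64):
        super().__init__()
        self.aux_branch = nn.Sequential(
            nn.Conv2d(aux_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(16, backbone_channels, kernel_size=3, padding=1),
        )

    def forward(self, backbone_features, aux_input):
        aux = self.aux_branch(aux_input)
        if aux.shape[-2:] != backbone_features.shape[-2:]:
            # Align spatial sizes in case the strides differ.
            aux = F.interpolate(aux, size=backbone_features.shape[-2:],
                                mode="bilinear", align_corners=False)
        return backbone_features + aux
```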
Iterative Sampling and Training
Iterative sampling simulates realistic user interaction during training: some clicks are placed according to the errors in the model's own predictions, and the resulting prediction is fed back as the previous mask, so the network learns to exploit its earlier output. This improves convergence and reliability, and notably, the iterative model does not degrade an already accurate mask when additional clicks are supplied, a common failure mode in interactive segmentation. A condensed sketch of this procedure follows.
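In the sketch below, the model's call signature, the click-placement rule (a random mislabeled pixel instead of the largest error region), and the number of simulated steps are simplifying assumptions made for illustration.

```python
import torch


def sample_correction_click(error_mask):
    """Pick one mislabeled pixel as the next simulated click (simplified:
    real protocols typically target the centre of the largest error region)."""
    ys, xs = torch.nonzero(error_mask, as_tuple=True)
    if ys.numel() == 0:
        return None
    i = torch.randint(ys.numel(), (1,)).item()
    return ys[i].item(), xs[i].item()


def iterative_training_step(model, image, gt_mask, pos_clicks, neg_clicks,
                            num_sim_steps=3):
    """Simulate a few interaction rounds for one example (shapes (1, C, H, W)),
    feeding the model's own prediction back as the previous mask, then run the
    final pass with gradients. `model(image, pos_clicks, neg_clicks, prev_mask)`
    is an assumed interface."""
    prev_mask = torch.zeros_like(gt_mask)
    for _ in range(num_sim_steps):
        with torch.no_grad():  # simulated interaction steps carry no gradients
            pred = torch.sigmoid(model(image, pos_clicks, neg_clicks, prev_mask))
        error = (pred > 0.5) != (gt_mask > 0.5)
        click = sample_correction_click(error[0, 0])
        if click is None:
            break
        y, x = click
        # A click on missed foreground is positive, otherwise negative.
        if gt_mask[0, 0, y, x] > 0.5:
            pos_clicks[0, 0, y, x] = 1.0
        else:
            neg_clicks[0, 0, y, x] = 1.0
        prev_mask = pred
    # Final forward pass with gradients, conditioned on the accumulated state.
    return model(image, pos_clicks, neg_clicks, prev_mask)
```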
Evaluation and Results
The model was evaluated with the Number of Clicks (NoC) metric, the average number of clicks required to reach a target IoU (commonly 85% or 90%), on the GrabCut, Berkeley, SBD, and DAVIS datasets. The proposed approach consistently outperformed previous state-of-the-art models, with particularly large improvements when the model was initialized with a mask from a prior step. Overall, the iterative approach produced robust segmentations, reaching a given accuracy with fewer corrective clicks.
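The sketch below shows how a NoC score can be computed for a single object. Here `predict_fn` and the simplified click-placement rule are illustrative assumptions; the standard evaluation protocol places each click near the centre of the largest error region.

```python
import numpy as np


def iou(pred, gt):
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 1.0


def noc_for_object(predict_fn, gt_mask, target_iou=0.90, max_clicks=20):
    """Number of simulated clicks needed to reach `target_iou` for one object.
    `predict_fn(clicks)` is an assumed interface returning a boolean mask;
    each click is (y, x, is_positive). Objects that never reach the target
    are typically counted with `max_clicks`."""
    clicks = []
    pred = np.zeros_like(gt_mask, dtype=bool)
    for n in range(1, max_clicks + 1):
        error = pred != gt_mask
        ys, xs = np.nonzero(error)
        if ys.size == 0:  # prediction already matches the ground truth
            return n - 1
        # Simplified placement: take the first mislabeled pixel.
        y, x = ys[0], xs[0]
        clicks.append((y, x, bool(gt_mask[y, x])))
        pred = predict_fn(clicks)
        if iou(pred, gt_mask) >= target_iou:
            return n
    return max_clicks


# Dataset-level NoC@90 is the mean of noc_for_object over all test objects.
```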
Implications and Future Directions
This paper provides important insights for the field of interactive segmentation, particularly for developing efficient algorithms suited to resource-constrained environments such as mobile devices. By leveraging iterative training with mask guidance and high-quality training data, the proposed method not only improves segmentation accuracy but also lets users supply and refine an existing mask interactively.
Future work could extend these methodologies to further refine user input mechanisms or incorporate additional modalities such as textual guidance. Moreover, exploring alternative loss functions and optimization techniques could broaden the method's applicability across varied segmentation tasks.
In summary, this paper highlights the significance of a methodical approach to architecture design and dataset utilization in interactive segmentation, demonstrating how a well-conceived feedforward model can offer both simplicity and state-of-the-art performance.