SOLO: Segmenting Objects by Locations (1912.04488v3)

Published 10 Dec 2019 in cs.CV

Abstract: We present a new, embarrassingly simple approach to instance segmentation in images. Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that have made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy as used by Mask R-CNN, or predict category masks first then use clustering techniques to group pixels into individual instances. We view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location and size, thus nicely converting instance mask segmentation into a classification-solvable problem. Now instance segmentation is decomposed into two classification tasks. We demonstrate a much simpler and flexible instance segmentation framework with strong performance, achieving on par accuracy with Mask R-CNN and outperforming recent single-shot instance segmenters in accuracy. We hope that this very simple and strong framework can serve as a baseline for many instance-level recognition tasks besides instance segmentation.

Authors

Xinlong Wang
Tao Kong
Chunhua Shen
Yuning Jiang
Lei Li

Citations (628)

Summary

The paper "SOLO: Segmenting Objects by Locations" presents a single-stage approach to instance segmentation that predicts instance masks directly and is considerably simpler than existing pipelines. It departs from the two dominant paradigms: "detect-then-segment" methods such as Mask R-CNN, and bottom-up methods that predict per-pixel category masks or embeddings and then group pixels into instances.

Overview

SOLO introduces an end-to-end approach in which instance segmentation is framed as a location-based classification task. The core idea is to assign every pixel inside an instance an "instance category" determined by that instance's center location and size, so that mask prediction becomes a classification problem that a fully convolutional network (FCN) can solve. Objects are thus separated by where they are and how large they are, without relying on bounding boxes or extensive post-processing.
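To make the notion of a location category concrete, the following minimal sketch maps an object's center to a cell of an S×S grid and to a flat index k = row · S + col. The function name `location_category`, its parameters, and the row-major indexing are illustrative assumptions, not the authors' code; the paper's exact assignment rule (for example, how cells near the center are treated) may differ.

```python
def location_category(cx, cy, img_w, img_h, S):
    """Return (row, col, k) for the grid cell containing the center (cx, cy).

    Hypothetical helper: the image is split into S x S equal cells and the
    cell index is flattened row-major, k = row * S + col.
    """
    # Scale the center into grid units; clamp so a center on the far
    # border still lands in the last cell.
    col = min(int(cx / img_w * S), S - 1)
    row = min(int(cy / img_h * S), S - 1)
    return row, col, row * S + col


if __name__ == "__main__":
    # A center at (300, 100) in a 640 x 480 image with a 5 x 5 grid.
    print(location_category(300, 100, 640, 480, 5))
```

Two objects whose centers fall in different cells receive different indices, which is what lets a classifier tell the instances apart by location alone.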

Methodology

The SOLO framework divides the image into a grid of S×S cells, each of which is a candidate location for an object center. If an object's center falls into a cell, that cell is responsible for predicting the object's semantic category and its instance mask. A feature pyramid network (FPN) handles the variability in object sizes by assigning objects of different sizes to different pyramid levels. Because standard convolutions are largely insensitive to absolute position, the approach uses a CoordConv layer, which feeds pixel coordinates to the network as additional input channels, to make the mask prediction position-aware.
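The CoordConv idea can be sketched in a few lines of plain Python: build two extra channels holding each pixel's x and y coordinate and append them to the feature maps before the next convolution. The helper names `coord_channels` and `add_coords`, the nested-list representation, and the normalization of coordinates to [-1, 1] are assumptions made for illustration; a real implementation would operate on framework tensors.

```python
def coord_channels(height, width):
    """Build x and y coordinate maps with values normalized to [-1, 1]."""
    def norm(idx, n):
        # A single row or column has no extent, so place it at 0.
        return 0.0 if n == 1 else 2.0 * idx / (n - 1) - 1.0

    xs = [[norm(c, width) for c in range(width)] for _ in range(height)]
    ys = [[norm(r, height) for _ in range(width)] for r in range(height)]
    return xs, ys


def add_coords(features):
    """Append the two coordinate maps to a list of H x W channel maps."""
    h, w = len(features[0]), len(features[0][0])
    xs, ys = coord_channels(h, w)
    return features + [xs, ys]


if __name__ == "__main__":
    feats = [[[0.0] * 5 for _ in range(3)] for _ in range(4)]  # 4 channels, 3 x 5
    out = add_coords(feats)
    print(len(out))      # channel count grows by two
    print(out[-2][0])    # x coordinates along the first row
```

The layer that consumes these features then sees C + 2 input channels, so its filters can condition on absolute position.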

A decoupled variant, "Decoupled SOLO," factorizes the mask prediction along the two grid axes (horizontal and vertical) instead of predicting one output per grid cell. This reduces redundant computation while maintaining accuracy.
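The saving from decoupling can be illustrated with a small sketch. It assumes that the decoupled head predicts S maps for the horizontal axis and S maps for the vertical axis, and that the mask for cell (row, col) is the element-wise product of the corresponding two maps after a sigmoid. The function names and the nested-list representation are hypothetical and chosen only for this example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def output_channels(S):
    """Mask-branch output maps: one per cell versus one per axis position."""
    return {"vanilla": S * S, "decoupled": 2 * S}


def decoupled_mask(x_maps, y_maps, row, col):
    """Combine column map x_maps[col] and row map y_maps[row] into one mask.

    Both inputs hold raw logits as H x W nested lists; the result is the
    element-wise product of their sigmoid activations.
    """
    x, y = x_maps[col], y_maps[row]
    return [[sigmoid(a) * sigmoid(b) for a, b in zip(xr, yr)]
            for xr, yr in zip(x, y)]


if __name__ == "__main__":
    print(output_channels(40))
    zeros = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(3)]  # 3 maps of 2 x 2 logits
    print(decoupled_mask(zeros, zeros, row=1, col=2))
```

Under this assumption the number of predicted maps grows linearly rather than quadratically with the grid size S.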

Experimental Results

The framework demonstrates competitive performance on the challenging MS COCO dataset, achieving a mask AP of 37.8% with the ResNet-101 backbone. The decoupled variant of SOLO achieves even higher accuracy, with an AP of 40.5% using a ResNet-101 with deformable convolutions. These results surpass many existing one-stage and even several two-stage instance segmentation methods, showcasing SOLO's efficacy.

Implications

SOLO's main appeal is its simplicity. Because it needs neither bounding boxes nor complex post-processing, it avoids the computational overhead that detect-then-segment pipelines incur. It also shows potential for real-time applications, with variants of the model achieving inference speeds of up to 22.5 FPS.

Future Directions

The paper suggests several areas for future exploration. Possible enhancements include leveraging advances in semantic segmentation and modeling spatial relationships more explicitly to improve accuracy. Additionally, the ability of SOLO to generalize to tasks beyond instance segmentation, such as instance contour detection, hints at broader applications in object recognition and scene understanding.

In conclusion, SOLO provides a compelling alternative to existing instance segmentation approaches, combining simplicity with strong performance. As the authors hope, it could serve as a baseline for many instance-level recognition tasks beyond instance segmentation.

Related Papers

Mask R-CNN (2017)
SOLO: A Simple Framework for Instance Segmentation (2021)
Mask Encoding for Single Shot Instance Segmentation (2020)
BlendMask: Top-Down Meets Bottom-Up for Instance Segmentation (2020)
PolarMask: Single Shot Instance Segmentation with Polar Representation (2019)
