
Mask R-CNN (1703.06870v3)

Published 20 Mar 2017 in cs.CV

Abstract: We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition. Code has been made available at: https://github.com/facebookresearch/Detectron

The paper introduces Mask R-CNN, a conceptually straightforward and adaptable framework for object instance segmentation. The approach builds on Faster R-CNN by adding a branch that predicts object masks in parallel with the existing bounding-box recognition branch. Mask R-CNN is simple to train, adds only a small overhead over Faster R-CNN, and runs at 5 fps. Its architecture also generalizes to other tasks, such as human pose estimation, within the same framework.

The Mask R-CNN framework addresses instance segmentation by combining object detection and semantic segmentation. It extends the Faster R-CNN framework by adding a branch that predicts segmentation masks for each RoI. This mask branch is implemented as a small FCN applied to each RoI, which predicts segmentation masks in a pixel-to-pixel manner. The method is designed to be easy to implement and train, while adding minimal computational overhead, thus maintaining a fast processing speed.

A critical component of Mask R-CNN is the RoIAlign layer, which addresses the misalignment caused by RoIPool in Faster R-CNN. RoIPool performs spatial quantization for feature extraction, leading to misalignment between network inputs and outputs. RoIAlign, in contrast, avoids quantization, thus preserving exact spatial locations. This leads to significant improvements in mask accuracy, ranging from 10% to 50% relative improvement, particularly under strict localization metrics. Furthermore, the framework decouples mask and class prediction by predicting a binary mask for each class independently, leveraging the RoI classification branch for category prediction.
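The contrast between RoIPool's coordinate quantization and RoIAlign's exact sampling can be sketched in a few lines of numpy. This is a minimal, illustrative version (one sampling point per bin on a single-channel feature map; the paper samples four points per bin, and the function names here are not from the released code):

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feat (H, W) at continuous coordinates (y, x) by bilinear
    interpolation of the four neighboring feature values."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) +
            feat[y0, x1] * (1 - dy) * dx +
            feat[y1, x0] * dy * (1 - dx) +
            feat[y1, x1] * dy * dx)

def roi_align(feat, roi, out_size=2):
    """RoIAlign sketch: divide the RoI into out_size x out_size bins and
    average bilinearly sampled values at each bin center. Unlike RoIPool,
    the RoI boundaries and bin edges are never rounded to the feature grid."""
    y1, x1, y2, x2 = roi  # continuous coordinates on the feature map
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cy = y1 + (i + 0.5) * bin_h  # exact bin center, no quantization
            cx = x1 + (j + 0.5) * bin_w
            out[i, j] = bilinear_sample(feat, cy, cx)
    return out
```

Because the sampling points are continuous, gradients with respect to the four neighboring feature values are well defined, which is what keeps the layer trainable end to end while preserving exact spatial locations.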

The method achieves state-of-the-art results on the COCO instance segmentation task, surpassing existing single-model results, including those from the COCO 2016 challenge. It also performs well on the COCO object detection task. Ablation experiments demonstrate the robustness of the method and analyze the effects of core factors. The models can operate at approximately 200ms per frame on a GPU, with training on COCO taking one to two days on a single 8-GPU machine.

The paper showcases the generality of the Mask R-CNN framework through human pose estimation on the COCO keypoint dataset. By treating each keypoint as a one-hot binary mask, Mask R-CNN is adapted to detect instance-specific poses with minimal modification, outperforming the winner of the 2016 COCO keypoint competition while running at 5 fps. This demonstrates Mask R-CNN's potential as a flexible framework for instance-level recognition, capable of being extended to more complex tasks.
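The keypoint encoding can be illustrated with a short numpy sketch, assuming an m × m output resolution (the paper uses m = 56 for keypoints; the function names here are illustrative). Each keypoint type gets its own one-hot target, and training minimizes a softmax cross-entropy over the m^2 locations:

```python
import numpy as np

def keypoint_to_onehot_mask(kp_y, kp_x, m=56):
    """Encode one keypoint as an m x m one-hot binary mask: a single
    foreground pixel at the keypoint location. Training treats the m^2
    locations as classes under a softmax cross-entropy against this target."""
    mask = np.zeros((m, m), dtype=np.float32)
    mask[kp_y, kp_x] = 1.0
    return mask

def decode_keypoint(logits):
    """Recover the predicted keypoint location as the argmax over
    the m^2 positions of the predicted map."""
    m = logits.shape[0]
    idx = int(np.argmax(logits))
    return divmod(idx, m)  # (row, col)
```

The only change relative to the segmentation branch is this target encoding and the softmax over locations; the rest of the architecture is reused, which is what "minimal modification" refers to.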

The Mask R-CNN framework builds upon the Faster R-CNN detector, which consists of a Region Proposal Network (RPN) that proposes candidate object bounding boxes and a second stage (Fast R-CNN) that extracts features using RoIPool for classification and bounding-box regression. Mask R-CNN adopts the same two-stage procedure, with an identical first stage (the RPN). In the second stage, Mask R-CNN outputs a binary mask for each RoI in parallel with predicting the class and box offset.
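The two-stage structure with three parallel per-RoI outputs can be summarized with stub functions. This is a shape-level sketch under assumed sizes (random stand-ins for real network outputs; none of these names come from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

def rpn(image_features, n_proposals=4):
    """Stage 1 (stub): propose candidate boxes (y1, x1, y2, x2).
    A real RPN predicts objectness scores and box deltas from the
    shared convolutional feature map."""
    return rng.uniform(0, 10, size=(n_proposals, 4))

def roi_heads(roi_feature, num_classes=3, m=28):
    """Stage 2 (stub): for one RoI, three outputs are produced in
    parallel -- class scores, per-class box offsets, and K binary
    mask logits of resolution m x m (here K = num_classes)."""
    cls_scores = rng.standard_normal(num_classes)
    box_deltas = rng.standard_normal((num_classes, 4))
    mask_logits = rng.standard_normal((num_classes, m, m))
    return cls_scores, box_deltas, mask_logits
```

The key structural point is that the mask branch does not feed the class or box predictions; all three heads read the same RoI feature, so the mask adds only a small overhead to the detector.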

Formally, during training, a multi-task loss is defined on each sampled RoI as:

L = L_{cls} + L_{box} + L_{mask}

where:

  • L_{cls} is the classification loss
  • L_{box} is the bounding-box loss
  • L_{mask} is the mask loss

The classification loss L_{cls} and bounding-box loss L_{box} are identical to those defined in previous work. The mask branch has a Km^2-dimensional output for each RoI, encoding K binary masks of resolution m × m, one for each of the K classes. A per-pixel sigmoid is applied, and L_{mask} is defined as the average binary cross-entropy loss. For an RoI associated with ground-truth class k, L_{mask} is only defined on the k-th mask.
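The mask loss just described can be written directly in numpy. This is a minimal single-RoI sketch (the function name is illustrative), showing both the per-pixel sigmoid and the restriction to the ground-truth class's mask:

```python
import numpy as np

def mask_loss(mask_logits, gt_class, gt_mask):
    """L_mask for one RoI: average binary cross-entropy over the m x m
    pixels of the mask for the ground-truth class only; the other K-1
    masks contribute nothing to the loss.
    mask_logits: (K, m, m) raw outputs; gt_mask: (m, m) in {0, 1}."""
    logits = mask_logits[gt_class]       # select only the k-th mask
    p = 1.0 / (1.0 + np.exp(-logits))    # per-pixel sigmoid
    eps = 1e-7                            # numerical safety for log
    bce = -(gt_mask * np.log(p + eps) +
            (1.0 - gt_mask) * np.log(1.0 - p + eps))
    return bce.mean()
```

Because each class's mask is scored by an independent sigmoid rather than a softmax across classes, masks do not compete with each other; class selection is left entirely to the classification branch.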

The network architecture comprises a convolutional backbone for feature extraction and a network head for bounding-box recognition and mask prediction. Backbones are denoted using the nomenclature network-depth-features, with evaluations conducted on ResNet and ResNeXt networks of depth 50 or 101 layers. The original Faster R-CNN with ResNets extracted features from the final convolutional layer of the fourth stage (C4). An alternative backbone is the Feature Pyramid Network (FPN), which builds an in-network feature pyramid from a single-scale input. For the network head, the Faster R-CNN box heads from the ResNet and FPN papers are extended with a fully convolutional mask prediction branch.

The Mask R-CNN framework was compared against instance segmentation methods such as MNC and FCIS on the COCO dataset. The models were trained using the union of 80k train images and a 35k subset of val images (trainval35k), and ablations were reported on the remaining 5k val images (minival). The standard COCO metrics were reported, including AP, AP_{50}, AP_{75}, and AP at different scales (AP_S, AP_M, AP_L).
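These metrics all rest on mask-level intersection-over-union; a minimal numpy sketch (illustrative only, not the official COCO evaluator):

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union between two binary masks. COCO's AP averages
    precision over IoU thresholds 0.50:0.05:0.95, while AP_50 and AP_75 fix
    the matching threshold at 0.5 and 0.75 respectively."""
    a, b = a.astype(bool), b.astype(bool)
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0
```

Evaluating on masks rather than boxes is what makes AP_{75} a strict localization metric, and it is exactly where RoIAlign's exact sampling shows the largest gains.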

Ablation studies were conducted to evaluate the impact of different components of Mask R-CNN. These studies examined the effects of backbone architecture, the use of multinomial versus independent masks, and the RoIAlign layer. The results indicated that deeper networks, FPN, and ResNeXt backbones led to improved performance. Decoupling mask and class prediction via per-class binary masks (sigmoid) gave gains over multinomial masks (softmax). The RoIAlign layer improved AP by approximately 3 points and AP_{75} by approximately 5 points.

Further experiments were performed on the Cityscapes dataset, which contains fine annotations for 2975 train, 500 val, and 1525 test images, as well as 20k coarse training images without instance annotations. The instance segmentation task involves 8 object categories. The Mask R-CNN models were applied with the ResNet-50-FPN backbone. The models were trained with the image scale (shorter side) randomly sampled from [800, 1024], and inference was performed on a single scale of 1024 pixels. The results were compared to state-of-the-art methods on the val and test sets.

Authors (4)
  1. Kaiming He (71 papers)
  2. Georgia Gkioxari (39 papers)
  3. Piotr Dollár (49 papers)
  4. Ross Girshick (75 papers)
Citations (25,305)