RetinaMask: Learning to predict masks improves state-of-the-art single-shot detection for free (1901.03353v1)
Abstract: Recently two-stage detectors have surged ahead of single-shot detectors in the accuracy-vs-speed trade-off. Nevertheless single-shot detectors are immensely popular in embedded vision applications. This paper brings single-shot detectors up to the same level as current two-stage techniques. We do this by improving training for the state-of-the-art single-shot detector, RetinaNet, in three ways: integrating instance mask prediction for the first time, making the loss function adaptive and more stable, and including additional hard examples in training. We call the resulting augmented network RetinaMask. The detection component of RetinaMask has the same computational cost as the original RetinaNet, but is more accurate. COCO test-dev results are up to 41.4 mAP for RetinaMask-101 vs 39.1 mAP for RetinaNet-101, while the runtime is the same during evaluation. Adding Group Normalization increases the performance of RetinaMask-101 to 41.7 mAP. Code is at: https://github.com/chengyangfu/retinamask
Summary
- The paper introduces RetinaMask with an added instance mask prediction head that elevates single-shot detection to match two-stage techniques.
- It employs a novel self-adjusting Smooth L1 loss and a best matching policy to improve training robustness and include more positive examples.
- Experiments on the COCO dataset demonstrate improved mAP performance (e.g., 41.4 mAP vs. 39.1 mAP for RetinaNet-101) without increasing inference time.
The paper introduces RetinaMask, an augmented single-shot detector network that improves training for state-of-the-art single-shot detectors such as RetinaNet. The network incorporates instance mask prediction, an adaptive loss function, and additional hard examples during training to improve the accuracy of single-shot detectors, bringing them to the level of current two-stage techniques.
The three contributions of this paper are:
- A novel instance mask prediction head is added to the single-shot RetinaNet detector during training.
- A new self-adjusting loss function that improves robustness during training.
- Including more of the positive examples in training, even those with low overlap.
The improved accuracy of single-shot detectors benefits embedded applications that require speed under limited computational resources. The detection component of RetinaMask has the same computational cost as the original RetinaNet but is more accurate. On the COCO test-dev dataset, RetinaMask-101 achieves 41.4 mAP versus 39.1 mAP for RetinaNet-101, while the runtime is the same during evaluation. Adding Group Normalization increases the performance of RetinaMask-101 to 41.7 mAP.
Model
The paper starts with the RetinaNet settings in Detectron and rebuilds the model in PyTorch to form the baseline. Two modifications are then introduced to the baseline settings: the best matching policy and a modified bounding box regression loss. Finally, the paper describes how to add the mask prediction module on top of RetinaNet.
Best Matching Policy
In the bounding box matching stage, the RetinaNet policy is as follows: all anchor boxes with an intersection-over-union (IoU) overlap greater than 0.5 with a ground truth object are considered positive examples. If the overlap is less than 0.4, the anchor boxes are assigned a negative label. Anchors whose overlap falls between 0.4 and 0.5 are not used in training. The paper proposes to relax the IoU threshold by additionally assigning, for each ground truth box, its best-matching anchor as a positive example, even when their overlap falls below 0.5.
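A minimal sketch of this relaxed matching, assuming an IoU matrix between anchors and ground truth boxes has already been computed (the function and variable names are illustrative, not from the released code):

```python
import torch

def assign_anchors(iou, pos_thresh=0.5, neg_thresh=0.4):
    """Label anchors: matched gt index (>= 0), -1 for negative, -2 for ignored.

    iou: tensor of shape (num_anchors, num_gt) with IoU overlaps.
    """
    max_iou, matched_gt = iou.max(dim=1)               # best gt for each anchor
    labels = torch.full((iou.size(0),), -2, dtype=torch.long)
    labels[max_iou < neg_thresh] = -1                  # negatives: IoU < 0.4
    pos = max_iou >= pos_thresh                        # positives: IoU >= 0.5
    labels[pos] = matched_gt[pos]

    # Best matching policy: every ground truth box also claims its
    # best-overlapping anchor as a positive, regardless of the threshold.
    best_anchor_per_gt = iou.argmax(dim=0)             # best anchor for each gt
    labels[best_anchor_per_gt] = torch.arange(iou.size(1))
    return labels
```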
Self-Adjusting Smooth L1 Loss
The Smooth L1 loss splits the positive axis into two parts: an L2 loss is used for targets in the range $[0, \beta]$, and an L1 loss is used beyond $\beta$ to avoid over-penalizing outliers. The control point $\beta$ is chosen heuristically via hyperparameter search.
$$f(x) = \begin{cases} \dfrac{0.5\,x^2}{\beta}, & \text{if } |x| < \beta \\ |x| - 0.5\,\beta, & \text{otherwise} \end{cases}$$
Where:
- $x$ is the input value
- $\beta$ is the control point that splits the loss function
The paper proposes an improved version of Smooth L1 called the Self-Adjusting Smooth L1 loss. Inside the loss function, the running mean and variance of the absolute loss are recorded and updated from the minibatch statistics with momentum $m = 0.9$. These two parameters are then used to calculate the control point: specifically, the control point is set to the difference between the running mean and the running variance ($\mu_R - \sigma_R^2$), clipped to the range $[0, \hat{\beta}]$.
$$\mu_B = \frac{1}{n}\sum_{i=1}^{n} |x_i|, \qquad \sigma_B^2 = \frac{1}{n}\sum_{i=1}^{n} \left(|x_i| - \mu_B\right)^2$$
$$\mu_R = \mu_R \cdot m + \mu_B \cdot (1 - m)$$
$$\sigma_R^2 = \sigma_R^2 \cdot m + \sigma_B^2 \cdot (1 - m)$$
$$\beta = \max\left(0, \min\left(\hat{\beta}, \mu_R - \sigma_R^2\right)\right)$$
Where:
- $\mu_B$ is the minibatch mean of the absolute loss
- $n$ is the number of elements in the minibatch
- $x_i$ is the i-th input in the minibatch
- $\sigma_B^2$ is the minibatch variance of the absolute loss
- $\mu_R$ is the running mean
- $m$ is the momentum
- $\sigma_R^2$ is the running variance
- $\beta$ is the control point for the Self-Adjusting Smooth L1 loss
- $\hat{\beta}$ is the upper bound used for clipping the control point
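The update rule above maps directly to code. Below is a minimal PyTorch sketch under the stated momentum and clipping scheme; the class name, argument names, and the sum reduction of the final loss are assumptions of this sketch:

```python
import torch

class SelfAdjustingSmoothL1:
    """Smooth L1 whose control point beta adapts to the running statistics
    of the absolute regression error, clipped to [0, beta_hat]."""

    def __init__(self, beta_hat=0.11, momentum=0.9):
        self.beta_hat = beta_hat
        self.momentum = momentum
        self.running_mean = 0.0
        self.running_var = 0.0

    def __call__(self, pred, target):
        diff = torch.abs(pred - target)

        # Minibatch statistics of the absolute loss (mu_B, sigma_B^2).
        batch_mean = diff.mean().item()
        batch_var = diff.var(unbiased=False).item()

        # Running statistics with momentum m (mu_R, sigma_R^2).
        m = self.momentum
        self.running_mean = self.running_mean * m + batch_mean * (1 - m)
        self.running_var = self.running_var * m + batch_var * (1 - m)

        # Control point: beta = max(0, min(beta_hat, mu_R - sigma_R^2)).
        beta = max(0.0, min(self.beta_hat, self.running_mean - self.running_var))

        if beta < 1e-6:
            loss = diff                      # degenerate case: plain L1
        else:
            loss = torch.where(diff < beta,
                               0.5 * diff ** 2 / beta,
                               diff - 0.5 * beta)
        return loss.sum()
```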
Mask Prediction Module
To add the mask prediction module, single-shot detection predictions are treated as mask proposals. After running RetinaNet for bounding box predictions, the top N scored predictions are extracted. Then, these mask proposals are distributed to sample features from the appropriate layers of the Feature Pyramid Network (FPN) according to the following equation:
$$k = \left\lfloor k_0 + \log_2\left(\sqrt{wh} / 224\right) \right\rfloor$$
Where:
- $k$ is the index of the feature pyramid level $P_k$ to sample from for predicting the instance mask
- $k_0 = 4$
- $w, h$ are the width and height of the detection
In the final model, the paper uses the {P3, P4, P5, P6, P7 } feature layers for bounding box predictions and {P3, P4, P5} feature layers for mask prediction.
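As an illustration, the level-assignment equation can be written as a small helper; the function name and the clamping to the mask-prediction levels {P3, P4, P5} are assumptions of this sketch:

```python
import math

def assign_fpn_level(w, h, k0=4, k_min=3, k_max=5):
    """Map a detection of size (w, h) to the FPN level P_k used for its mask,
    clamped to the mask-prediction levels {P3, P4, P5}."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224))
    return max(k_min, min(k_max, k))

# A 224x224 detection maps to P4; a 448x448 detection maps to P5.
assert assign_fpn_level(224, 224) == 4
assert assign_fpn_level(448, 448) == 5
```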
The bounding box classification head consists of 4 convolutional layers (conv3x3(256) + ReLU) followed by 1 convolution (conv3x3(number of anchors * number of classes)) with point-wise sigmoid nonlinearities. For bounding box regression, the paper adopts the class-agnostic setting and runs 4 convolutional layers (conv3x3(256) + ReLU) plus 1 output layer (conv3x3(number of anchors * 4)) to refine the anchors. Once the bounding boxes are predicted, they are aggregated and distributed to the Feature Pyramid layers. The ROI-Align operation is performed at the assigned feature layers, yielding 14x14 resolution features, which are fed into 4 consecutive convolutional layers (conv3x3), followed by a single transposed convolutional layer (convtranspose2d 2x2) that upsamples the map to 28x28 resolution. Finally, a prediction convolutional layer (conv1x1) is applied to predict class-specific masks.
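A minimal PyTorch sketch of such a mask head, assuming 256-channel FPN features and COCO's 80 classes (module and argument names are illustrative, not from the released code):

```python
import torch.nn as nn

class MaskHead(nn.Module):
    """14x14 ROI-Aligned features -> 28x28 class-specific mask logits."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        layers, channels = [], in_channels
        for _ in range(4):                               # 4 x conv3x3(256) + ReLU
            layers += [nn.Conv2d(channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
            channels = 256
        self.convs = nn.Sequential(*layers)
        self.upsample = nn.ConvTranspose2d(256, 256, 2, stride=2)  # 14x14 -> 28x28
        self.relu = nn.ReLU(inplace=True)
        self.predict = nn.Conv2d(256, num_classes, 1)    # class-specific masks

    def forward(self, x):                                # x: (N, 256, 14, 14)
        x = self.convs(x)
        x = self.relu(self.upsample(x))
        return self.predict(x)                           # (N, num_classes, 28, 28)
```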
Training
To train RetinaNet, the paper follows the settings of the original work. Images are resized so that the shorter side is 800 pixels, while limiting the longer side to 1333 pixels.
The paper uses a batch size of 16 images, weight decay $10^{-4}$, momentum $0.9$, and trains for 90k iterations with a base learning rate of $0.01$, dropped to $0.001$ and $0.0001$ at iterations 60k and 80k. To train with the Mask Prediction Module, the number of training iterations is extended by a factor of 1.5x, or 2x when multi-scale training is used. Multi-scale training uses scales {640, 800, 1200}.
The anchor boxes span 5 scales and 9 combinations per location (3 aspect ratios [0.5, 1, 2] and 3 sizes [$2^0$, $2^{1/3}$, $2^{2/3}$]). The base anchor sizes range from $32^2$ to $512^2$ on Feature Pyramid levels P3 to P7. Each anchor box is matched to no more than one ground truth bounding box. Anchors whose intersection-over-union overlap with a ground truth box is larger than 0.5 are considered positive examples, and anchors with overlap less than 0.4 are treated as negative examples. Then, the proposed best matching policy is applied, which can only add positive examples.
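For illustration, the 9 anchor shapes per pyramid location can be enumerated as below; the exact parameterization (aspect ratio taken as height/width) is an assumption of this sketch:

```python
import itertools
import math

def anchor_shapes(base_size, ratios=(0.5, 1.0, 2.0),
                  scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3))):
    """Enumerate the 9 anchor (width, height) pairs for one pyramid level.

    base_size is 32 for P3 up to 512 for P7; each anchor covers an area of
    (base_size * scale)^2 at the given aspect ratio (taken here as h / w).
    """
    shapes = []
    for ratio, scale in itertools.product(ratios, scales):
        area = (base_size * scale) ** 2
        w = math.sqrt(area / ratio)
        h = w * ratio
        shapes.append((w, h))
    return shapes

# 9 anchors per location on P3 (base size 32) through P7 (base size 512).
for base in (32, 64, 128, 256, 512):
    print(base, anchor_shapes(base))
```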
For the Focal Loss used in classification, $\alpha = 0.25$ and $\gamma = 2.0$ are set, and the prediction logits are initialized according to a $\mathcal{N}(0, 0.01)$ distribution.
$$FL = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$$
Where:
- $FL$ is the focal loss
- $\alpha_t$ is a weighting factor
- $p_t$ is the predicted probability for the class
- $\gamma$ is a focusing parameter
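A minimal sketch of this sigmoid focal loss in PyTorch (the sum reduction and the 0/1 float target layout are assumptions):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss; logits and 0/1 float targets share the same shape."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # prob of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()
```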
For bounding box regression, the paper adds the proposed Self-Adjusting Smooth L1 loss and limits the control point to the range $[0, 0.11]$ ($\hat{\beta} = 0.11$).
For each image during training, the paper also runs non-maximum suppression and top-100 selection of the predicted boxes (the same processing that single-shot detectors apply during inference). Then, ground truth boxes are added to the proposal set, and the mask prediction module is run. The final loss function is the sum of three losses: $\mathrm{Loss}_{\mathrm{boxCls}} + \mathrm{Loss}_{\mathrm{boxReg}} + \mathrm{Loss}_{\mathrm{mask}}$.
Inference
During bounding box inference, a confidence threshold of $0.05$ is used to filter out predictions with low confidence. Then, the top 1000 scoring boxes are selected from each prediction layer. Non-maximum suppression with a threshold of $0.4$ is applied for each class separately. Finally, the top-100 scoring predictions are selected per image. For mask inference, the top 50 bounding box predictions are used as mask proposals.
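Putting these steps together, a sketch of the per-image post-processing might look as follows, using torchvision's batched_nms for the per-class suppression (function and argument names are illustrative, not from the released code):

```python
import torch
from torchvision.ops import batched_nms

def postprocess(per_level_boxes, per_level_scores, per_level_labels,
                score_thresh=0.05, pre_nms_top_n=1000, nms_thresh=0.4,
                detections_per_img=100, mask_proposals=50):
    """Per-image box post-processing; each argument is a list of tensors,
    one per FPN prediction layer."""
    boxes, scores, labels = [], [], []
    for b, s, l in zip(per_level_boxes, per_level_scores, per_level_labels):
        keep = s > score_thresh                           # confidence threshold 0.05
        b, s, l = b[keep], s[keep], l[keep]
        s, idx = s.topk(min(pre_nms_top_n, s.numel()))    # top 1000 per layer
        boxes.append(b[idx]); scores.append(s); labels.append(l[idx])
    boxes = torch.cat(boxes); scores = torch.cat(scores); labels = torch.cat(labels)

    keep = batched_nms(boxes, scores, labels, nms_thresh)  # per-class NMS at 0.4
    keep = keep[:detections_per_img]                       # top 100 per image
    boxes, scores, labels = boxes[keep], scores[keep], labels[keep]

    # The highest-scoring 50 detections are reused as mask proposals.
    return (boxes, scores, labels), boxes[:mask_proposals]
```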
Experiments
The paper uses the COCO dataset, training on the COCO trainval35k split and evaluating on minival (the remaining 5k images of the 40k-image 2014 val set).
Ablation Study
The effectiveness of using the Best Matching Policy for all ground truth objects is tested. According to the experiments, using best-matching anchors with any positive overlap to ground truth gives the best performance. Results are also reported for Smooth L1 with fixed control points (1.0 and 0.11) and for the Self-Adjusting Smooth L1 loss; the Self-Adjusting loss with $\hat{\beta} = 0.11$ gives the best results on every metric.
Multi-task training with instance segmentation also improves bounding box accuracy: training with mask prediction on {P3, P4, P5} yields a 0.7 mAP improvement, which grows to 0.9 mAP with the 1.5x training schedule.
Comparison to RetinaNet
The model evaluated in this section incorporates all three components: Best Matching policy, Self-Adjusting Smooth L1, and Mask Prediction head. ResNet-50 is used as the backbone architecture, and images are resized to a shorter side of 800 pixels. No data augmentation is used.
The per-class difference of the mean Average Precision is analyzed, showing improvement in most of the classes.
The model is compared to RetinaNet across different backbone networks and input resolutions, and shows better accuracy for every combination of backbone and resolution.
Comparisons to the state-of-the-art methods
The paper uses ResNet-50-FPN, ResNet-101-FPN, and ResNeXt32x8d-101-FPN as the backbones in the final models and trains with the multi-scale {640, 800, 1200} and 2x iteration schedule. For the ResNet-101-FPN model, the paper also trains a version using Group Normalization (GN), applied only on the extra layers (FPN, localization, and classification). Using ResNeXt32x8d-101-FPN as the backbone further improves results by 0.9 mAP, reaching 42.6 mAP on COCO.
Comparison with Mask R-CNN on instance mask prediction
The mask (instance segmentation) results are compared to Mask R-CNN. All results use ResNet-101 with a Feature Pyramid Network as the backbone model. The models are trained in a very similar fashion to the +e2e training. Mask R-CNN still shows better accuracy on mask prediction, but the difference is only around 1.2 mAP.
Related Papers
- Focal Loss for Dense Object Detection (2017)
- MimicDet: Bridging the Gap Between One-Stage and Two-Stage Object Detection (2020)
- Light-Head R-CNN: In Defense of Two-Stage Object Detector (2017)
- DR Loss: Improving Object Detection by Distributional Ranking (2019)
- Consistent Optimization for Single-Shot Object Detection (2019)
GitHub
- GitHub - chengyangfu/retinamask: RetinaMask (339 stars)