CornerNet: Detecting Objects as Paired Keypoints (1808.01244v2)

Published 3 Aug 2018 in cs.CV

Abstract: We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolution neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.2% AP on MS COCO, outperforming all existing one-stage detectors.

Citations (3,388)

Summary

The paper introduces an anchor-free, one-stage detector that pairs top-left and bottom-right keypoints to localize objects without using anchor boxes.
It employs a novel corner pooling layer to enhance the detection of keypoints by capturing boundary features, significantly boosting precision.
Evaluations on the MS COCO dataset show that CornerNet achieves 42.2% AP and is particularly strong at high IoU thresholds, which reward accurate object localization.

Overview of CornerNet: Detecting Objects as Paired Keypoints

The paper "CornerNet: Detecting Objects as Paired Keypoints" by Hei Law and Jia Deng proposes a novel one-stage object detection framework that eschews the traditional use of anchor boxes. Instead, this approach detects objects by identifying and grouping pairs of keypoints, specifically the top-left and bottom-right corners of bounding boxes. The keypoints are detected using a single convolutional neural network (CNN), which is enhanced with a new architectural module called corner pooling.

Introduction

State-of-the-art object detectors commonly rely on anchor boxes: boxes of various sizes and aspect ratios placed densely over the image and used as detection candidates. This design has two significant drawbacks. First, anchor boxes introduce many hyperparameters and design choices (how many boxes, what sizes, what aspect ratios), which complicates design and training. Second, a detector needs a very large set of anchor boxes to cover the ground truth well, yet only a small fraction of them overlap with ground truth objects, which creates a substantial imbalance between positive and negative samples.

Methodology

CornerNet instead detects each object as a pair of keypoints: the top-left and bottom-right corners of its bounding box. This formulation circumvents the need for anchor boxes and so avoids their drawbacks. The detection process involves several components:

Heatmaps for Corner Detection: CornerNet predicts two sets of heatmaps, one for top-left corners and one for bottom-right corners. Each set has one channel per object category, and each channel scores every location as a possible corner of that category.
Embeddings for Grouping Corners: The network also predicts an embedding vector for each detected corner. Corners that belong to the same object have similar embeddings, enabling the network to group them correctly.
Corner Pooling: To improve the localization of corners, a new corner pooling layer is introduced. A bounding-box corner often lies outside the object itself, so there may be no local visual evidence at its location. For a top-left corner, the layer max-pools one feature map along the row to the right of each location (looking for the object's topmost boundary) and a second feature map down the column below it (looking for the leftmost boundary), then adds the two results. The bottom-right corner uses the mirrored directions.
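The corner pooling step for the top-left corner can be sketched in a few lines. This is a minimal illustration on plain Python lists rather than the paper's implementation; the function name `top_left_corner_pool` and the input names `f_vert` and `f_horiz` are hypothetical, and the toy feature maps are made up.

```python
def top_left_corner_pool(f_vert, f_horiz):
    """Sketch of top-left corner pooling on two H x W maps (lists of lists).

    f_vert is max-pooled down each column, f_horiz along each row to the
    right; the two pooled maps are then added element-wise.
    """
    h, w = len(f_vert), len(f_vert[0])
    t = [row[:] for row in f_vert]
    l = [row[:] for row in f_horiz]
    # Scan bottom-to-top so t[i][j] is the max of f_vert[i..h-1][j].
    for i in range(h - 2, -1, -1):
        for j in range(w):
            t[i][j] = max(t[i][j], t[i + 1][j])
    # Scan right-to-left so l[i][j] is the max of f_horiz[i][j..w-1].
    for i in range(h):
        for j in range(w - 2, -1, -1):
            l[i][j] = max(l[i][j], l[i][j + 1])
    return [[t[i][j] + l[i][j] for j in range(w)] for i in range(h)]


# Toy example: the box corner is at row 1, column 1, but the only evidence
# is a left-edge response further down column 1 and a top-edge response
# further right along row 1.
f_vert = [[0.0] * 4 for _ in range(4)]
f_horiz = [[0.0] * 4 for _ in range(4)]
f_vert[3][1] = 1.0   # left boundary seen below the corner
f_horiz[1][3] = 1.0  # top boundary seen to the right of the corner
pooled = top_left_corner_pool(f_vert, f_horiz)
```

In this toy case the pooled map peaks at the corner location even though neither input map has any response there, which is the effect the layer is meant to produce.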

Implementation

The backbone of CornerNet is a modified hourglass network. The hourglass network captures features at multiple scales through a series of downsampling and upsampling operations, which include skip connections to incorporate fine details back into the upsampled features. The resulting feature maps are then fed into prediction modules that apply the corner pooling layers to detect keypoints.

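In the training phase, the network optimizes a loss that combines three parts: a detection loss on the corner heatmaps, an offset loss that corrects the precision lost when corner locations are mapped to the lower-resolution heatmaps, and associative embedding losses that pull together the embeddings of corners from the same object and push apart those from different objects. Training is fully end-to-end, with intermediate supervision to improve convergence.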
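The associative embedding terms can be illustrated with a small "pull" and "push" computation. This is a sketch under stated assumptions, not the paper's training code: embeddings are taken to be scalars, the margin of 1.0 is an illustrative default, and the function name `pull_push_loss` is hypothetical.

```python
def pull_push_loss(tl_emb, br_emb, margin=1.0):
    """Sketch of pull/push losses for N objects with scalar embeddings.

    tl_emb[k] and br_emb[k] are the embeddings of the top-left and
    bottom-right corners of object k. The margin is an assumed default.
    """
    n = len(tl_emb)
    # Reference embedding per object: the mean of its two corner embeddings.
    means = [(t + b) / 2 for t, b in zip(tl_emb, br_emb)]
    # Pull: each corner embedding should be close to its object's mean.
    pull = sum((t - m) ** 2 + (b - m) ** 2
               for t, b, m in zip(tl_emb, br_emb, means)) / n
    # Push: means of different objects should be at least `margin` apart.
    push = 0.0
    if n > 1:
        for k in range(n):
            for j in range(n):
                if j != k:
                    push += max(0.0, margin - abs(means[k] - means[j]))
        push /= n * (n - 1)
    return pull, push


# Well-separated, consistent embeddings incur no loss.
good = pull_push_loss([0.0, 2.0], [0.0, 2.0])
# Swapped embeddings: corners disagree within each object, objects collide.
bad = pull_push_loss([0.0, 1.0], [1.0, 0.0])
```

Only the distances between embeddings matter here, not their absolute values, so the network is free to choose any values that keep corners of the same object close and different objects apart.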

Results

CornerNet is evaluated extensively on the MS COCO dataset, demonstrating superior performance over existing one-stage detectors.

Quantitative Performance: CornerNet achieves an average precision (AP) of 42.2% on the MS COCO test-dev set, surpassing all prior one-stage detectors. It is also competitive with state-of-the-art two-stage detectors.
High IoU Evaluation: CornerNet does especially well at high IoU thresholds, where a detection counts as correct only if its box overlaps the ground truth tightly. Strong results under these stricter criteria indicate well-localized bounding boxes.
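To make the role of the IoU threshold concrete, the sketch below computes intersection over union for two axis-aligned boxes. The `(x1, y1, x2, y2)` box format, the function name `iou`, and the example coordinates are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap region, clamped at zero.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (1.0, 1.0, 11.0, 11.0)  # shifted by one unit in x and y
overlap = iou(ground_truth, prediction)
matches_at_050 = overlap >= 0.5   # lenient threshold
matches_at_075 = overlap >= 0.75  # strict threshold
```

A one-unit shift on a 10 by 10 box still passes the 0.5 threshold but fails at 0.75, so small localization errors are penalized only under the stricter metric.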

Ablation Studies

The ablation studies highlight the significance of the novel corner pooling layer and the associative embeddings. Removing the corner pooling layer leads to a notable drop in performance, underscoring its crucial role. Furthermore, the use of embeddings to group corner keypoints effectively prevents incorrect groupings, enabling robust object detection.
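How embeddings guide grouping can be sketched as a simple pairing rule: a top-left and a bottom-right corner form a box only if the geometry is valid and their embeddings are close. This is an illustrative sketch, not the released code; the tuple layout `(x, y, score, embedding)`, the function name `group_corners`, the `max_dist` value, and the sample corners are all assumptions, and a full decoder would also require both corners to share a category.

```python
def group_corners(top_lefts, bottom_rights, max_dist=0.5):
    """Pair corners into boxes using scalar embeddings.

    Each corner is (x, y, score, embedding). Returns boxes as
    (x1, y1, x2, y2, score), sorted by descending score.
    """
    boxes = []
    for tx, ty, ts, te in top_lefts:
        for bx, by, bs, be in bottom_rights:
            # The bottom-right corner must lie below and to the right.
            if bx <= tx or by <= ty:
                continue
            # Reject pairs whose embeddings are too far apart.
            if abs(te - be) > max_dist:
                continue
            # Score the box by the average of its two corner scores.
            boxes.append((tx, ty, bx, by, (ts + bs) / 2))
    return sorted(boxes, key=lambda box: box[4], reverse=True)


top_lefts = [(10, 10, 0.9, 0.1), (50, 50, 0.8, 2.0)]
bottom_rights = [(40, 40, 0.7, 0.15), (90, 90, 0.6, 2.1)]
boxes = group_corners(top_lefts, bottom_rights)
coords = [box[:4] for box in boxes]
```

Without the embedding check, the first top-left corner would also pair with the far bottom-right corner and produce a spurious box spanning both objects.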

Implications and Future Work

The introduction of CornerNet has important implications for the development of efficient and accurate object detectors. By eliminating the dependency on anchor boxes, CornerNet simplifies the detection pipeline and reduces the number of design heuristics. This approach also highlights the potential of associating keypoints to improve detection accuracy, suggesting further exploration in keypoint-based detection frameworks for various applications.

In summary, CornerNet presents a robust and innovative approach to object detection by detecting and grouping keypoints. Its strong performance on the challenging MS COCO dataset marks a significant step forward in developing more efficient and effective object detectors. Future work could extend the principles of CornerNet to other vision tasks, including instance segmentation and human pose estimation, leveraging the advantages of keypoint detection methodologies.

GitHub

GitHub - princeton-vl/CornerNet (2,370 stars)