Overview of M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network
The paper "M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network" introduces an advanced methodology aimed at improving object detection, particularly in the context of different scales and complex appearances. The research presents the Multi-Level Feature Pyramid Network (MLFPN) integrated into a one-stage object detector called M2Det, achieving notable performance enhancements over existing methods.
Key Contributions
The paper discusses several innovative aspects of the proposed MLFPN and its implementation in M2Det:
- Multi-Level Feature Fusion: The authors propose fusing multiple levels of features extracted by a backbone network such as VGG-16 or ResNet-101. This fusion is handled by the first Feature Fusion Module (FFMv1), which combines a deep and a shallow backbone feature map into a single, highly informative base feature (a minimal sketch of such a fusion module appears after this list).
- Thinned U-shape Modules (TUM): The MLFPN employs TUMs, thinned versions of the classic encoder-decoder (U-shape) structure, which generate multi-scale features at a lower computational cost. Multiple TUMs are stacked, alternating with FFMv2, which fuses the base feature with the largest-scale output of the previous TUM so that each level builds on the one before it (see the stacking sketch after this list).
- Scale-wise Feature Aggregation Module (SFAM): SFAM aggregates the multi-level, multi-scale features by concatenating same-scale maps across levels and then applying a channel-wise attention mechanism based on the SE block. This adaptive reweighting lets the network emphasize the feature levels most relevant to each object (an SE-style sketch closes the code examples below).
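To make the fusion step concrete, here is a minimal PyTorch sketch of an FFMv1-style module. The channel sizes, layer choices, and tap points are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFMv1(nn.Module):
    """Fuse a shallow and a deep backbone feature map into one base feature."""
    def __init__(self, shallow_ch=512, deep_ch=1024,
                 out_shallow=256, out_deep=512):
        super().__init__()
        self.reduce_shallow = nn.Conv2d(shallow_ch, out_shallow, 3, padding=1)
        self.reduce_deep = nn.Conv2d(deep_ch, out_deep, 3, padding=1)

    def forward(self, shallow, deep):
        shallow = F.relu(self.reduce_shallow(shallow))
        deep = F.relu(self.reduce_deep(deep))
        # Upsample the deeper (lower-resolution) map to the shallow map's size.
        deep = F.interpolate(deep, size=shallow.shape[-2:], mode='nearest')
        # Concatenate along channels to form the multi-level base feature.
        return torch.cat([shallow, deep], dim=1)
```

For a VGG-16 backbone, the two inputs would be a shallow and a deep feature map from different stages of the network; the specific widths above are assumptions chosen for illustration.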
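The TUM and the FFMv2-style stacking can be sketched the same way. The channel widths, the number of scales and levels, and the exact fusion layers below are again assumptions, not the authors' precise layer configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TUM(nn.Module):
    """Thinned U-shape module: a stride-2 encoder followed by an
    upsample-and-sum decoder, emitting one feature map per scale."""
    def __init__(self, ch=256, num_scales=6):
        super().__init__()
        self.down = nn.ModuleList(nn.Conv2d(ch, ch, 3, stride=2, padding=1)
                                  for _ in range(num_scales - 1))
        self.smooth = nn.ModuleList(nn.Conv2d(ch, ch, 1)
                                    for _ in range(num_scales))

    def forward(self, x):
        feats = [x]
        for down in self.down:                       # encoder path
            feats.append(F.relu(down(feats[-1])))
        outs = [feats[-1]]
        for skip in reversed(feats[:-1]):            # decoder: upsample + sum
            up = F.interpolate(outs[-1], size=skip.shape[-2:], mode='nearest')
            outs.append(up + skip)
        outs = list(reversed(outs))                  # largest resolution first
        return [s(f) for s, f in zip(self.smooth, outs)]

class StackedTUMs(nn.Module):
    """Alternate FFMv2-style fusion with TUMs, collecting each level's outputs."""
    def __init__(self, base_ch=768, ch=256, num_tums=8):
        super().__init__()
        self.first = nn.Conv2d(base_ch, ch, 1)       # project the base feature
        self.fuse = nn.ModuleList(nn.Conv2d(base_ch + ch, ch, 1)
                                  for _ in range(num_tums - 1))
        self.tums = nn.ModuleList(TUM(ch) for _ in range(num_tums))

    def forward(self, base):
        outs = [self.tums[0](F.relu(self.first(base)))]
        for fuse, tum in zip(self.fuse, self.tums[1:]):
            # FFMv2-style step: concat base with the previous TUM's largest map.
            x = torch.cat([base, outs[-1][0]], dim=1)
            outs.append(tum(F.relu(fuse(x))))
        return outs                                  # num_tums levels of scales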
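Finally, a minimal SE-style SFAM sketch: it concatenates the same-scale outputs of every TUM level and reweights the stacked channels, so the levels most useful for a given image get amplified. Channel counts and the reduction ratio are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SFAM(nn.Module):
    """Concatenate same-scale maps from every TUM level, then reweight
    channels with an SE-style squeeze-and-excitation block."""
    def __init__(self, ch=256, num_levels=8, reduction=16):
        super().__init__()
        agg = ch * num_levels
        self.attend = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze
            nn.Conv2d(agg, agg // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(agg // reduction, agg, kernel_size=1),
            nn.Sigmoid())                                  # excitation weights

    def forward(self, same_scale_feats):
        # same_scale_feats: one map per level, all with the same spatial size.
        x = torch.cat(same_scale_feats, dim=1)
        return x * self.attend(x)
```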
Performance Evaluation
M2Det is evaluated through extensive experiments on the MS COCO benchmark, which showcase its strong performance:
- Single-scale Inference:
With single-scale inference, M2Det (VGG-16 backbone, 800×800 input) achieves an Average Precision (AP) of 41.0 on MS COCO test-dev while running at 11.8 FPS.
- Multi-scale Inference:
With multi-scale inference, the same model reaches an AP of 44.2. These results surpass state-of-the-art one-stage detectors such as DSSD and RetinaNet (a generic sketch of one multi-scale testing scheme appears below).
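The summary above does not spell out the multi-scale testing procedure, so the sketch below shows one common scheme under assumed interfaces: the `detector` callable, its (boxes, scores, labels) return format, and the chosen scales are all hypothetical, and M2Det's actual test-time augmentation may differ.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import batched_nms

def multiscale_inference(detector, image, scales=(320, 512, 800), iou_thr=0.5):
    """Run `detector` at several input resolutions and merge all detections.

    `detector(img)` is an assumed interface returning (boxes, scores, labels)
    with boxes as (x1, y1, x2, y2) in the resized image's coordinates.
    """
    h, w = image.shape[-2:]
    all_boxes, all_scores, all_labels = [], [], []
    for s in scales:
        resized = F.interpolate(image[None], size=(s, s),
                                mode='bilinear', align_corners=False)[0]
        boxes, scores, labels = detector(resized)
        # Map boxes back to the original image's coordinate frame.
        boxes = boxes * boxes.new_tensor([w / s, h / s, w / s, h / s])
        all_boxes.append(boxes)
        all_scores.append(scores)
        all_labels.append(labels)
    boxes = torch.cat(all_boxes)
    scores = torch.cat(all_scores)
    labels = torch.cat(all_labels)
    # Class-aware NMS removes duplicates produced by overlapping scales.
    keep = batched_nms(boxes, scores, labels, iou_thr)
    return boxes[keep], scores[keep], labels[keep]
```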
Detailed Results
The paper provides comprehensive numerical results demonstrating the consistent advantage of M2Det:
- At a 320×320 input size, M2Det with a VGG-16 backbone achieves an AP of 33.5 (single-scale) and 38.9 (multi-scale).
- With a deeper ResNet-101 backbone at the same input size, AP rises to 34.3 (single-scale) and 39.7 (multi-scale).
- At larger input sizes (e.g., 512×512 and 800×800), M2Det continues to lead, reaching APs of up to 44.2 with multi-scale testing.
Practical and Theoretical Implications
Practically, M2Det offers significant improvements in object detection accuracy, especially in scenarios requiring real-time processing. Its architecture robustly handles objects of varying scale and appearance, which is pivotal for applications such as autonomous driving, surveillance, and other vision-dependent systems.
Theoretically, the introduction of MLFPN and its constituent modules (TUM, FFMv1, FFMv2, and SFAM) provides a structured approach to feature extraction and fusion. This design philosophy can influence future convolutional network architectures, fostering further research into multi-scale feature integration and efficient module design.
Future Directions
The promising results from M2Det pave the way for further advancements in object detection:
- Expanding Backbone Diversity: While the research tested VGG-16 and ResNet-101, exploring other architectures such as EfficientNet or hybrid designs could yield further improvements.
- Real-time Applications: Enhancements in computational efficiency, possibly through hardware acceleration or optimized algorithms, can make M2Det even more suitable for real-time applications.
- Extended Visual Recognition Tasks: Applying MLFPN and M2Det principles to tasks such as instance segmentation or keypoint detection could carry the benefits of multi-level features beyond bounding-box detection.
In conclusion, the paper makes a significant contribution to object detection through M2Det and the MLFPN. The thorough evaluations and consistent performance gains underscore the effectiveness of the approach, offering a valuable paradigm for future research and applications in computer vision.