- The paper introduces YOLOF, a novel object detector that simplifies detection by using a single-level feature instead of multi-scale fusion.
- It leverages a dilated encoder and uniform matching to simulate a multi-level representation and balance positive anchor assignments.
- Extensive experiments on COCO show that YOLOF attains 44.3 mAP at 60 fps, matching FPN-based detectors in accuracy while outperforming them in speed and efficiency.
Overview of "You Only Look One-level Feature" (YOLOF)
The paper "You Only Look One-level Feature" (YOLOF) presents a novel approach to object detection that simplifies the detection process by using a single feature level. This research reevaluates the conventional utility of Feature Pyramid Networks (FPN) in object detection, challenging the prevailing assumption that multi-scale feature fusion is crucial for the effectiveness of one-stage detectors.
Key Contributions
The authors identify two benefits commonly attributed to FPN: multi-scale feature fusion and a divide-and-conquer strategy that splits the detection task across object scales. Through controlled experiments that decouple the two, they show that the divide-and-conquer design contributes far more to FPN's success than feature fusion does. This insight motivates YOLOF, which keeps only a single-level feature for detection.
Methodology
YOLOF introduces two key components to close the performance gap that would otherwise result from using a single feature level:
- Dilated Encoder: By stacking residual blocks with dilated convolutions, the encoder enlarges the receptive field of the single feature map so that it covers a broad range of object scales, approximating the effect of a multi-level representation (a minimal sketch follows this list).
- Uniform Matching: This mechanism resolves the imbalance of positive anchors caused by sparse anchoring in single-level settings by matching each ground-truth box with the same number of nearest positive anchors, regardless of object size (see the second sketch below).
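To make the dilated-encoder idea concrete, here is a minimal PyTorch sketch: a projector compresses the backbone's C5 feature, then stacked residual blocks with increasing dilation rates widen the receptive field. The overall structure follows the paper's high-level description, but the channel sizes, normalization choices, and dilation rates here are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    """Bottleneck residual block whose 3x3 conv uses a given dilation."""
    def __init__(self, channels: int, dilation: int, mid_channels: int = 128):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid_channels, kernel_size=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            # The dilated 3x3 conv is what enlarges the receptive field.
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The residual connection preserves features for small objects
        # while the dilated path adds context for large ones.
        return x + self.block(x)

class DilatedEncoder(nn.Module):
    """Project the single C5 feature, then stack dilated residual blocks."""
    def __init__(self, in_channels: int = 2048, out_channels: int = 512,
                 dilations=(2, 4, 6, 8)):  # rates assumed for illustration
        super().__init__()
        self.projector = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        self.blocks = nn.Sequential(
            *[DilatedResidualBlock(out_channels, d) for d in dilations]
        )

    def forward(self, c5):
        return self.blocks(self.projector(c5))

# Usage: encode the backbone's single C5 feature map.
feat = DilatedEncoder()(torch.randn(1, 2048, 25, 38))  # -> (1, 512, 25, 38)
```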
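Similarly, the core of uniform matching can be sketched in a few lines: each ground-truth box takes its k nearest anchors (by center distance) as positives, so every object gets the same number of positive anchors no matter its size. The paper additionally filters matches by IoU thresholds; that refinement, along with the value of k and the tensor shapes below, is omitted or assumed for brevity.

```python
import torch

def uniform_match(anchor_centers: torch.Tensor,  # (A, 2) anchor x,y centers
                  gt_centers: torch.Tensor,      # (G, 2) ground-truth centers
                  k: int = 4) -> torch.Tensor:
    """Return a (G, k) tensor of anchor indices matched to each GT box."""
    # Pairwise L2 distances between every GT center and every anchor center.
    dists = torch.cdist(gt_centers, anchor_centers)  # (G, A)
    # Each ground-truth box takes its k closest anchors as positives,
    # giving a uniform number of positives per object.
    _, topk_idx = dists.topk(k, dim=1, largest=False)
    return topk_idx

# Usage: 100 anchors, 3 ground-truth boxes, 4 positives each.
anchors = torch.rand(100, 2) * 640
gts = torch.rand(3, 2) * 640
pos = uniform_match(anchors, gts)  # shape (3, 4)
```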
Experimental Results
The paper's extensive experiments on the COCO benchmark show that YOLOF matches the accuracy of its FPN-based counterpart RetinaNet while being 2.5 times faster, and reaches DETR's performance with 7 times fewer training epochs. With a 608x608 input, YOLOF achieves 44.3 mAP running at 60 fps, 13% faster than YOLOv4.
Implications
These findings suggest that the divide-and-conquer strategy, rather than feature fusion, is central to FPN's benefit in object detection. By demonstrating that a single-level feature can suffice, YOLOF points the way toward more streamlined and efficient detectors, and it may influence future designs that favor faster, simpler architectures over heavy multi-scale fusion.
Future Directions
The paper suggests that incorporating anchor-free mechanisms could mitigate some of YOLOF's remaining detection errors. Furthermore, combining YOLOF with other advanced techniques may yield even more efficient object detection systems.
In summary, YOLOF challenges conventional paradigms in object detection by simplifying the detection process to a single-level feature without significant loss in performance. This research provides a valuable perspective on optimizing detection architectures and introduces a robust baseline for future work in the domain.