Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection
The paper "Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection," authored by Xiang Li et al., addresses the shortcomings in the prevalent methods used for dense object detection, specifically focusing on the quality estimation, classification, and localization components. The authors identify two primary issues with existing practices: the inconsistent use of quality estimation and classification between training and testing phases, and the limitations of Dirac delta distribution for localization in scenarios involving ambiguity and uncertainty.
Problem Statement
The paper reviews the prevailing design of dense (one-stage) detectors: classification is optimized with Focal Loss, bounding box regression implicitly assumes a Dirac delta distribution, and a separate branch estimates localization quality (e.g., IoU or centerness). Because the quality branch is trained only on positive samples yet its score is multiplied with the classification score for every candidate during NMS ranking, training and inference are inconsistent; and the rigid Dirac delta representation cannot express the uncertainty that arises from occlusion and ambiguous boundaries. Together, these issues lead to less reliable predictions.
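A minimal sketch of this conventional pipeline makes the mismatch concrete. The tensor names, shapes, and reductions below are illustrative assumptions for the example, not the paper's code:

```python
import torch
import torch.nn.functional as F

# Conventional dense-detector setup: classification and quality branches are
# trained separately, then their scores are multiplied only at inference time.
num_anchors, num_classes = 8, 80
cls_logits = torch.randn(num_anchors, num_classes)     # classification branch
quality_logits = torch.randn(num_anchors, 1)           # IoU/centerness branch
cls_targets = torch.zeros(num_anchors, num_classes)    # hard one-hot {0, 1} labels
iou_targets = torch.rand(num_anchors, 1)               # IoU with ground truth

# Training: two independent objectives (in practice the quality branch is
# supervised only on positive samples; omitted here for brevity).
cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
quality_loss = F.binary_cross_entropy_with_logits(quality_logits, iou_targets)

# Inference: scores are fused multiplicatively for NMS ranking -- a combination
# never supervised during training, which is the inconsistency the paper targets.
nms_score = cls_logits.sigmoid() * quality_logits.sigmoid()
```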
Contributions
To mitigate these issues, the authors propose enhancements in bounding box representation and the optimization process that facilitate more reliable and accurate object detection. Specifically, they introduce:
- Joint Representation: Integrating localization quality and classification into a unified vector, which remains consistent across training and testing phases.
- General Localization Distribution: Modeling bounding box locations with a flexible, discretized distribution (a vector of bin probabilities) instead of a rigid Dirac delta distribution.
- Generalized Focal Loss (GFL): Extending Focal Loss from discrete to continuous labels for effective optimization of these new representations.
Key Methods
Joint Representation
The joint representation merges class prediction and localization quality into a single vector: for a positive sample, the score of the ground-truth class is supervised with the IoU between the predicted box and its ground-truth box rather than with a hard 0/1 label. The classification score therefore directly reflects localization quality, removing the train/test inconsistency and improving the correlation between classification confidence and localization accuracy.
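A minimal sketch of how such a joint target could be constructed, assuming anchor assignment and IoU computation have already been performed (variable names are illustrative, not taken from the paper's code):

```python
import torch

# Joint classification-quality target: instead of a one-hot vector, the slot of
# the assigned class stores the IoU itself, so the classification score doubles
# as the localization quality estimate.
num_anchors, num_classes = 8, 80
gt_class = torch.randint(0, num_classes, (num_anchors,))  # assigned class per positive anchor
gt_iou = torch.rand(num_anchors)                          # IoU of predicted box vs. its GT box

joint_target = torch.zeros(num_anchors, num_classes)
joint_target[torch.arange(num_anchors), gt_class] = gt_iou
# Negative anchors keep an all-zero target vector (omitted for brevity).
```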
Generalized Localization Distribution
Instead of a Dirac delta, the paper models each box edge offset as a discrete probability distribution over a fixed range of bins and takes the distribution's expectation as the regression output. This flexible representation can express the uncertainty and variance of true object boundaries, producing more informative outputs in complex scenes.
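A minimal sketch of this discretized representation, assuming `reg_max + 1` integer bins per box side (the bin count and tensor names here are assumptions for illustration):

```python
import torch

# Each of the four box sides (left, top, right, bottom) is predicted as a
# learned distribution over integer bins; the regressed offset is its mean.
reg_max = 16                                             # assumed bin range [0, reg_max]
num_anchors = 8
side_logits = torch.randn(num_anchors, 4, reg_max + 1)   # per-side, per-bin logits

probs = side_logits.softmax(dim=-1)                      # arbitrary (learned) distribution
bins = torch.arange(reg_max + 1, dtype=torch.float32)    # bin centers 0, 1, ..., reg_max
offsets = (probs * bins).sum(dim=-1)                     # expected value per side
# `offsets` would then be scaled by the feature-map stride to obtain pixel distances.
```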
Generalized Focal Loss (GFL)
To optimize the proposed representations, the authors extend Focal Loss, which only supports discrete {0, 1} labels, to continuous labels, yielding Generalized Focal Loss (GFL). GFL has two concrete instances: Quality Focal Loss (QFL) and Distribution Focal Loss (DFL). QFL optimizes the joint classification-quality vector, retaining the focus on hard examples while regressing continuous IoU targets for the corresponding categories. DFL sharpens the learned box distribution by maximizing the probabilities of the two discrete bins nearest each ground-truth location, so the distribution concentrates around the correct value.
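The sketch below transcribes the two losses in PyTorch-style code following the paper's definitions; the function signatures, the default β = 2, and the mean reduction are assumptions made for this illustration:

```python
import torch
import torch.nn.functional as F

def quality_focal_loss(pred_logits, iou_targets, beta=2.0):
    """QFL: binary cross-entropy against a continuous IoU target, scaled by
    |target - sigmoid(pred)|**beta to keep the focus on hard examples."""
    pred = pred_logits.sigmoid()
    scale = (iou_targets - pred).abs().pow(beta)
    bce = F.binary_cross_entropy_with_logits(pred_logits, iou_targets, reduction="none")
    return (scale * bce).mean()

def distribution_focal_loss(bin_logits, target, reg_max=16):
    """DFL: push probability mass toward the two integer bins that bracket the
    continuous regression target `target` (values assumed in [0, reg_max])."""
    left = target.floor().long().clamp(0, reg_max - 1)
    right = left + 1
    w_left = right.float() - target          # weight for the left bin
    w_right = target - left.float()          # weight for the right bin
    log_probs = F.log_softmax(bin_logits, dim=-1)
    loss = -(w_left * log_probs.gather(-1, left.unsqueeze(-1)).squeeze(-1)
             + w_right * log_probs.gather(-1, right.unsqueeze(-1)).squeeze(-1))
    return loss.mean()
```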
Experimental Results
On the COCO benchmark, the proposed GFL achieves notable improvements over state-of-the-art methods. With a ResNet-101 backbone, GFL reaches 45.0% Average Precision (AP) on COCO test-dev, outperforming SAPD (43.5%) and ATSS (43.6%) at comparable or better inference speed. The highest-performing model reaches 48.2% AP at 10 FPS on a single 2080Ti GPU.
Implications and Future Directions
The developed Generalized Focal Loss and the associated representations yield both practical and theoretical advancements. Practically, this approach offers an improved balance between detection accuracy and computational efficiency, demonstrating significant potential for real-time applications. Theoretically, the robustness introduced via continuous label optimization and flexible distribution modeling can generalize to other tasks in computer vision and machine learning.
Future research could explore extending the joint representation and GFL to other detection frameworks and further enhancing computational efficiency. Moreover, investigating the integration of these methods with emerging backbone architectures and feature pyramid designs could push the boundaries of object detection performance even further.
Conclusion
This paper presents a substantial enhancement in the methodology of dense object detectors by addressing crucial inconsistencies and inflexibilities in current practices. The proposed Generalized Focal Loss method, encompassing QFL and DFL, optimizes novel representations that allow for higher precision and reliability in object detection tasks. The superior performance metrics on established benchmarks underscore the method's efficacy and potential for broad applicability in the field.