
TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios (2108.11539v1)

Published 26 Aug 2021 in cs.CV and cs.AI

Abstract: Object detection in drone-captured scenarios is a popular recent task. Because drones navigate at varying altitudes, object scale varies drastically, which burdens network optimization. Moreover, high-speed, low-altitude flight introduces motion blur on densely packed objects, making them hard to distinguish. To address these two issues, we propose TPH-YOLOv5. Building on YOLOv5, we add one more prediction head to detect objects at different scales, then replace the original prediction heads with Transformer Prediction Heads (TPH) to exploit the self-attention mechanism. We also integrate the Convolutional Block Attention Module (CBAM) to find attention regions in scenarios with dense objects. To further improve TPH-YOLOv5, we apply a bag of useful strategies such as data augmentation, multi-scale testing, multi-model ensembling, and an extra classifier. Extensive experiments on the VisDrone2021 dataset show that TPH-YOLOv5 performs well, with impressive interpretability, on drone-captured scenarios. On the DET-test-challenge dataset, TPH-YOLOv5 achieves an AP of 39.18%, surpassing the previous SOTA method (DPNetV3) by 1.81%. In the VisDrone Challenge 2021, TPH-YOLOv5 wins 5th place and closely matches the 1st-place model (AP 39.43%). Compared to the baseline YOLOv5, TPH-YOLOv5 improves AP by about 7%, which is encouraging and competitive.

Authors (4)
  1. Xingkui Zhu (5 papers)
  2. Shuchang Lyu (21 papers)
  3. Xu Wang (319 papers)
  4. Qi Zhao (181 papers)
Citations (977)

Summary

Introduction

The paper presents TPH-YOLOv5, an enhanced variant of YOLOv5 specifically tailored for object detection in drone-captured imagery. The challenges posed by drone-captured scenarios, including significant variations in object scale and the presence of numerous, densely-packed objects, serve as crucial motivations for this research. The authors identify limitations in conventional deep learning models due to these factors and propose targeted enhancements to YOLOv5.

Methodology

TPH-YOLOv5 builds on the existing YOLOv5 framework by introducing several novel components designed to address the unique challenges inherent in drone-captured images:

  1. Additional Prediction Head: A fourth prediction head is added to YOLOv5, operating on a higher-resolution feature map to better manage objects of varying scales and, in particular, the very small objects that are prevalent in drone-captured images.
  2. Transformer Prediction Head (TPH): The conventional prediction heads in YOLOv5 are replaced with Transformer Prediction Heads. Self-attention lets the model capture global contextual information and better handle densely packed objects; a minimal sketch of such a block follows this list.
  3. Convolutional Block Attention Module (CBAM): To enable the model to focus on relevant regions within an image, CBAM is integrated into the architecture. This module sequentially generates attention maps along the channel and spatial dimensions, refining the feature map and sharpening the focus on critical areas in complex scenes (see the CBAM sketch below).
  4. Auxiliary Enhancements: Data augmentation (MixUp, Mosaic, and photometric distortions), multi-scale testing, and model ensembling with Weighted Boxes Fusion (WBF) are employed to further bolster performance and yield more reliable, accurate final detections (a multi-scale testing sketch follows the list).
  5. Self-trained Classifier: To address residual classification errors, a ResNet18-based classifier is trained on image patches cropped from the dataset and used to supplement the detector, improving accuracy on confusable categories such as "tricycle" and "awning-tricycle" (see the final sketch below).
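
The kind of block TPH substitutes for YOLOv5's convolutional bottlenecks can be pictured with the minimal PyTorch sketch below: every pixel of a feature map becomes a token, and global self-attention mixes information across the whole map before the result is reshaped back. This is our illustrative rendering, not the authors' code; channel count, head count, and MLP ratio are assumptions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer encoder applied to a CNN feature map (B, C, H, W)."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, 4 * channels),
            nn.GELU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C): one token per pixel
        t = self.norm1(tokens)
        tokens = tokens + self.attn(t, t, t, need_weights=False)[0]  # global self-attention
        tokens = tokens + self.mlp(self.norm2(tokens))               # position-wise MLP
        return tokens.transpose(1, 2).reshape(b, c, h, w)

# A coarse 20x20 feature map such as a detection head might consume.
feat = torch.randn(1, 256, 20, 20)
print(TransformerBlock(256)(feat).shape)  # torch.Size([1, 256, 20, 20])
```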
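CBAM itself is a published module (Woo et al., ECCV 2018), so its structure is well defined: channel attention computed from pooled descriptors, followed by spatial attention computed from channel-pooled maps. Below is a compact PyTorch rendering; the reduction ratio and kernel size are that paper's defaults, not values taken from TPH-YOLOv5.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))  # global average pooling per channel
        mx = self.mlp(x.amax(dim=(2, 3)))   # global max pooling per channel
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)   # (B, 1, H, W) channel-average map
        mx = x.amax(dim=1, keepdim=True)    # (B, 1, H, W) channel-max map
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)      # reweight channels first ...
        return x * self.sa(x)   # ... then highlight spatial regions

feat = torch.randn(1, 256, 40, 40)
print(CBAM(256)(feat).shape)  # torch.Size([1, 256, 40, 40])
```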
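For the test-time enhancements, the sketch below combines multi-scale inference with Weighted Boxes Fusion. It assumes the open-source ensemble-boxes package (pip install ensemble-boxes) and a hypothetical run_detector function returning boxes normalized to [0, 1]; the scales and thresholds are placeholders, not the paper's settings.

```python
from ensemble_boxes import weighted_boxes_fusion

def detect_multiscale(image, run_detector, scales=(960, 1280, 1536)):
    """Run the detector at several input resolutions and fuse the results."""
    boxes_list, scores_list, labels_list = [], [], []
    for s in scales:
        # Each call returns boxes in normalized xyxy format plus scores and labels.
        boxes, scores, labels = run_detector(image, img_size=s)
        boxes_list.append(boxes)
        scores_list.append(scores)
        labels_list.append(labels)
    # WBF averages overlapping predictions instead of discarding them (as NMS would),
    # which tends to help when predictions from different scales disagree slightly.
    return weighted_boxes_fusion(
        boxes_list, scores_list, labels_list,
        weights=[1.0] * len(scales), iou_thr=0.55, skip_box_thr=0.01,
    )
```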
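Finally, the self-trained classifier amounts to ordinary fine-tuning of a ResNet18 on ground-truth crops, then re-scoring detections with it. In the sketch below the 10-class head matches VisDrone's category count, while the torchvision weights choice, crop handling, and optimizer settings are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 10  # VisDrone object categories
clf = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
clf.fc = nn.Linear(clf.fc.in_features, num_classes)  # replace the ImageNet head

optimizer = torch.optim.SGD(clf.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(patches: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of ground-truth crops (assumed 64x64)."""
    optimizer.zero_grad()
    loss = criterion(clf(patches), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```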

Experimental Results

The empirical results on the VisDrone2021 dataset demonstrate the substantial improvements achieved by TPH-YOLOv5. Noteworthy highlights include:

  • On the DET-test-challenge dataset, TPH-YOLOv5 achieves an Average Precision (AP) of 39.18%, surpassing the prior State-Of-The-Art (SOTA) method (DPNetV3) by 1.81%.
  • In the VisDrone Challenge 2021, TPH-YOLOv5 secures 5th place with an AP of 39.18%, closely matching the 1st-place model (AP 39.43%).
  • Compared to the baseline YOLOv5 model, TPH-YOLOv5 improves the AP by approximately 7%.

These results underscore the effectiveness of the modifications introduced. The inclusion of an additional prediction head and the adoption of transformer-based heads significantly enhance the model’s capability to detect small and densely-packed objects in drone-captured images. CBAM contributes to the model’s ability to focus on relevant areas within an image, thereby improving the detection accuracy.

Implications and Future Directions

The enhanced performance of TPH-YOLOv5 highlights several theoretical and practical implications. Theoretically, the successful integration of transformer-based prediction heads into the conventional YOLO framework presents a promising direction for future research. The ability of transformers to capture long-range dependencies and contextual information can be further explored to enhance various computer vision tasks.

Practically, the improvements have significant implications for applications involving drone-captured imagery, such as urban surveillance, wildlife monitoring, and agricultural management. The robust detection capabilities of TPH-YOLOv5 can lead to more accurate and reliable insights in these domains.

Looking forward, future research may explore the following avenues:

  • Optimization of Transformer Integration: Further refining the use of transformers within object detection models to balance performance gains with computational efficiency.
  • Enhanced Data Augmentation Techniques: Developing more sophisticated data augmentation strategies that better simulate the diverse and dynamic conditions of drone-captured scenarios.
  • Cross-domain Applications: Extending the enhanced detection techniques to other areas of computer vision, such as autonomous driving or satellite imagery analysis.

In conclusion, TPH-YOLOv5 sets a strong precedent for leveraging advanced neural network architectures to tackle the unique challenges posed by drone-captured imagery, paving the way for more accurate and efficient object detection solutions in this rapidly evolving field.