PT-DETR: Efficient UAV Small Object Detection

Updated 24 April 2026

The paper presents PT-DETR, integrating PADF, MSFRP, and adaptive Focaler-SIoU loss to boost recall and localization for tiny, occluded objects in UAV imagery.
It employs a hybrid transformer framework with multi-scale feature fusion, combining detailed convolutional modules and frequency-domain attention for robust detection.
Experimental results on the VisDrone2019 dataset demonstrate a +1.7 mAP gain over RT-DETR, achieving higher precision with reduced parameters and computational cost.

PT-DETR is a small object detection algorithm specifically designed for challenging UAV (unmanned aerial vehicle) imagery, characterized by complex backgrounds, severe occlusion, dense object arrangements, and highly variable illumination. It is an architectural extension of RT-DETR (Real-Time Detection Transformer, ResNet-18-based), incorporating innovations in backbone feature extraction, multi-scale feature fusion, and adaptive loss design to enhance both recall and localization of small targets, while maintaining lower computational cost and parameter count relative to its predecessor (Huo et al., 30 Oct 2025).

1. Architectural Overview

PT-DETR adopts the hybrid transformer-based detection framework of RT-DETR, with several targeted enhancements for dense small object regions. The model consists of four major components:

Backbone with PADF Module: Replaces standard ResNet-18 blocks with Partially-Aware Detail Focus (PADF) modules, tuned for efficient extraction of local detail essential for small object recognition.
MSFRP Neck: Utilizes a Multi-Scale Feature Refinement Pyramid (MSFRP), incorporating SPDConv for downsampling and the Median-Frequency Feature Fusion (MFFF) module to robustly aggregate spatial, channel, and frequency cues from multiple resolutions.
Transformer Encoder: Retains the Attention-based Intra-scale Feature Interaction (AIFI) module and the CNN-based Cross-scale Feature Fusion Module (CCFM) from RT-DETR, yielding tokens that encode global and cross-scale information.
Transformer Decoder: Initializes queries from the encoder tokens with the highest IoU-aware confidence scores (uncertainty-minimal selection) and iteratively refines predictions end-to-end, eliminating the need for NMS.
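
The decoder's query initialization can be pictured as a top-K selection over encoder tokens. The sketch below is only an illustration, assuming each token already carries a scalar confidence score; the function name `select_queries` and the toy scores are hypothetical and not taken from the paper.

```python
def select_queries(token_scores, k):
    """Return the indices of the k highest-scoring encoder tokens, best first."""
    ranked = sorted(range(len(token_scores)),
                    key=lambda i: token_scores[i],
                    reverse=True)
    return ranked[:k]


# Hypothetical per-token confidence scores from the encoder head.
scores = [0.10, 0.85, 0.40, 0.92, 0.33]
top = select_queries(scores, k=3)
print(top)  # [3, 1, 2]
```

Because a fixed number of queries is selected and each is matched to at most one object during training, no NMS pass is needed afterwards.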

Key extensions are PADF (backbone), MSFRP with MFFF (neck), and Focaler-SIoU loss (bounding box regression), which jointly increase sensitivity to small-object features and enhance detection robustness.
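
SPDConv, as the technique is generally described, pairs a space-to-depth rearrangement with a non-strided convolution, so that downsampling moves pixels into channels instead of discarding them. The sketch below shows only the rearrangement step on nested lists; the sub-map ordering and the name `space_to_depth` are assumptions made for illustration, not details confirmed by the source.

```python
def space_to_depth(x, scale=2):
    """Rearrange x[C][H][W] into [C*scale*scale][H//scale][W//scale].

    Every input pixel survives; spatial resolution is traded for channels.
    """
    h, w = len(x[0]), len(x[0][0])
    if h % scale or w % scale:
        raise ValueError("H and W must be divisible by scale")
    out = []
    for dy in range(scale):
        for dx in range(scale):
            for plane in x:
                # Take every scale-th row and column, offset by (dy, dx).
                out.append([row[dx::scale] for row in plane[dy::scale]])
    return out


# One 4x4 channel becomes four 2x2 channels.
x = [[[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]]
y = space_to_depth(x)
print(len(y), y[0])  # 4 [[1, 3], [9, 11]]
```

A stride-1 convolution applied to `y` would then mix the rearranged channels, which is the part that a strided convolution or pooling layer would otherwise perform lossily.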

2. Partially-Aware Detail Focus (PADF) Module

The PADF module is tailored to maximize local detail extraction efficiency with minimal computational overhead. Its pipeline is as follows:

Partial Channel Split: Given input $X\in\mathbb{R}^{C\times H\times W}$, split the channels as $X=[X_{\rm conv};\,X_{\rm id}]$, where $X_{\rm conv}$ and $X_{\rm id}$ feed the convolved and identity branches, respectively.
Partial Convolution (PConv): A $3\times3$ convolution is applied to $X_{\rm conv}$ only, giving $Y_{\rm conv}$; the result is merged as $Y_{\rm PConv} = [Y_{\rm conv}; X_{\rm id}]$ to preserve capacity at reduced cost.
Partial Channel Attention (PAT$_{\rm ch}$): Applies global average pooling, two FC layers, and sigmoid activation to compute channel-wise attention, scaling feature activations accordingly.
Partial Spatial Attention (PAT$_{\rm sp}$): Applies a convolution followed by a sigmoid to produce a spatial attention map.
Combined Output: The PADF output combines $Y_{\rm PConv}$ with the channel and spatial attention responses, integrating local, channel, and spatially focused signals.
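
The first steps of this pipeline can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the convolved slice is taken to be the first `c_conv` channels, a ReLU is assumed between the two FC layers, and all function names and weights are hypothetical.

```python
import math


def conv3x3(x, w):
    """Zero-padded, stride-1 3x3 convolution.

    x: [C_in][H][W]; w: [C_out][C_in][3][3]; returns [C_out][H][W].
    """
    h, wd = len(x[0]), len(x[0][0])
    out = []
    for kernel in w:  # one kernel stack per output channel
        plane = [[0.0] * wd for _ in range(h)]
        for ci, xin in enumerate(x):
            for i in range(h):
                for j in range(wd):
                    for di in (-1, 0, 1):
                        for dj in (-1, 0, 1):
                            ii, jj = i + di, j + dj
                            if 0 <= ii < h and 0 <= jj < wd:
                                plane[i][j] += kernel[ci][di + 1][dj + 1] * xin[ii][jj]
        out.append(plane)
    return out


def partial_conv(x, w, c_conv):
    """Convolve only the first c_conv channels; the rest pass through untouched."""
    x_conv, x_id = x[:c_conv], x[c_conv:]
    return conv3x3(x_conv, w) + x_id  # list concat = channel concat [Y_conv; X_id]


def channel_attention(x, w1, w2):
    """GAP -> FC -> ReLU (assumed) -> FC -> sigmoid, then rescale each channel."""
    gap = [sum(map(sum, p)) / (len(p) * len(p[0])) for p in x]
    hidden = [max(0.0, sum(a * g for a, g in zip(row, gap))) for row in w1]
    gate = [1.0 / (1.0 + math.exp(-sum(a * v for a, v in zip(row, hidden))))
            for row in w2]
    return [[[g * v for v in r] for r in p] for g, p in zip(gate, x)]


# Toy input: 4 channels of 3x3, channel c filled with the value c + 1.
x = [[[float(c + 1)] * 3 for _ in range(3)] for c in range(4)]
identity = [[[[0, 0, 0], [0, 1, 0], [0, 0, 0]]]]  # 1 output, 1 input channel
y = partial_conv(x, identity, c_conv=1)

# All-zero FC weights give a gate of sigmoid(0) = 0.5 for every channel.
w1 = [[0.0] * 4 for _ in range(2)]
w2 = [[0.0] * 2 for _ in range(4)]
z = channel_attention(y, w1, w2)
```

Only `x[:c_conv]` ever enters the convolution loop, which is where the cost saving comes from; the identity channels are carried forward by reference.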

This strategy expressly preserves fine-grained semantics and structural cues critical for accurate small target discrimination, while mitigating redundant multiply-accumulate operations (MACs) (Huo et al., 30 Oct 2025).
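
The MAC saving from convolving only a slice of the channels follows from simple arithmetic. The feature-map size and the 1/4 split ratio below are hypothetical values chosen for illustration; the source does not state the ratio used in PADF.

```python
def conv_macs(c_in, c_out, h, w, k=3):
    """Multiply-accumulates of a dense k x k convolution (stride 1, same padding)."""
    return c_in * c_out * k * k * h * w


C, H, W = 64, 80, 80   # hypothetical feature-map shape
c_conv = C // 4        # hypothetical split: convolve a quarter of the channels

full = conv_macs(C, C, H, W)
partial = conv_macs(c_conv, c_conv, H, W)
print(full // partial)  # 16
```

Since the cost scales with the product of input and output channels, convolving a fraction $r$ of the channels costs roughly $r^2$ of the dense convolution, here $1/16$.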

3. Median-Frequency Feature Fusion (MFFF) and MSFRP Neck

To further boost discriminative power for small-scale objects, the MFFF module within the MSFRP neck operates as a dual-branch fusion unit:

Channel-Frequency Attention Branch:
- Transforms features to frequency domain via FFT.
- Dual-channel weighting (DCAM) applies separate convolutions, linearly combines the results, and returns to the spatial domain with IFFT.
- Frequency Spatial Attention (FSAM) computes a spatial attention map in the frequency domain, modulating features before inverse transformation.
Global Median Pooling Branch:
- Computes per-channel medians along with average and max pooling, followed by attention-weighting via two convolutional layers with nonlinearities.
Fusion:
- Aggregates the input feature $X$ with the frequency-based and channel-based outputs, using a convolution to maintain the channel dimension as needed.

The outcome is a robust fusion of global and high-frequency details, amplifying context while selectively strengthening small-scale spatial resolution.
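
Two ingredients of the MFFF branches can be illustrated with the standard library alone. The sketch below is a deliberate simplification: it uses a naive 1-D DFT where the module would use a 2-D FFT, and fixed per-bin weights where the module would use learned convolutions. All function names are hypothetical.

```python
import cmath
from statistics import median


def pooled_descriptors(plane):
    """Return (median, mean, max) over all pixels of one channel plane."""
    vals = [v for row in plane for v in row]
    return median(vals), sum(vals) / len(vals), max(vals)


def dft(signal):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse DFT; keeps the real part, which is exact for real-valued signals."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def reweight_in_frequency(signal, weights):
    """Scale each frequency bin by a weight, then return to the spatial domain."""
    return idft([wk * fk for wk, fk in zip(weights, dft(signal))])


# A single bright outlier shifts the mean and max but not the median.
plane = [[1, 1, 1],
         [1, 100, 1],
         [1, 1, 1]]
print(pooled_descriptors(plane))  # (1, 12.0, 100)

# Keeping only the DC bin collapses the signal to its mean (about 2.5 everywhere).
flat = reweight_in_frequency([1.0, 2.0, 3.0, 4.0], [1, 0, 0, 0])
```

The first example hints at why a median statistic is useful alongside average and max pooling in cluttered scenes: it is insensitive to isolated extreme activations.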

4. Focaler-SIoU Loss for Adaptive Box Regression

PT-DETR introduces Focaler-SIoU, which adaptively weights Intersection-over-Union (IoU) based regression by sample difficulty and explicitly incorporates geometric alignment:

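Focaler-IoU Mapping: For an IoU value $\mathrm{IoU}$ and thresholds $0\le d<u\le 1$,

$\mathrm{IoU}^{\rm focaler} = \begin{cases} 0, & \mathrm{IoU} < d \\ \dfrac{\mathrm{IoU}-d}{u-d}, & d \le \mathrm{IoU} \le u \\ 1, & \mathrm{IoU} > u \end{cases}$

with loss $\mathcal{L}_{\text{Focaler-IoU}} = 1 - \mathrm{IoU}^{\rm focaler}$.

SIoU Loss: $\mathcal{L}_{\rm SIoU}$ combines overlap, center distance, aspect ratio, and angle.
Combined Loss: $\mathcal{L}_{\text{Focaler-SIoU}} = \mathcal{L}_{\rm SIoU} + \mathrm{IoU} - \mathrm{IoU}^{\rm focaler}$ concentrates the regression signal on samples whose IoU lies between $d$ and $u$ and suppresses distractors, enhancing box matching for small and occluded instances.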
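
The interval mapping is easy to state in code. The sketch below assumes the standard Focaler-IoU form (zero below a lower threshold `d`, one above an upper threshold `u`, linear in between) and treats the SIoU term as an opaque `base_loss` value; the default thresholds are hypothetical placeholders, not values from the paper.

```python
def focaler_iou(iou, d=0.0, u=0.95):
    """Linearly remap IoU onto [0, 1] between thresholds d and u."""
    if not 0.0 <= d < u <= 1.0:
        raise ValueError("require 0 <= d < u <= 1")
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)


def focaler_loss(base_loss, iou, d=0.0, u=0.95):
    """Focaler-style correction of an IoU-based loss: base + IoU - IoU_focaler."""
    return base_loss + iou - focaler_iou(iou, d, u)


# With the plain IoU loss (1 - IoU) as the base, the result is 1 - IoU_focaler.
iou = 0.5
loss = focaler_loss(1.0 - iou, iou, d=0.2, u=0.8)
print(round(loss, 6))  # 0.5
```

Inside the interval the remapped IoU changes with slope `1 / (u - d)`, while outside it the mapping is flat, so the choice of thresholds decides which samples drive box regression.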

This loss is integrated into the total training objective as the box-regression term of a weighted sum, with a separate scalar weight on each term.

5. Experimental Results and Quantitative Analysis

Experiments are conducted on the VisDrone2019 dataset (10 object classes, dominated by small objects). Key metrics evaluated are mAP@[0.5:0.95], mAP@0.5 (mAP50), parameter count, and GFLOPs.

Model	Params (M)	mAP@[0.5:0.95]	mAP50	GFLOPs
RT-DETR-R18	20.09	26.4	36.8	—
PT-DETR	19.79	28.1	38.4	67.5

PT-DETR surpasses RT-DETR by +1.7 mAP@[0.5:0.95] and +1.6 mAP50 with a slightly smaller parameter footprint (19.79M vs. 20.09M). In ablation studies, PADF alone contributes +0.4 mAP50 and reduces the parameter count by 0.6M; MSFRP (including MFFF and SPDConv) adds +1.2 mAP@[0.5:0.95]; and Focaler-SIoU further increases mAP50 by +0.9.
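
The headline deltas can be re-derived directly from the table. The snippet below uses only the numbers reported above; the dictionary keys are illustrative names.

```python
rt_detr = {"params_m": 20.09, "map": 26.4, "map50": 36.8}
pt_detr = {"params_m": 19.79, "map": 28.1, "map50": 38.4}

gain_map = round(pt_detr["map"] - rt_detr["map"], 1)
gain_map50 = round(pt_detr["map50"] - rt_detr["map50"], 1)
param_cut_pct = round(
    100 * (rt_detr["params_m"] - pt_detr["params_m"]) / rt_detr["params_m"], 2
)
print(gain_map, gain_map50, param_cut_pct)  # 1.7 1.6 1.49
```

The parameter reduction is therefore about 1.5% of the baseline, which is small next to the accuracy gain.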

Qualitative findings indicate enhanced recall for tiny and highly occluded classes (e.g., pedestrians, bicycles, traffic signs), with improved localization tightness and reduced duplicate detections, achieved without NMS due to end-to-end token matching.

6. Limitations and Future Directions

While PT-DETR delivers improved recall and localization for small objects, the computational burden of the MSFRP neck moderately increases GFLOPs. Detection of extremely small targets remains challenging under complex UAV scene conditions.

Anticipated future investigations include dynamic trade-offs between high-resolution feature extraction and context encoding, adoption of lightweight attention mechanisms in the neck, and adaptive, frequency-domain filter learning for better pattern sampling in small object regimes.

7. Significance and Outlook

PT-DETR demonstrates the utility of integrating detail-aware convolutional and attention modules, frequency-domain fusion strategies, and adaptive loss formulations in transformer-based object detection pipelines. Its architecture improves accuracy on the VisDrone2019 UAV small-object benchmark (+1.7 mAP@[0.5:0.95] and +1.6 mAP50 over RT-DETR) while lowering model size, illustrating a pathway for robust and efficient detection in densely cluttered remote sensing and aerial imagery (Huo et al., 30 Oct 2025).

References

PT-DETR: Small Target Detection Based on Partially-Aware Detail Focus (2025)
