FlyPose: Aerial Real-Time Pose Estimation
- FlyPose is a top-down human pose estimator that integrates RT-DETRv2-S for detection and ViTPose for pose estimation in a real-time UAV setting.
- It employs lightweight networks and specialized multi-dataset training protocols to achieve over 50 fps on edge devices, ensuring rapid inference.
- The framework addresses challenges like steep camera angles, frequent occlusions, and small scales, and introduces the FlyPose-104 dataset for evaluation.
FlyPose is a top-down, real-time human pose estimation framework specifically designed for deployment on unmanned aerial vehicles (UAVs). Addressing the unique challenges of aerial human perception—such as steep camera angles, frequent occlusions, variable altitude, and small person scale—FlyPose introduces a multi-stage pipeline leveraging lightweight detection and pose estimation networks, optimized for rapid inference on edge hardware. Coupled with specialized training protocols, cross-dataset fine-tuning, and the release of a dedicated aerial pose dataset, FlyPose represents a significant advance in human pose estimation from aerial perspectives (Farooq et al., 9 Jan 2026).
1. System Architecture and Processing Pipeline
FlyPose employs a two-stage, top-down pipeline for estimating human pose from UAV imagery:
Stage 1: Person Detection
- Network: RT-DETRv2-S, a single-stage DEtection TRansformer variant with a ResNet-18 backbone pretrained on COCO+Objects365.
- Input: Full-HD (1920×1080) RGB or thermal images from a gimbal-mounted UAV camera.
- Output: bounding boxes b_i = (x_i, y_i, w_i, h_i), each with an associated person confidence score.
- Losses:
- Classification: focal or cross-entropy for the "person" class.
- Regression: Normalized Wasserstein Distance (NWD) Loss, L_NWD = 1 − exp(−W2(N_a, N_b)/C), where W2 is the 2-Wasserstein distance between 2-D Gaussians fitted to the predicted and ground-truth boxes and C is a dataset-dependent normalizing constant, replacing standard Generalized IoU.
- Inference Speed: ∼13 ms per frame (Jetson Orin AGX, TensorRT-FP32).
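The NWD regression objective above can be sketched in a few lines: fitting a 2-D Gaussian to each box reduces the squared 2-Wasserstein distance to a plain Euclidean distance in (cx, cy, w/2, h/2) space. This is a minimal sketch, not the paper's implementation; the constant `c` is a dataset-dependent normalizer whose value here is illustrative.

```python
import math

def nwd_loss(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance loss between two (cx, cy, w, h) boxes.

    Each box is modelled as a 2-D Gaussian N((cx, cy), diag(w^2/4, h^2/4));
    the squared 2-Wasserstein distance between two such Gaussians reduces
    to a Euclidean distance in (cx, cy, w/2, h/2) space.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_sq = (cxa - cxb) ** 2 + (cya - cyb) ** 2 \
          + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    nwd = math.exp(-math.sqrt(w2_sq) / c)   # normalized similarity in (0, 1]
    return 1.0 - nwd                        # 0 for identical boxes
```

Unlike IoU-based losses, this stays smooth and informative even when a tiny predicted box has zero overlap with the ground truth, which is why it suits small aerial persons.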
Stage 2: Pose Estimation
- Network: ViTPose-S, a Vision Transformer-based regressor with 12 layers and a heatmap prediction head.
- Input crop: Each detected box is resized (long edge to 256 px; short edge to 192 px) with zero-padding for aspect-ratio preservation.
- Output: 17 COCO keypoint heatmaps of size 64×48 (one per keypoint, at 1/4 of the crop resolution).
- Loss: Per-pixel mean-squared error, L = (1/K) Σ_k ||H_k − Ĥ_k||², between predicted and ground-truth Gaussian-smoothed heatmaps.
- Post-processing: Non-maximum suppression on heatmaps yields keypoint coordinates.
- Inference Speed: ∼6.5 ms per frame (Jetson Orin AGX, TensorRT-FP32).
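The heatmap post-processing step can be illustrated with a simplified argmax decoder (the paper's NMS-based refinement is omitted), assuming 64×48 heatmaps decoded back to a 256×192 crop:

```python
import numpy as np

def decode_heatmaps(heatmaps, crop_w=192, crop_h=256):
    """Recover (x, y, confidence) per keypoint from a (K, Hh, Hw) heatmap stack.

    The argmax cell of each heatmap is scaled back to crop-pixel
    coordinates (a 4x stride between heatmap and input crop is assumed).
    """
    k, hh, hw = heatmaps.shape
    keypoints = []
    for hm in heatmaps:
        idx = np.argmax(hm)
        y, x = divmod(int(idx), hw)
        conf = float(hm[y, x])
        # map heatmap cell back to input-crop pixels
        keypoints.append((x * crop_w / hw, y * crop_h / hh, conf))
    return keypoints
```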
Overall, the pipeline achieves ≈19.5 ms per frame total latency, supporting real-time (>50 fps) operation onboard UAVs (Farooq et al., 9 Jan 2026).
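The two-stage flow above can be summarized as a short orchestration sketch; `detector` and `pose_net` are hypothetical callables standing in for RT-DETRv2-S and ViTPose-S, and the resize/padding step of the real pipeline is elided:

```python
def estimate_poses(frame, detector, pose_net):
    """Top-down pipeline sketch: detect persons, then estimate pose per crop.

    Boxes are (x, y, w, h) in frame pixels; `pose_net` returns
    crop-relative (kx, ky, confidence) triples.
    """
    poses = []
    for (x, y, w, h) in detector(frame):
        x0, y0 = int(x), int(y)
        # cut the person crop out of the frame (rows, then columns)
        crop = [row[x0:x0 + int(w)] for row in frame[y0:y0 + int(h)]]
        keypoints = pose_net(crop)
        # shift keypoints back into full-frame coordinates
        poses.append([(kx + x, ky + y, c) for (kx, ky, c) in keypoints])
    return poses
```

Because the pose stage runs once per detection, per-frame latency grows with crowd size; the quoted ∼6.5 ms figure corresponds to a single crop.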
2. Network Design and Model Variants
RT-DETRv2-S Person Detector
- Architecture: ResNet-18 backbone (18 convolutional layers; feature maps at strides 4, 8, 16, 32), lightweight transformer encoder–decoder heads (6 layers each).
- Input Resolution: 1280 px on the shorter side during both training and evaluation.
- Loss: Normalized Wasserstein Distance Loss for bounding box regression.
ViTPose-S Pose Estimator
- Backbone: Vision Transformer, 12 layers, 12-head self-attention, hidden dimension 384.
- Patch Size: 16×16; a 256×192 input crop yields (256/16)×(192/16) = 192 sequence tokens.
- Heatmap Head: Deconvolutional upsampling of the transformer output to 64×48 keypoint heatmaps.
- Variants: S, B, L, and H (12, 12, 24, 32 layers; hidden dims 384, 768, 1024, 1280, respectively).
This modular network design enables optimization for resource-constrained edge devices and flexibility across input modalities.
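The token-count and size arithmetic behind these variants is straightforward; the helper below is illustrative (the parameter estimate uses the common ~12·layers·hidden² rule of thumb for a transformer encoder, ignoring embeddings and the head):

```python
def vit_token_count(img_h=256, img_w=192, patch=16):
    """Number of patch tokens a ViT backbone produces for a pose input crop."""
    return (img_h // patch) * (img_w // patch)

def vit_params_rough(layers=12, hidden=384):
    """Very rough encoder parameter count: ~4*hidden^2 for attention plus
    ~8*hidden^2 for the MLP per layer, i.e. ~12 * layers * hidden^2."""
    return 12 * layers * hidden * hidden
```

For ViTPose-S (12 layers, hidden 384) this gives roughly 21M encoder parameters, which is why the S variant fits comfortably on a Jetson-class device.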
3. Training Datasets, Augmentation, and Optimization
RT-DETRv2-S Training Protocol
- Pretraining: COCO + Objects365 datasets.
- Fine-tuning – Stage 1: 60 epochs on VisDrone2019-DET (person class only).
- Fine-tuning – Stage 2: Fine-tune on eight additional aerial datasets, totaling 66,849 training and 21,164 validation images.
- COCO-Person Reintroduction: 50 epochs to preserve performance on frontal, natural-camera images.
- Loss Switch: Final 50 epochs using the Normalized Wasserstein Distance loss in place of Generalized IoU.
- Optimization: AdamW optimizer, batch size 16–32.
ViTPose Fine-Tuning
- Initialization: Pretrained on COCO-Keypoints.
- Augmentation: Half-body, rotation (30°), scaling (30%), and down-scaling (5–20%) for simulating small/distant persons.
- Dataset: UAV-Human v1, 170–210 epochs, batch size ≈64.
- Optimization: AdamW, learning rate 5e-4 with step decay.
This multi-stage, aerial-centric training regime with aggressive augmentation is essential for robust generalization to the idiosyncrasies of UAV imagery.
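The "5e-4 with step decay" schedule used for ViTPose fine-tuning can be sketched as follows; the milestone epochs here are illustrative, since the source only states the base rate and the decay type:

```python
def step_decay_lr(epoch, base_lr=5e-4, milestones=(170, 200), gamma=0.1):
    """Step-decay schedule: multiply base_lr by gamma at each passed milestone.

    Milestones are hypothetical; the paper only specifies a 5e-4 base
    learning rate with step decay over 170-210 epochs.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```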
4. Quantitative Performance and Benchmark Results
Person Detection Performance (COCO mAP@0.5:0.95)
| Method | mAP (all) | AR (all) | mAP (VisDrone) | mAP (FP-104) |
|---|---|---|---|---|
| Baseline (COCO only) | 14.33 | 26.76 | 10.44 | 10.26 |
| + VisDrone only | 21.43 | 32.61 | 21.08 | 22.42 |
| + Multi-Dataset | 28.21 | 38.20 | 21.07 | 22.67 |
| + COCO re-introduced | 28.07 | 39.21 | 20.21 | 25.05 |
| + NWD Loss | 27.96 | 39.14 | 20.20 | 27.41 |
Key finding: multi-dataset aerial fine-tuning adds +6.8 mAP over VisDrone-only fine-tuning (21.43 → 28.21), and switching to the Normalized Wasserstein Distance loss yields the best FlyPose-104 result (27.41 mAP).
Pose Estimation on UAV-Human v1 (COCO keypoint mAP@0.5:0.95)
| Method | mAP (COCO) | mAP (UAV-H fine-tuned) | Latency A6000 [ms] | Latency Jetson [ms] |
|---|---|---|---|---|
| ViTPose-S | 61.09 | 65.76 | 110.23 | 6.54 |
| ViTPose-B | 63.15 | 67.50 | 116.20 | 11.62 |
| ViTPose-L | 66.50 | 70.31 | 198.30 | 22.35 |
| ViTPose-H | 67.52 | 73.18 | 322.55 | n/a |
- Baseline (AlphaPose, Li et al. 2021): 56.9 mAP.
- ViTPose-H achieves a +16.3 mAP gain over prior best on UAV-Human.
Real-time throughput is enabled by fast inference: 13 ms/frame (detection), 6.5 ms/frame (pose), with total overhead under 20 ms.
5. Onboard Deployment and Resource Characterization
FlyPose has been demonstrated onboard a quadrotor UAV (max. 35 kg MTOW) carrying a Jetson Orin AGX developer kit and a gimbal camera (≈4 kg payload). The pipeline achieves ≈20 ms end-to-end per frame (starting after an initial RTSP camera setup delay of ≈300 ms), aligned with 50 FPS operation. This real-time processing headroom is intended to support downstream tasks such as action or gesture recognition (Farooq et al., 9 Jan 2026).
Total latency breakdown:
- Detection: 13 ms/frame.
- Pose: 6.5 ms/frame.
- Pre-/post-processing: ≈0.5 ms/frame.
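The frame rate follows directly from the latency figures above:

```python
def pipeline_fps(det_ms=13.0, pose_ms=6.5, prep_ms=0.5):
    """Frames per second implied by the per-stage latencies of the pipeline."""
    total_ms = det_ms + pose_ms + prep_ms
    return 1000.0 / total_ms
```

With 13 + 6.5 + 0.5 = 20 ms per frame, the pipeline sustains exactly 50 fps, matching the real-time claim.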
A plausible implication is that such performance margins enable closed-loop control or human-in-the-loop UAV missions.
6. FlyPose-104 Dataset and Challenge Aspects
FlyPose-104 is a new test-only dataset released alongside FlyPose to benchmark aerial human pose detectors under unconstrained conditions.
- Frames: 104 aerial images (self-recorded plus public sources).
- Annotations: 193 person instances, each labeled with COCO-format bounding boxes, 17 keypoints, and visibility flags.
- Scene variation: Altitudes from 5–50 m, 90° nadir and steep oblique views, variable backgrounds (snow, water, dirt, urban).
- Difficulties: Frequent severe self-occlusion, extreme scale variance (blurry, low-res small persons), and challenging lower limb and face keypoints.
- Usage: Test-only split, explicitly for detector generalization.
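Since FlyPose-104 follows the COCO annotation schema, a single person instance looks like the record below; field names follow the COCO keypoint format, while all values are invented for illustration:

```python
# Illustrative COCO-format person annotation (values are made up).
annotation = {
    "image_id": 1,
    "category_id": 1,                     # "person"
    "bbox": [412.0, 230.5, 38.0, 91.0],   # x, y, w, h in pixels
    # 17 keypoints flattened as (x, y, v) triples;
    # v = 0 not labeled, 1 labeled but occluded, 2 labeled and visible
    "keypoints": [431.0, 238.0, 2] + [0.0, 0.0, 0] * 16,
    "num_keypoints": 1,                   # count of keypoints with v > 0
}
```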
This suggests FlyPose-104 will facilitate further research into failure modes for small-scale, top-down pose estimation from aerial perspectives.
7. Broader Context and Comparative Remarks
The approach and challenges addressed by FlyPose are distinct from multi-view 3D estimation strategies—such as AirPose (Saini et al., 2022)—which leverage multiple UAVs, distributed inference, and implicit cross-view parametric fusion for 3D pose and shape recovery. FlyPose targets single-camera, 2D keypoint estimation in real time for safety-critical UAV deployments at the edge, whereas AirPose tackles the added complexity of decentralized multi-agent 3D reconstruction. Both share hardware deployment challenges, sensitivity to occlusion, and reliance on comprehensive data annotation.
Continued advances in dataset creation (e.g., FlyPose-104), network architectures, and edge deployment protocols are likely to drive improved robustness and generalizability of pose systems across diverse airborne applications.