FlyPose: Aerial Real-Time Pose Estimation
- FlyPose is a top-down human pose estimator that integrates RT-DETRv2-S for detection and ViTPose for pose estimation in a real-time UAV setting.
- It employs lightweight networks and specialized multi-dataset training protocols to achieve over 50 fps on edge devices, ensuring rapid inference.
- The framework addresses challenges like steep camera angles, frequent occlusions, and small scales, and introduces the FlyPose-104 dataset for evaluation.
FlyPose is a top-down, real-time human pose estimation framework specifically designed for deployment on unmanned aerial vehicles (UAVs). Addressing the unique challenges of aerial human perception—such as steep camera angles, frequent occlusions, variable altitude, and small person scale—FlyPose introduces a multi-stage pipeline leveraging lightweight detection and pose estimation networks, optimized for rapid inference on edge hardware. Coupled with specialized training protocols, cross-dataset fine-tuning, and the release of a dedicated aerial pose dataset, FlyPose represents a significant advance in human pose estimation from aerial perspectives (Farooq et al., 9 Jan 2026).
1. System Architecture and Processing Pipeline
FlyPose employs a two-stage, top-down pipeline for estimating human pose from UAV imagery:
Stage 1: Person Detection
- Network: RT-DETRv2-S, a single-stage DEtection TRansformer variant with a ResNet-18 backbone pretrained on COCO+Objects365.
- Input: Full-HD (1920×1080) RGB or thermal images from a gimbal-mounted UAV camera.
- Output: bounding boxes b_i = (x_i, y_i, w_i, h_i), each with an associated person confidence score.
- Losses:
- Classification: focal or cross-entropy for the "person" class.
- Regression: Normalized Wasserstein Distance (NWD) Loss, L_NWD = 1 − exp(−W2(N_a, N_b)/C), where W2 is the 2-Wasserstein distance between 2-D Gaussians fitted to the predicted and ground-truth boxes and C is a dataset-dependent normalizing constant, replacing standard Generalized IoU.
- Inference Speed: ∼13 ms per frame (Jetson Orin AGX, TensorRT-FP32).
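The NWD regression objective above can be sketched in a few lines: fitting a 2-D Gaussian to each box reduces the squared 2-Wasserstein distance to a plain Euclidean distance in (cx, cy, w/2, h/2) space. This is a minimal sketch, not the paper's implementation; the constant `c` is a dataset-dependent normalizer whose value here is illustrative.

```python
import math

def nwd_loss(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance loss between two (cx, cy, w, h) boxes.

    Each box is modelled as a 2-D Gaussian N((cx, cy), diag(w^2/4, h^2/4));
    the squared 2-Wasserstein distance between two such Gaussians reduces
    to a Euclidean distance in (cx, cy, w/2, h/2) space.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_sq = (cxa - cxb) ** 2 + (cya - cyb) ** 2 \
          + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2
    nwd = math.exp(-math.sqrt(w2_sq) / c)   # normalized similarity in (0, 1]
    return 1.0 - nwd                        # 0 for identical boxes
```

Unlike IoU-based losses, this stays smooth and informative even when a tiny predicted box has zero overlap with the ground truth, which is why it suits small aerial persons.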
Stage 2: Pose Estimation
- Network: ViTPose-S, a Vision Transformer-based regressor with 12 layers and a heatmap prediction head.
- Input crop: Each detected box is resized (long edge to 256 px; short edge to 192 px) with zero-padding for aspect-ratio preservation.
- Output: 17 COCO keypoint heatmaps of size 64×48 (one per keypoint, at 1/4 of the crop resolution).
- Loss: Per-pixel mean-squared error, L = (1/K) Σ_k ||H_k − Ĥ_k||², between predicted and ground-truth Gaussian-smoothed heatmaps.
- Post-processing: Non-maximum suppression on heatmaps yields keypoint coordinates.
- Inference Speed: ∼6.5 ms per frame (Jetson Orin AGX, TensorRT-FP32).
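The heatmap post-processing step can be illustrated with a simplified argmax decoder (the paper's NMS-based refinement is omitted), assuming 64×48 heatmaps decoded back to a 256×192 crop:

```python
import numpy as np

def decode_heatmaps(heatmaps, crop_w=192, crop_h=256):
    """Recover (x, y, confidence) per keypoint from a (K, Hh, Hw) heatmap stack.

    The argmax cell of each heatmap is scaled back to crop-pixel
    coordinates (a 4x stride between heatmap and input crop is assumed).
    """
    k, hh, hw = heatmaps.shape
    keypoints = []
    for hm in heatmaps:
        idx = np.argmax(hm)
        y, x = divmod(int(idx), hw)
        conf = float(hm[y, x])
        # map heatmap cell back to input-crop pixels
        keypoints.append((x * crop_w / hw, y * crop_h / hh, conf))
    return keypoints
```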
Overall, the pipeline achieves ≈19.5 ms per frame total latency, supporting real-time (>50 fps) operation onboard UAVs (Farooq et al., 9 Jan 2026).
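The two-stage flow above can be summarized as a short orchestration sketch; `detector` and `pose_net` are hypothetical callables standing in for RT-DETRv2-S and ViTPose-S, and the resize/padding step of the real pipeline is elided:

```python
def estimate_poses(frame, detector, pose_net):
    """Top-down pipeline sketch: detect persons, then estimate pose per crop.

    Boxes are (x, y, w, h) in frame pixels; `pose_net` returns
    crop-relative (kx, ky, confidence) triples.
    """
    poses = []
    for (x, y, w, h) in detector(frame):
        x0, y0 = int(x), int(y)
        # cut the person crop out of the frame (rows, then columns)
        crop = [row[x0:x0 + int(w)] for row in frame[y0:y0 + int(h)]]
        keypoints = pose_net(crop)
        # shift keypoints back into full-frame coordinates
        poses.append([(kx + x, ky + y, c) for (kx, ky, c) in keypoints])
    return poses
```

Because the pose stage runs once per detection, per-frame latency grows with crowd size; the quoted ∼6.5 ms figure corresponds to a single crop.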
2. Network Design and Model Variants
RT-DETRv2-S Person Detector
- Architecture: ResNet-18 backbone (18 convolutional layers; feature maps at strides 4, 8, 16, 32), lightweight transformer encoder–decoder heads (6 layers each).
- Input Resolution: 1280 px on the shorter side during both training and evaluation.
- Loss: Normalized Wasserstein Distance Loss for bounding box regression.
ViTPose-S Pose Estimator
- Backbone: Vision Transformer, 12 layers, 12-head self-attention, hidden dimension 384.
- Patch Size: 16×16; a 256×192 input crop yields (256/16)×(192/16) = 192 sequence tokens.
- Heatmap Head: Deconvolutional upsampling of the transformer output to 64×48 keypoint heatmaps.
- Variants: S, B, L, and H (12, 12, 24, 32 layers; hidden dims 384, 768, 1024, 1280, respectively).
This modular network design enables optimization for resource-constrained edge devices and flexibility across input modalities.
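The token-count and size arithmetic behind these variants is straightforward; the helper below is illustrative (the parameter estimate uses the common ~12·layers·hidden² rule of thumb for a transformer encoder, ignoring embeddings and the head):

```python
def vit_token_count(img_h=256, img_w=192, patch=16):
    """Number of patch tokens a ViT backbone produces for a pose input crop."""
    return (img_h // patch) * (img_w // patch)

def vit_params_rough(layers=12, hidden=384):
    """Very rough encoder parameter count: ~4*hidden^2 for attention plus
    ~8*hidden^2 for the MLP per layer, i.e. ~12 * layers * hidden^2."""
    return 12 * layers * hidden * hidden
```

For ViTPose-S (12 layers, hidden 384) this gives roughly 21M encoder parameters, which is why the S variant fits comfortably on a Jetson-class device.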
3. Training Datasets, Augmentation, and Optimization
RT-DETRv2-S Training Protocol
- Pretraining: COCO + Objects365 datasets.
- Fine-tuning – Stage 1: 60 epochs on VisDrone2019-DET (person class only).
- Fine-tuning – Stage 2: Fine-tune on eight additional aerial datasets, totaling 66,849 training and 21,164 validation images.
- COCO-Person Reintroduction: 50 epochs to preserve performance on frontal, natural-camera images.
- Loss Switch: Final 50 epochs using the Normalized Wasserstein Distance loss in place of Generalized IoU.
- Optimization: AdamW optimizer, batch size 16–32.
ViTPose Fine-Tuning
- Initialization: Pretrained on COCO-Keypoints.
- Augmentation: Half-body, rotation (30°), scaling (30%), and down-scaling (5–20%) for simulating small/distant persons.
- Dataset: UAV-Human v1, 170–210 epochs, batch size ≈64.
- Optimization: AdamW, learning rate 5e-4 with step decay.
This multi-stage, aerial-centric training regime with aggressive augmentation is essential for robust generalization to the idiosyncrasies of UAV imagery.
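The "5e-4 with step decay" schedule used for ViTPose fine-tuning can be sketched as follows; the milestone epochs here are illustrative, since the source only states the base rate and the decay type:

```python
def step_decay_lr(epoch, base_lr=5e-4, milestones=(170, 200), gamma=0.1):
    """Step-decay schedule: multiply base_lr by gamma at each passed milestone.

    Milestones are hypothetical; the paper only specifies a 5e-4 base
    learning rate with step decay over 170-210 epochs.
    """
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma
    return lr
```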
4. Quantitative Performance and Benchmark Results
Person Detection Performance (COCO mAP@0.5:0.95)
| Method | mAP (all) | AR (all) | mAP (VisDrone) | mAP (FP-104) |
|---|---|---|---|---|
| Baseline (COCO only) | 14.33 | 26.76 | 10.44 | 10.26 |
| + VisDrone only | 21.43 | 32.61 | 21.08 | 22.42 |
| + Multi-Dataset | 28.21 | 38.20 | 21.07 | 22.67 |
| + COCO re-introduced | 28.07 | 39.21 | 20.21 | 25.05 |
| + NWD Loss | 27.96 | 39.14 | 20.20 | 27.41 |
Key finding: multi-dataset aerial fine-tuning adds +6.8 mAP over VisDrone-only fine-tuning (21.43 → 28.21), and switching to the Normalized Wasserstein Distance loss yields the best FlyPose-104 result (27.41 mAP).
Pose Estimation on UAV-Human v1 (COCO keypoint mAP@0.5:0.95)
| Method | mAP (COCO) | mAP (UAV-H fine-tuned) | Latency A6000 [ms] | Latency Jetson [ms] |
|---|---|---|---|---|
| ViTPose-S | 61.09 | 65.76 | 110.23 | 6.54 |
| ViTPose-B | 63.15 | 67.50 | 116.20 | 11.62 |
| ViTPose-L | 66.50 | 70.31 | 198.30 | 22.35 |
| ViTPose-H | 67.52 | 73.18 | 322.55 | n/a |
- Baseline (AlphaPose, Li et al. 2021): 56.9 mAP.
- ViTPose-H achieves a +16.3 mAP gain over prior best on UAV-Human.
Real-time throughput is enabled by fast inference: 13 ms/frame (detection), 6.5 ms/frame (pose), with total overhead under 20 ms.
5. Onboard Deployment and Resource Characterization
FlyPose has been demonstrated onboard a quadrotor UAV (max. 35 kg MTOW) carrying a Jetson Orin AGX developer kit and a gimbal camera (≈4 kg payload). The pipeline achieves ≈20 ms end-to-end per frame (starting after an initial RTSP camera setup delay of ≈300 ms), aligned with 50 FPS operation. This real-time processing headroom is intended to support downstream tasks such as action or gesture recognition (Farooq et al., 9 Jan 2026).
Total latency breakdown:
- Detection: 13 ms/frame.
- Pose: 6.5 ms/frame.
- Pre-/post-processing: ≈0.5 ms/frame.
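The frame rate follows directly from the latency figures above:

```python
def pipeline_fps(det_ms=13.0, pose_ms=6.5, prep_ms=0.5):
    """Frames per second implied by the per-stage latencies of the pipeline."""
    total_ms = det_ms + pose_ms + prep_ms
    return 1000.0 / total_ms
```

With 13 + 6.5 + 0.5 = 20 ms per frame, the pipeline sustains exactly 50 fps, matching the real-time claim.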
A plausible implication is that such performance margins enable closed-loop control or human-in-the-loop UAV missions.
6. FlyPose-104 Dataset and Challenge Aspects
FlyPose-104 is a new test-only dataset released alongside FlyPose to benchmark aerial human pose detectors under unconstrained conditions.
- Frames: 104 aerial images (self-recorded plus public sources).
- Annotations: 193 person instances, each labeled with COCO-format bounding boxes, 17 keypoints, and visibility flags.
- Scene variation: Altitudes from 5–50 m, 90° nadir and steep oblique views, variable backgrounds (snow, water, dirt, urban).
- Difficulties: Frequent severe self-occlusion, extreme scale variance (blurry, low-res small persons), and challenging lower limb and face keypoints.
- Usage: Test-only split, explicitly for detector generalization.
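Since FlyPose-104 follows the COCO annotation schema, a single person instance looks like the record below; field names follow the COCO keypoint format, while all values are invented for illustration:

```python
# Illustrative COCO-format person annotation (values are made up).
annotation = {
    "image_id": 1,
    "category_id": 1,                     # "person"
    "bbox": [412.0, 230.5, 38.0, 91.0],   # x, y, w, h in pixels
    # 17 keypoints flattened as (x, y, v) triples;
    # v = 0 not labeled, 1 labeled but occluded, 2 labeled and visible
    "keypoints": [431.0, 238.0, 2] + [0.0, 0.0, 0] * 16,
    "num_keypoints": 1,                   # count of keypoints with v > 0
}
```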
This suggests FlyPose-104 will facilitate further research into failure modes for small-scale, top-down pose estimation from aerial perspectives.
7. Broader Context and Comparative Remarks
The approach and challenges addressed by FlyPose are distinct from multi-view 3D estimation strategies—such as AirPose (Saini et al., 2022)—which leverage multiple UAVs, distributed inference, and implicit cross-view parametric fusion for 3D pose and shape recovery. FlyPose targets single-camera, 2D keypoint estimation in real time for safety-critical UAV deployments at the edge, whereas AirPose tackles the added complexity of decentralized multi-agent 3D reconstruction. Both share hardware deployment challenges, sensitivity to occlusion, and reliance on comprehensive data annotation.
Continued advances in dataset creation (e.g., FlyPose-104), network architectures, and edge deployment protocols are likely to drive improved robustness and generalizability of pose systems across diverse airborne applications.