- The paper introduces the synthetic C2A dataset that overlays diverse human poses on disaster backgrounds to enhance SAR operations.
- It benchmarks multiple state-of-the-art object detectors, with YOLOv9-e achieving the highest mAP scores (0.6883 mAP, 0.8927 mAP@0.5).
- Results highlight that combining domain-specific and general human datasets significantly improves model generalization in UAV-assisted SAR operations.
The paper introduces the Combination to Application (C2A) dataset, a novel synthetic dataset designed to address the critical lack of specialized human detection datasets for Unmanned Aerial Vehicle (UAV)-assisted Search and Rescue (SAR) operations in disaster scenarios. Existing datasets often fail to capture the complexities of disaster environments, such as partial occlusion and diverse human poses within chaotic backgrounds, hindering the development of effective machine learning models for this domain.
To bridge this gap, the C2A dataset was synthesized by overlaying human poses from the LSP/MPII-MPHB dataset onto disaster scene backgrounds sourced from the AIDER dataset. The AIDER dataset provides realistic images from four disaster types: Fire/Smoke, Flood, Collapsed Building/Rubble, and Traffic Accidents. The LSP/MPII-MPHB dataset contributes diverse human poses, including bent, kneeling, sitting, upright, and lying positions, which are critical for recognizing individuals in various states in a disaster.
The dataset creation pipeline involves several steps:
- Background Removal and Image Preparation: Human figures are isolated from the LSP/MPII-MPHB images using the U2-Net segmentation model (2002.08906).
- Image Cropping and Cleaning: Isolated figures are cropped to focus on the human subject, and images with minimal foreground content are excluded.
- Overlay Process: Human figures are randomly scaled and overlaid onto AIDER disaster background images at random positions, and a bounding box is generated for each overlaid figure (a minimal sketch of this step follows the list).
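The overlay step can be illustrated with a minimal sketch, assuming the cutouts produced by the segmentation stage are RGBA images with the background already removed; the scale range, placement logic, and function name here are illustrative assumptions, not the authors' implementation:

```python
# Minimal sketch of the overlay step (illustrative; not the authors' code).
# Assumes the person cutout is an RGBA image whose alpha channel comes from the
# segmentation stage; the scale range and placement rule are assumptions.
import random
from PIL import Image

def overlay_person(background: Image.Image, person: Image.Image,
                   scale_range=(0.05, 0.3)):
    """Paste one person cutout onto a disaster background (in place) and return its bbox."""
    bg_w, bg_h = background.size
    # Randomized scaling relative to the background height, preserving aspect ratio.
    scale = random.uniform(*scale_range)
    new_h = max(1, int(bg_h * scale))
    new_w = max(1, int(person.width * new_h / person.height))
    person = person.resize((new_w, new_h))
    # Random placement that keeps the figure fully inside the frame.
    x = random.randint(0, max(0, bg_w - new_w))
    y = random.randint(0, max(0, bg_h - new_h))
    background.paste(person, (x, y), mask=person)  # alpha-aware paste
    return (x, y, x + new_w, y + new_h)            # bounding box in xyxy pixels
```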
The resulting C2A dataset comprises 10,215 images with over 360,000 annotated human instances. Key characteristics of the dataset include a wide range of image resolutions, a significant proportion of small objects (47% under 10 pixels), a majority of objects with aspect ratios less than 1 (wider than tall), and a high object density (peaking at 20-40 objects per image, with some images containing up to 100). Notably, the dataset includes annotations for both the human pose (one of five categories) and the type of disaster scene, providing valuable contextual information.
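To make the contextual annotations concrete, a per-instance record might look like the following; the field and category names are assumptions for illustration, not the dataset's actual schema:

```python
# Hypothetical annotation record illustrating the information C2A attaches to each
# overlaid figure: a bounding box, one of five pose categories, and the disaster
# type of the background scene. Field names are assumptions, not the real schema.
from dataclasses import dataclass
from typing import Tuple

POSES = ("bent", "kneeling", "lying", "sitting", "upright")
DISASTERS = ("fire_smoke", "flood", "collapsed_building", "traffic_accident")

@dataclass
class C2AAnnotation:
    image_id: str
    bbox_xyxy: Tuple[float, float, float, float]  # pixel coordinates
    pose: str        # one of POSES
    disaster: str    # one of DISASTERS
```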
The researchers benchmarked state-of-the-art object detection models on the C2A dataset, including Faster R-CNN (1506.01497), RetinaNet (1708.02002), Cascade R-CNN (1712.00726), DINO (2303.05499), RTMDet (2212.07784), YOLOv5 (2207.02696), YOLOv9-c (2402.13616), and YOLOv9-e (2402.13616). The evaluation used mAP and mAP@0.5 metrics. YOLOv9-e achieved the highest mAP (0.6883) and mAP@0.5 (0.8927), demonstrating superior performance on this challenging dataset. The analysis revealed that models generally performed better at the lower IoU threshold (mAP@0.5), suggesting that refining bounding box precision is an area for improvement.
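The gap between the two metrics comes down to how strictly a predicted box must overlap the ground truth: a detection counts as a true positive at threshold t only if IoU ≥ t, and the overall mAP figure is typically averaged over thresholds stricter than 0.5 (COCO-style 0.5-0.95). A generic IoU check, not tied to the paper's evaluation code, makes the effect concrete:

```python
# Generic IoU computation (not the paper's evaluation code). Loosely placed boxes
# that pass at IoU 0.5 can fail at the stricter thresholds folded into overall mAP,
# which is especially punishing for small objects.
def iou(box_a, box_b):
    """Boxes are (x1, y1, x2, y2) in pixels."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# A 2-pixel shift on a 10-pixel person:
print(iou((100, 100, 110, 110), (102, 100, 112, 110)))  # ~0.67: a hit at 0.5, a miss at 0.75
```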
A crucial finding from comparative training experiments is the importance of combining domain-specific data (C2A) with general human detection datasets. While training solely on C2A improves performance on C2A itself, training on a combination of "General Human" datasets (like CrowdHuman (1805.00123), Tiny Person (2001.06362), VisDrone (1901.01672)) and C2A yields the best generalization across general, real-world SAR (SARD (2103.07237)), and C2A validation sets. This indicates that domain adaptation through dataset combination is vital for building robust SAR detection models.
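In practice, this kind of dataset combination can be realized by concatenating the sources into a single training set once their annotations share one format. The sketch below uses PyTorch with stub dataset classes standing in for the real wrappers; it illustrates the mixing strategy, not the authors' training setup:

```python
# Sketch of mixing general human-detection data with domain-specific C2A data at
# training time. The stub class stands in for real wrappers around CrowdHuman,
# TinyPerson, VisDrone, and C2A, each converted to a single "person" class first.
import torch
from torch.utils.data import Dataset, ConcatDataset, DataLoader

class StubDetectionDataset(Dataset):
    """Placeholder standing in for a wrapper around one source dataset."""
    def __init__(self, n):
        self.n = n
    def __len__(self):
        return self.n
    def __getitem__(self, idx):
        image = torch.zeros(3, 640, 640)                         # dummy image tensor
        target = {"boxes": torch.tensor([[10., 10., 30., 50.]]),  # xyxy boxes
                  "labels": torch.tensor([1])}                    # single "person" class
        return image, target

general = ConcatDataset([StubDetectionDataset(100) for _ in range(3)])  # CrowdHuman, TinyPerson, VisDrone
combined = ConcatDataset([general, StubDetectionDataset(100)])          # + C2A
loader = DataLoader(combined, batch_size=16, shuffle=True,
                    collate_fn=lambda batch: tuple(zip(*batch)))        # detection-style collate
```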
Analysis of object size versus detection confidence highlighted that smaller objects (under 20 pixels) are significantly harder to detect accurately compared to larger objects, a common challenge in aerial imagery. This suggests that future model optimization should focus on improving sensitivity to tiny objects.
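This size-versus-confidence analysis amounts to binning detections by box size and averaging the detector confidence per bin; the bin edges and input format in the sketch below are assumptions for illustration:

```python
# Sketch of a size-versus-confidence analysis: bin detections by box size (square
# root of box area, in pixels) and average confidence per bin. Bin edges and the
# input format are assumptions, not taken from the paper.
import math
from collections import defaultdict

def confidence_by_size(detections, edges=(0, 10, 20, 40, 80, float("inf"))):
    """detections: iterable of (x1, y1, x2, y2, confidence) tuples."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x1, y1, x2, y2, conf in detections:
        size = math.sqrt(max(0.0, x2 - x1) * max(0.0, y2 - y1))
        for lo, hi in zip(edges, edges[1:]):
            if lo <= size < hi:
                sums[(lo, hi)] += conf
                counts[(lo, hi)] += 1
                break
    return {bin_: sums[bin_] / counts[bin_] for bin_ in counts}

# Per the paper's finding, the (0, 10) and (10, 20) bins should show markedly
# lower average confidence than the larger-object bins.
```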
The paper acknowledges the limitations of the synthetic C2A dataset, such as potentially unrealistic scaling and positioning from the overlay process. However, these artificial variations might also serve as a form of data augmentation. Future work should aim to improve realism through context-aware scaling and potentially using dynamic 3D models. The current dataset uses static images, while real-world SAR often involves video feeds; expanding the dataset to include video sequences is a direction for future enhancement. Incorporating real disaster footage is also crucial for validating and refining models for practical application.
In conclusion, the C2A dataset provides a valuable resource for training and benchmarking human detection models for UAV-assisted SAR in disaster scenarios. The research underscores that combining this tailored synthetic data with general human datasets is key to achieving optimal performance and generalization, significantly enhancing the potential effectiveness of AI-assisted interventions in disaster response.