UAVScenes: Multi-Modal UAV Dataset

Updated 4 July 2026

UAVScenes is a multi-modal UAV dataset featuring over 120k image–LiDAR pairs with frame-wise semantic annotations and accurate 6-DoF poses.
It extends the MARS-LVIG corpus by integrating detailed semantic labels for both images and LiDAR, supporting tasks such as 2D/3D segmentation, monocular depth estimation, and visual localization.
Its precise sensor synchronization and rigorous geometric calibration enable robust cross-modal perception research while addressing challenges like class imbalance and domain shift.

UAVScenes is a large-scale, multi-modal UAV perception dataset built upon the well-calibrated MARS-LVIG SLAM dataset and extended with frame-wise semantic annotations for both images and LiDAR point clouds, as well as accurate per-frame 6-degree-of-freedom poses. Its design target is not a single benchmark family but a unified evaluation substrate for 2D semantic segmentation, 3D semantic segmentation, monocular depth estimation, 6-DoF visual localization, place recognition, and novel view synthesis. The dataset contains over 120k image–LiDAR frame pairs with labels and poses, spans multiple outdoor traversals under daytime and evening conditions, and reorients a SLAM-oriented corpus toward high-level scene understanding and cross-modal perception research (Wang et al., 30 Jul 2025).

1. Provenance, scope, and benchmark intent

UAVScenes was introduced to address a specific limitation in the UAV dataset landscape: most existing multi-modal UAV datasets either target SLAM and 3D reconstruction without semantic labels, or provide only map-level semantics rather than frame-wise annotations. UAVScenes therefore augments MARS-LVIG with manually labeled semantic annotations for both frame-wise images and LiDAR point clouds, accurate 6-DoF poses reconstructed per frame and geo-aligned to reconstructed maps, and a standardized benchmark covering six tasks: 2D semantic segmentation, 3D semantic segmentation, monocular depth estimation, 6-DoF visual localization, place recognition, and novel view synthesis (Wang et al., 30 Jul 2025).

The dataset scope is explicitly large-scale and outdoor. It comprises 23 sequences in total, of which 20 sequences have full frame-wise annotations and 6-DoF poses, while 3 “Featureless_GNSS” sequences have dynamic-object instance labels only. The 20 fully annotated sequences are organized into 8 splits, with 1–3 sequences per split, and each split is reconstructed into a 3D map with per-frame 6-DoF poses. The environments include rural towns, valleys, an aero-model airfield, and islands or coastal scenes; illumination diversity is represented primarily through daytime and “Evening” traversals rather than explicit weather variation.

A recurrent misconception is to treat UAVScenes as a generic label for “UAV scenes.” In fact, it is a specific benchmark dataset with a defined sensor suite, annotation protocol, and task inventory. Another common misunderstanding is to regard it as a dataset created from scratch; the paper instead presents it as an enhancement of MARS-LVIG, with additional semantics and geometry sufficient to support tasks far beyond the original SLAM emphasis.

2. Sensor platform, synchronization, and geometric foundations

The acquisition platform is a DJI M300 RTK industrial UAV. The mounted sensors in MARS-LVIG include the DJI L1, a Hikvision RGB camera, and a Livox Avia LiDAR, but UAVScenes uses the Hikvision camera and Livox Avia LiDAR for all per-frame annotations because DJI L1 LiDAR frames are not accessible per frame due to manufacturer encryption. Camera and Livox Avia are hardware-synchronized at 10 Hz, and the original data are recorded as ROS bags with topics /left_camera/image/compressed, /livox/lidar, and /dji_osdk_ros/rtk_position (Wang et al., 30 Jul 2025).

The RGB image resolution is $2448 \times 2048 \times 3$ per frame. LiDAR point count per frame and field of view are not specified in the paper, though outlier filtering is applied in preprocessing. GNSS/RTK positions are provided and interpolated to image frames for SfM initialization. Representative flight parameters vary by sequence family: AMtown is flown at approximately 80 m altitude with speeds around $4/8/12$ m/s; AMvalley at approximately 130 m with similar speeds; HKairport at approximately 80 m with speeds around $3/6/9$ m/s; and HKisland at approximately 90 m with speeds around $3/6/9$ m/s. Sequence durations range from about 365 to 1,354 seconds.
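Interpolating GNSS/RTK positions to image timestamps can be sketched as below. This is a minimal illustration rather than the authors' pipeline: the function name, the array layouts, the choice of linear interpolation, and the sample rates in the example are all assumptions, and positions are assumed to be already expressed in a local metric frame.

```python
import numpy as np

def interpolate_positions(gnss_times, gnss_xyz, image_times):
    """Linearly interpolate GNSS positions (N, 3) to image timestamps (M,)."""
    gnss_times = np.asarray(gnss_times, dtype=float)
    gnss_xyz = np.asarray(gnss_xyz, dtype=float)
    image_times = np.asarray(image_times, dtype=float)
    # np.interp handles one coordinate at a time and expects increasing times.
    # Timestamps outside the GNSS range are clamped to the first/last fix.
    return np.stack(
        [np.interp(image_times, gnss_times, gnss_xyz[:, k]) for k in range(3)],
        axis=1,
    )

# Hypothetical RTK fixes every 0.2 s and image timestamps every 0.1 s.
gnss_t = np.array([0.0, 0.2, 0.4])
gnss_p = np.array([[0.0, 0.0, 80.0], [0.8, 0.0, 80.0], [1.6, 0.0, 80.0]])
img_t = np.array([0.0, 0.1, 0.2, 0.3])
img_p = interpolate_positions(gnss_t, gnss_p, img_t)
```

Interpolated positions of this kind only initialize SfM; they carry no orientation, which is why the 6-DoF poses still have to be reconstructed.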

The dataset reuses camera–LiDAR intrinsics and extrinsics from MARS-LVIG and validates alignment by rendering views from reconstructed 3D maps overlaid with real images. MARS-LVIG provides 4-DoF $(x,y,z,\text{yaw})$ via RTK and maps, whereas UAVScenes reconstructs full 6-DoF per-frame poses using UAV-oriented SfM via DJI Terra, taking GNSS coordinates as initialization. Each split yields a 3D map and 6-DoF poses for all frames; reconstruction time per split is reported as approximately 3–10 hours on an i9-13900K with $2 \times$ RTX 4090.

The paper formalizes the geometric interface in standard terms. A rigid transform in $SE(3)$ is written as

$T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}, \qquad R \in SO(3),\ t \in \mathbb{R}^3.$

Camera projection of a world point $X_w$ is

$x_c \sim K [R_{cw} \mid t_{cw}] X_w,$

and the LiDAR-to-camera transform is

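$X_c = R_{cl} X_l + t_{cl}, \qquad T_{cl} = \begin{bmatrix} R_{cl} & t_{cl} \\ 0 & 1 \end{bmatrix} \in SE(3),$

where $X_l$ is a point in the LiDAR frame and $X_c$ the same point in the camera frame.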
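The projection chain (LiDAR frame to camera frame, then through the intrinsics $K$) can be sketched as follows. The function name, the 4×4 matrix `T_cl`, and the intrinsic values in the example are hypothetical; only the $2448 \times 2048$ image size comes from the text. Lens distortion is ignored, which is a simplifying assumption.

```python
import numpy as np

def project_lidar_to_image(points_lidar, K, T_cl, width, height):
    """Project LiDAR points (N, 3) into pixel coordinates.

    Returns (uv, depth, keep): pixel coordinates and depths of the points
    that fall inside the image, plus the boolean mask over the N inputs.
    """
    points_lidar = np.asarray(points_lidar, dtype=float)
    R, t = T_cl[:3, :3], T_cl[:3, 3]
    points_cam = points_lidar @ R.T + t        # X_c = R X_l + t
    z = points_cam[:, 2]
    in_front = z > 1e-6                        # drop points behind the camera
    uvw = points_cam @ K.T                     # homogeneous pixel coordinates
    safe_z = np.where(in_front, z, 1.0)        # avoid dividing by z <= 0
    uv = uvw[:, :2] / safe_z[:, None]
    keep = (
        in_front
        & (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return uv[keep], z[keep], keep

# Hypothetical intrinsics and an identity extrinsic, for illustration only.
K = np.array([[1000.0, 0.0, 1224.0],
              [0.0, 1000.0, 1024.0],
              [0.0, 0.0, 1.0]])
T_cl = np.eye(4)
points = np.array([[0.0, 0.0, 10.0], [1.0, 0.0, 10.0], [0.0, 0.0, -5.0]])
uv, depth, keep = project_lidar_to_image(points, K, T_cl, width=2448, height=2048)
```

In this example the third point lies behind the camera and is discarded, while the first lands on the principal point.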

For localization, pose error metrics are defined as

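$e_t = \lVert \hat{t} - t \rVert_2, \qquad e_R = \arccos\!\left( \frac{\operatorname{tr}(\hat{R}^{\top} R) - 1}{2} \right),$

where $(\hat{R}, \hat{t})$ is the estimated pose and $(R, t)$ the ground-truth pose.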
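A common way to compute these two errors from 4×4 pose matrices is sketched below, assuming the usual Euclidean translation error and geodesic rotation angle. The function and helper names are hypothetical, and the example poses are made up.

```python
import numpy as np

def pose_errors(T_est, T_gt):
    """Return (translation error in pose units, rotation error in degrees)."""
    t_err = float(np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3]))
    R_rel = T_est[:3, :3].T @ T_gt[:3, :3]
    # Clipping guards against round-off pushing the cosine outside [-1, 1].
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    r_err = float(np.degrees(np.arccos(cos_angle)))
    return t_err, r_err

def make_pose(yaw_deg, t):
    """Build a 4x4 pose from a yaw angle (degrees) and a translation."""
    a = np.radians(yaw_deg)
    T = np.eye(4)
    T[:3, :3] = [[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a), np.cos(a), 0.0],
                 [0.0, 0.0, 1.0]]
    T[:3, 3] = t
    return T

T_gt = make_pose(0.0, [0.0, 0.0, 0.0])
T_est = make_pose(10.0, [3.0, 4.0, 0.0])
t_err, r_err = pose_errors(T_est, T_gt)   # 5 units of translation, 10 degrees
```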

These geometric provisions are central to why UAVScenes supports both perception and spatial reasoning tasks. The dataset is not merely multi-sensor; it is calibrated and pose-resolved at per-frame granularity, which is what permits image–LiDAR projection, absolute localization benchmarking, and point-cloud–initialized rendering pipelines.

3. Annotation model and semantic taxonomy

UAVScenes provides 120k labeled images and 120k labeled LiDAR frames. The label space has 19 classes in total: 16 static classes, 2 dynamic classes, and 1 background class. The static classes are Roof, Dirt Road, Paved Road, River (sea), Pool (reservoir), Bridge, Container, Airstrip, Traffic Barrier, Green Field (grass/trees/farmland), Wild Field (sparse vegetation/soil), Solar Panel, Umbrella, Transparent Roof, Car Park, and Paved Walk. The dynamic classes are Sedan and Truck, with Sedan defined to include sedans, SUVs, MPVs, wagons, and hatchbacks, and Truck defined as cargo vehicles (Wang et al., 30 Jul 2025).

The labeling pipeline is hybrid 3D-first and image-first. Static classes are annotated on the reconstructed 3D map and then rendered into frame-wise image masks using SfM poses. Dynamic objects are manually annotated per image using instance-wise polygons. Static and dynamic labels are then fused into the final per-image semantic labels. Image labels are projected into LiDAR using the known calibration, after which per-frame LiDAR labels are manually checked and corrected.
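The image-to-LiDAR transfer step can be illustrated with a nearest-pixel lookup. This sketch is an assumption about how such a transfer might look, not the authors' tool: the function name, the `IGNORE` id, and the toy intrinsics are hypothetical, and occlusion is not handled, which is one reason a manual correction pass is still needed.

```python
import numpy as np

IGNORE = 255  # hypothetical id for points that receive no label

def transfer_image_labels(points_lidar, label_mask, K, T_cl):
    """Assign each LiDAR point the label of the pixel it projects onto."""
    h, w = label_mask.shape
    pts = np.asarray(points_lidar, dtype=float)
    cam = pts @ T_cl[:3, :3].T + T_cl[:3, 3]
    labels = np.full(len(pts), IGNORE, dtype=label_mask.dtype)
    front = cam[:, 2] > 1e-6                   # only points in front of the camera
    uvw = cam[front] @ K.T
    # floor (not truncation) so that slightly negative pixels stay out of bounds
    u = np.floor(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.floor(uvw[:, 1] / uvw[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(front)[inside]
    labels[idx] = label_mask[v[inside], u[inside]]
    return labels

# Toy 4x6 mask: left half is class 1, right half is class 2.
mask = np.zeros((4, 6), dtype=np.uint8)
mask[:, :3] = 1
mask[:, 3:] = 2
K = np.array([[1.0, 0.0, 3.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])
pts = np.array([[-2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [10.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
point_labels = transfer_image_labels(pts, mask, K, np.eye(4))
```

Points that project outside the image or lie behind the camera keep the `IGNORE` id and would be left for manual labeling.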

The toolchain is explicitly documented. CloudCompare is used for 3D map annotation and visualization. X-AnyLabeling is used for image instance annotation, with auto-tracking assistance and manual verification or fixing. Quality control extends across modalities: camera–LiDAR annotation transfer is followed by manual correction at the LiDAR frame level.

The dataset also reports dynamic instance statistics. There are approximately 270k sedan instances and approximately 14k truck instances. Average polygon areas and occupancies are reported; for example, sedan polygons average approximately 3,210 pixels and truck polygons approximately 6,873 pixels. The paper notes a known class-imbalance difficulty: Pool is very rare in LiDAR, and baseline 3D segmentation methods achieve 0 IoU on that class.

This annotation design produces a particular kind of consistency. Because static classes are rendered from annotated 3D maps, class labels remain consistent across sequences within a split. That is materially different from frame-by-frame semantic painting alone, especially in a dataset intended to connect image segmentation, LiDAR segmentation, localization, and view synthesis under a common world model.

4. Benchmarked tasks, metrics, and baseline results

UAVScenes includes official baselines for six tasks. The metrics are task-specific but standardized. For semantic segmentation, the paper uses

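$\mathrm{IoU}_c = \frac{TP_c}{TP_c + FP_c + FN_c}, \qquad \mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{IoU}_c,$

where $TP_c$, $FP_c$, and $FN_c$ are the true positives, false positives, and false negatives for class $c$, and $C$ is the number of classes.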
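Mean IoU is conventionally accumulated through a confusion matrix, as in the sketch below. The function names and the `ignore_index` value are assumptions, and excluding classes that never occur from the mean is one common convention, not necessarily the one used in the official evaluation scripts.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes, ignore_index=255):
    """Rows index the ground-truth class, columns the predicted class."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    valid = target != ignore_index
    idx = target[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def iou_per_class(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    denom = tp + fp + fn
    # Classes absent from both prediction and ground truth become NaN.
    return np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)

def mean_iou(cm):
    return float(np.nanmean(iou_per_class(cm)))

target = np.array([0, 0, 1, 1, 2, 255])   # last pixel is ignored
pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(pred, target, num_classes=3)
ious = iou_per_class(cm)                  # 1/2, 2/3, 1
miou = mean_iou(cm)                       # 13/18
```

A class that is present but never predicted correctly contributes an IoU of 0 to the mean, so a single rare class can pull mIoU down noticeably.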

For novel view synthesis, it reports PSNR, SSIM, and LPIPS. For monocular depth estimation, it reports AbsRel together with several further depth metrics. For absolute pose estimation, it uses the rotation and translation errors defined in the previous section (Wang et al., 30 Jul 2025).

Task	Baselines	Representative result
2D semantic segmentation	UperNet with ResNet, ConvNeXt, ViT, MambaOut, DeiT3 backbones	DeiT3-m: 68.3 mIoU
3D semantic segmentation	SPUNet, PTv2, MinkUNet	SPUNet: 34.4 mIoU
Place recognition	GeM, RRM, ConvAP, MixVPR, AnyLoc, SALAD, MinkLoc3D, BEVPlace, MinkLoc++, AdaFusion	SALAD (DINOv2-s): Recall@1/5/10 = 67.1 / 76.4 / 79.8
Novel view synthesis	Instant-NGP, 3DGS, GaussianPro, DCGaussian, Pixel-GS	AMvalley example: 3DGS 25.12 PSNR
6-DoF visual localization	PoseNet, AtLoc, RobustLoc, ACE, GLACE, FocusTune	RobustLoc: approximately 6.1 m / 0.1°
Monocular depth estimation	ZoeDepth, Depth Anything, Metric3D, Marigold, GeoWizard	Metric3D V2 (ViT-L): lowest AbsRel, approximately 0.540

The 2D semantic segmentation benchmark shows a clear backbone hierarchy. Transformer backbones consistently outperform CNNs, with DeiT3-m reaching 68.3 mIoU, DeiT3-s 67.6, ViT-s approximately 63.9, and ResNet-50 approximately 61.3. In 3D semantic segmentation, the task remains substantially harder: SPUNet reaches 34.4 mIoU, PTv2 33.2, and MinkUNet 32.7. The paper attributes some of this difficulty to airborne point-cloud properties such as sparser vertical structure, long ranges, and class imbalance.

Place recognition results further differentiate modalities. Image-only descriptors based on foundation backbones are strongest: SALAD with DINOv2-s reaches Recall@1/5/10 of 67.1 / 76.4 / 79.8, compared with AnyLoc at 58.5 / 74.4 / 79.1. LiDAR-only baselines are lower, with MinkLoc3D V2 at 42.8 / 61.5 / 67.3 and BEVPlace at 32.6 / 54.6 / 64.2; the paper notes that bird's-eye-view (BEV) representations are less effective for UAV geometry. Fusion methods, including MinkLoc++ at 47.1 / 63.5 / 69.0 and AdaFusion at 46.3 / 63.4 / 70.2, improve on the LiDAR-only baselines but remain below the best image-only descriptors, and are still challenged by aerial viewpoint changes and wide baselines.
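Recall@K for descriptor-based retrieval can be sketched as follows. The function name and the cosine-similarity ranking are assumptions, as is the way positives are supplied: here each query comes with a precomputed set of correct database indices, and every query is assumed to have at least one. How positives are defined in the official protocol is not restated here.

```python
import numpy as np

def recall_at_k(query_desc, db_desc, positives, ks=(1, 5, 10)):
    """positives[i] holds the database indices that count as correct for query i."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    d = db_desc / np.linalg.norm(db_desc, axis=1, keepdims=True)
    sim = q @ d.T                              # cosine similarity
    ranking = np.argsort(-sim, axis=1)         # best match first
    recalls = {}
    for k in ks:
        hits = [
            bool(set(ranking[i, :k].tolist()) & set(positives[i]))
            for i in range(len(q))
        ]
        recalls[k] = float(np.mean(hits))
    return recalls

# Toy 2-D descriptors: three database entries, two queries.
db = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
queries = np.array([[0.9, 0.1], [0.6, 0.8]])
positives = [{0}, {0}]
recalls = recall_at_k(queries, db, positives, ks=(1, 2))
```

In the toy example the second query retrieves its correct match only at rank 2, so Recall@1 is 0.5 and Recall@2 is 1.0.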

For novel view synthesis, Instant-NGP underperforms due to large scene scale, while Gaussian Splatting variants perform better. On AMvalley, example PSNR values are 25.12 for 3DGS, 25.09 for GaussianPro, and 25.11 for Pixel-GS, with SSIM around 0.576 and LPIPS around 0.514–0.516. Hard regions include repetitive building facades and dense forests.
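For reference, PSNR between a rendered and a held-out image follows directly from the mean squared error. The sketch below assumes images scaled to $[0, 1]$; the function name is hypothetical. SSIM and LPIPS need dedicated implementations and are not covered here.

```python
import numpy as np

def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_value]."""
    rendered = np.asarray(rendered, dtype=np.float64)
    reference = np.asarray(reference, dtype=np.float64)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0:
        return float("inf")                    # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

reference = np.zeros((4, 4, 3))
rendered = np.full((4, 4, 3), 0.1)             # uniform error of 0.1
value = psnr(rendered, reference)              # MSE = 0.01, so about 20 dB
```

On this scale, the reported values near 25 dB correspond to a root-mean-square pixel error of roughly 5 to 6 percent of the intensity range.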

For 6-DoF visual localization, absolute pose regression (APR) baselines outperform scene coordinate regression (SCR) baselines in this large-scale aerial setting. Among the APR methods, RobustLoc averages approximately 6.1 m / 0.1°, AtLoc approximately 8.6 m / 0.1°, and PoseNet approximately 25.8 m / 0.2°. GLACE, an SCR method, averages approximately 85.5 m / 0.6°, and ACE and FocusTune also have larger errors than the APR baselines reported here. The paper interprets this as evidence that SCR models trained on ground scenes suffer from a domain gap under UAV viewpoints.

Zero-shot monocular depth estimation remains difficult. Metric3D V2 with ViT-L achieves the lowest AbsRel, at approximately 0.540, and the lowest value on a second error metric, at approximately 31.960; Depth Anything V2 attains the best score on another of the reported metrics. Diffusion-based affine-invariant models such as Marigold and GeoWizard lag on metric depth without alignment. This benchmark therefore exposes a pronounced gap between ground-view pretraining and high-altitude UAV deployment.
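The effect of alignment on AbsRel can be illustrated with a small sketch. The function names, the validity threshold, and the depth values are hypothetical. Median scaling is used here as a simple scale-only stand-in, whereas affine-invariant models are typically aligned in both scale and shift; it is not claimed to be the protocol used in the paper.

```python
import numpy as np

def abs_rel(pred, gt, min_depth=1e-3):
    """Mean absolute relative error over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > min_depth
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def median_scale(pred, gt, min_depth=1e-3):
    """Rescale predictions so their median matches the ground-truth median."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > min_depth
    return pred * (np.median(gt[valid]) / np.median(pred[valid]))

gt = np.array([80.0, 100.0, 120.0, 0.0])       # 0 marks a pixel without depth
pred = np.array([40.0, 50.0, 60.0, 10.0])      # correct shape, half the scale
raw_error = abs_rel(pred, gt)                  # 0.5
aligned_error = abs_rel(median_scale(pred, gt), gt)  # 0.0 after rescaling
```

A prediction with the right relative structure but the wrong scale scores poorly on metric AbsRel, which is consistent with scale-ambiguous models lagging when no alignment is applied.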

5. Position within the UAV dataset ecosystem

UAVScenes is best understood relative to several neighboring but non-equivalent datasets. Compared with its base dataset MARS-LVIG, the central addition is semantic and task breadth: MARS-LVIG focuses on SLAM with 4-DoF poses and maps but no frame-wise semantics, whereas UAVScenes adds frame-wise image and LiDAR annotations, reconstructed 6-DoF poses, and per-split 3D maps suitable for segmentation, depth, localization, retrieval, and rendering (Wang et al., 30 Jul 2025).

Relative to camera-only urban segmentation benchmarks, UAVScenes trades urban density for multimodal completeness. UAVid provides 30 oblique-view 4K UAV video sequences, 300 densely labeled images, and 8 semantic classes tailored to urban street scenes, including separate static-car and moving-car labels, but it is RGB-only and does not provide LiDAR or per-frame 6-DoF multi-modal supervision (Lyu et al., 2018). That makes UAVid highly relevant for high-resolution oblique semantic segmentation, but not for unified 2D–3D benchmarking.

Relative to synthetic UAV sources, UAVScenes occupies the real-data end of the spectrum. SkyScenes is a CARLA-based synthetic dataset with 33,600 images, synchronized RGB, semantic, instance, and depth outputs, controlled variation across maps, weather, times of day, heights, and pitch angles, and stored metadata for deterministic regeneration (Khose et al., 2023). FlyAwareV2 is a mixed-reality multimodal dataset for urban scene understanding with approximately 288k synthetic frames and approximately 2k real frames. It offers RGB, depth, and semantics, four weather profiles, synthetic-to-real UDA protocols, and coarse/fine taxonomy mapping, but it does not provide LiDAR point clouds or the same frame-wise image–LiDAR annotation regime as UAVScenes (Barbato et al., 15 Oct 2025).

UAVScenes is also distinct from reconstruction-focused datasets. UAVLight is a benchmark for illumination-robust 3D reconstruction with 18 real outdoor scenes, repeatable geo-referenced flight paths, multiple fixed times of day, RTK-regularized bundle adjustment, per-slot sun directions, and photometric evaluation protocols based on PSNR, SSIM, and LPIPS (Du et al., 26 Nov 2025). Its target problem is cross-illumination SfM, MVS, inverse rendering, and relightability, rather than the broad frame-wise perception suite emphasized by UAVScenes.

Finally, UAVScenes should not be confused with anomaly-detection corpora. MUAAD, introduced as the Manipal UAV Anomalous Activity Dataset, contains 60 UAV videos from 9 campus locations with frame-level anomaly labels and four-class contextual masks for anomaly detection, and the paper explicitly states that the dataset introduced there is MUAAD, not UAVScenes (S et al., 2022).

Within this ecosystem, the paper states that UAVScenes is, “to our knowledge,” the first to provide real-world, frame-wise semantic annotations for both images and LiDAR with accurate 6-DoF poses at this scale. That claim is framed comparatively against datasets that are SLAM-centric, camera-only, synthetic, or map-labeled rather than per-frame cross-modally labeled.

6. Practical use, limitations, and research implications

UAVScenes is distributed with a public repository containing the dataset page, loaders, training and evaluation scripts, and official train/test split instructions. It provides extracted images, filtered per-frame Livox Avia point clouds, GNSS/RTK interpolated to frames, camera intrinsics, LiDAR–camera extrinsics, per-frame camera poses, and semantic labels. The paper also includes minimal usage patterns for reading a frame and pose, projecting LiDAR into the image plane, and computing localization error (Wang et al., 30 Jul 2025).

The authors give concrete usage recommendations. For segmentation and cross-modal labeling, they recommend using the provided LiDAR–camera extrinsics to project LiDAR into the image or colorize LiDAR by image semantics; for point labeling, they note that a 2D-to-3D projection pipeline should be refined in 3D to address occlusions and label leakage. For class imbalance, especially for severely underrepresented categories such as Pool, they recommend class-aware sampling, loss reweighting, or focal losses, and suggest that extremely rare classes may need to be merged or down-weighted for robust training. For reproducibility, they recommend following the official splits and scripts, preserving one-to-one image–LiDAR pairing at 10 Hz, and respecting the task-specific resolutions used in the paper for NVS, the 6-DoF localization baselines, the 2D segmentation crops, and depth evaluation.
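Loss reweighting for class imbalance can be sketched in a few lines. This is an illustrative scheme under stated assumptions, not the authors' recipe: the function names, the inverse-frequency form with a tempering `power`, and the example counts are all hypothetical.

```python
import numpy as np

def class_weights(label_counts, power=0.5):
    """Tempered inverse-frequency weights, normalized to mean 1 over seen classes."""
    counts = np.asarray(label_counts, dtype=float)
    freq = counts / counts.sum()
    weights = np.zeros_like(freq)
    seen = freq > 0                            # classes with no samples get weight 0
    weights[seen] = freq[seen] ** (-power)
    weights[seen] /= weights[seen].mean()
    return weights

def weighted_cross_entropy(logits, targets, weights):
    """Cross-entropy averaged with per-class weights (weighted mean)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    w = weights[targets]
    return float((w * nll).sum() / w.sum())

# Hypothetical point counts for four classes; the last never occurs.
weights = class_weights([900, 90, 10, 0])
loss = weighted_cross_entropy(np.zeros((2, 3)), np.array([0, 2]), weights)
```

Setting `power` below 1 tempers the weights so that a very rare class does not dominate the loss; with `power=1` the scheme reduces to plain inverse-frequency weighting.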

The limitations are explicit. Scene diversity is large-scale but biased toward rural towns, valleys, airfields, and island or coastal settings; dense downtown urban canyons, complex traffic, and heavy pedestrian scenarios are limited. Most flights are in the 80–130 m altitude range, so extreme low- or high-altitude regimes are sparse. Weather diversity is limited: the dataset contains day and evening illumination changes, but broader weather coverage is not emphasized. Sensor constraints are also material, since DJI L1 LiDAR encryption prevents the use of those per-frame point clouds and forces the benchmark to focus on Livox Avia for LiDAR annotations. On the annotation side, some categories, especially Pool, are underrepresented in LiDAR, and the dynamic classes are restricted to sedan and truck.

These limitations define the current research frontier around UAVScenes. The strong 2D results alongside weaker 3D segmentation, retrieval fusion, and zero-shot depth performance indicate that multimodal aerial perception remains substantially open even when frame-wise labels and calibrated poses are available. A plausible implication is that future work on UAVScenes will focus less on raw dataset construction and more on cross-modal fusion, aerial-specific pretraining, imbalance-robust learning, and geometry-aware adaptation of models that were originally developed for ground-view domains.