AirSim360: 360° UAV Simulation Platform
- AirSim360 is a simulation platform that generates native 360° panoramic images with pixel- and entity-level annotations for UAV perception and mapping tasks.
- It leverages Unreal Engine 5 for synchronized multisensor capture, integrating automated minimum-snap trajectory planning and interactive pedestrian simulation.
- The system addresses data scarcity and annotation challenges by providing scalable, render-aligned benchmarks for omnidirectional vision in robotics and navigation.
AirSim360 is a simulation platform incorporating native 360-degree omnidirectional image generation, ground-truth labeling, interactive agent simulation, and automated trajectory planning from drone viewpoints, specifically designed to address the data scarcity and annotation challenges in omnidirectional vision for robotics, mapping, and navigation tasks. The system builds on Unreal Engine 5 (UE5) and presents synchronized multisensor scene capture, pixel- and entity-level ground truth, and scalable flight data collection, facilitating aerial benchmarking for perception and autonomy research requiring true 4D (space-time) panoramic realism (Ge et al., 1 Dec 2025).
1. Motivation and Context
Emerging robotic and computer vision applications demand omnidirectional “spatial intelligence,” including tasks like panoramic mapping, agent-centric navigation, and vision-language grounding. Existing panoramic datasets are limited both in scale (typically only a few thousand annotated images) and in annotation density, due to the high overhead of pixel- and instance-level manual labeling. Predecessor UAV simulators—including AirSim (Shah et al., 2017), CARLA, and UnrealZoo—lack native support for panoramic (360°) image rendering, typically resorting to successive camera rotations and manual frame reassembly. These techniques introduce inefficiencies and misalignments, especially when transferring ground-truth modalities such as depth or semantic segmentations between perspective and equirectangular/cubemap domains.
AirSim360 addresses these gaps by introducing:
- Render-aligned omnidirectional image and annotation generation (direct panoramic rendering with one-to-one modality correspondence)
- Interactive simulation of pedestrian agents with dense skeletal keypoint annotation
- Automated minimum-snap trajectory generation for scalable and diverse data acquisition

Together, these components enable systematic simulation of the real world at both the scene and event levels in a 4D omnidirectional setting.
2. Architecture and Core Modules
AirSim360’s architecture consists of three tightly coupled components:
- UE-based Rendering and Flight Control Loop: Implements synchronized multisensor capture from six monocular pinhole cameras for full 360° coverage, together with flight control and recording.
- Offline Data Collection Toolkit: Handles GPU-side stitching of raw cubemap images into panoramic (equirectangular) images, automated association with pixel- and entity-level ground truth, and quality assurance through alignment metrics.
- Interactive Pedestrian-Aware System (IPAS): Deploys multiple non-player character (NPC) pedestrians controlled by behavior trees and state machines, with real-time streaming of 3D keypoints and an option for social force field modeling.
Render-Aligned Data and Labeling
- Uses six 90° FOV virtual pinhole cameras oriented along ±X, ±Y, and ±Z, yielding six simultaneous square cube-face outputs that are GPU-stitched into a single equirectangular panorama. Spherical projection and cubemap mapping ensure a bijective pixel correspondence (a minimal projection sketch follows this list).
- Pixelwise depth is computed as $d(u,v) = \lVert \mathbf{X}(u,v) - \mathbf{c} \rVert_2$, where $\mathbf{c}$ is the camera center and $\mathbf{X}(u,v)$ is the 3D world point extracted from the UE z-buffer.
- Semantic segmentation leverages the stencil buffer with mesh integer labels mapped to semantic categories.
- Entity segmentation performs two-pass labeling to resolve the 256-label stencil buffer limit, providing per-instance IDs for all actors (static meshes, skeletal meshes, landscapes).
- Render-ground-truth alignment is enforced for each pixel $p$ via the loss $\mathcal{L}_{\mathrm{align}}(p) = \lVert R(p) - G(p) \rVert$, where $R$ denotes the rendered modality and $G$ the saved ground truth, used in QA/calibration.
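The projection chain above can be summarized with a short NumPy sketch (illustrative only, not the platform's API): it maps each equirectangular pixel to a unit ray direction, picks the dominant cube face for that ray, and evaluates the per-pixel metric depth $d(u,v) = \lVert \mathbf{X}(u,v) - \mathbf{c} \rVert_2$ used in the alignment check. The face ordering and the y-up convention are assumptions of this sketch.

```python
import numpy as np

def erp_to_directions(height, width):
    """Unit ray direction for every equirectangular pixel (y-up convention assumed)."""
    # longitude in [-pi, pi), latitude in [-pi/2, pi/2)
    lon = (np.arange(width) + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (np.arange(height) + 0.5) / height * np.pi
    lon, lat = np.meshgrid(lon, lat)
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)            # (H, W, 3)

def dominant_face(dirs):
    """Index of the cube face (+X, -X, +Y, -Y, +Z, -Z) each ray falls on."""
    ax = np.abs(dirs)
    major = np.argmax(ax, axis=-1)                  # 0: x, 1: y, 2: z
    sign = np.take_along_axis(np.sign(dirs), major[..., None], axis=-1)[..., 0]
    return major * 2 + (sign < 0).astype(int)       # 0..5

def metric_depth(world_points, cam_center):
    """Per-pixel depth d(u,v) = ||X(u,v) - c||_2, as used in the QA/alignment check."""
    return np.linalg.norm(world_points - cam_center, axis=-1)
```

Because stitching and ground-truth extraction share this mapping, a QA pass can re-render any modality and compare it against the saved ground truth pixel by pixel, as in the alignment loss above.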
Interactive Pedestrian-Aware System
- NPC behaviors (walk, chat, phone-call) are controlled via state machines and event-driven dispatchers (supporting events such as “meet,” “collide,” “idle”).
- Real-time skeleton keypoint annotation is streamed for each agent by reading bone transformations, with custom joint definitions possible using UE add-socket mechanisms.
- Optional Social Force Model overlays inter-agent repulsion and target-directed behaviors, following the standard form $\mathbf{F}_i = \frac{v_i^0 \mathbf{e}_i - \mathbf{v}_i}{\tau} + \sum_{j \neq i} A \exp\!\left(\frac{r_{ij} - d_{ij}}{B}\right) \mathbf{n}_{ij}$, where the parameter set $(\tau, A, B)$ is calibrated for physical plausibility within UE's physics engine (a minimal NumPy sketch follows this list).
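As a concrete illustration of the optional overlay, the sketch below applies one explicit-Euler step of the standard social force update to a set of NPC positions; the goal time constant `tau` and repulsion parameters `A`, `B` (and the agent radius) are the quantities one would calibrate against UE's physics engine. All names and default values here are illustrative, not part of AirSim360's interface.

```python
import numpy as np

def social_force_step(pos, vel, goals, dt=0.1, v0=1.4, tau=0.5, A=2.0, B=0.3, radius=0.4):
    """One explicit-Euler step of the standard social force model.

    pos, vel, goals: (N, 2) arrays of agent positions, velocities, and goal points.
    """
    # Goal-directed term: relax velocity toward preferred speed v0 along the goal direction.
    to_goal = goals - pos
    e = to_goal / (np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-9)
    f_goal = (v0 * e - vel) / tau

    # Pairwise repulsion: A * exp((2*radius - d_ij) / B) along the separating direction.
    diff = pos[:, None, :] - pos[None, :, :]                  # (N, N, 2), points j -> i
    dist = np.linalg.norm(diff, axis=-1) + 1e-9
    n = diff / dist[..., None]
    mag = A * np.exp((2 * radius - dist) / B)
    np.fill_diagonal(mag, 0.0)                                 # no self-repulsion
    f_rep = (mag[..., None] * n).sum(axis=1)

    vel = vel + dt * (f_goal + f_rep)
    pos = pos + dt * vel
    return pos, vel
```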
Automated Trajectory Generation
- Implements the Minimum-Snap trajectory framework (Mellinger & Kumar, 2011), framing each path segment between waypoints as a 5th-order polynomial $p_i(t) = \sum_{k=0}^{5} a_{i,k} t^k$, with a snap cost $J = \int \left\lVert \frac{d^4 p(t)}{dt^4} \right\rVert^2 dt$, subject to speed and acceleration constraints ($\lVert \dot{p}(t) \rVert \le v_{\max}$, $\lVert \ddot{p}(t) \rVert \le a_{\max}$). The resulting quadratic program is solved per trajectory, enabling efficient, diverse path sampling.
Pseudocode: Minimum-Snap Trajectory Generation
Algorithm 1: Minimum‐Snap Trajectory Generation
Input: waypoints W={w0…wM}, v_max, a_max, sampling Δt
Output: time‐parameterized S(t)=(p(t),v(t),a(t))
1: Partition total flight time into segments [0,T1],…,[T_{M-1},T_M].
2: For each segment i:
– Build polynomial basis p_i(t)=a_{i,0}+…+a_{i,5} t^5.
3: Assemble Q from ∫ (d^4p/dt^4)^2 dt, and constraint matrices A, b.
4: Solve QP for coefficients a_i.
5: Sample t=0:Δt:T_M to obtain p(t), v(t)=p'(t), a(t)=p''(t).
6: Return S(t).
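A minimal single-axis sketch of this QP is given below; it builds the snap Hessian and the equality constraints (waypoint positions, rest-to-rest endpoints, velocity/acceleration/jerk continuity at interior waypoints) and solves the resulting KKT system with NumPy rather than a dedicated QP solver. Time allocation and the $v_{\max}$/$a_{\max}$ feasibility check are omitted, and all function names are illustrative rather than part of the AirSim360 toolchain.

```python
import numpy as np

def snap_cost_block(T, order=5):
    """Hessian of the snap cost  integral_0^T (d^4 p / dt^4)^2 dt  for one segment."""
    Q = np.zeros((order + 1, order + 1))
    for i in range(4, order + 1):
        for j in range(4, order + 1):
            ci = i * (i - 1) * (i - 2) * (i - 3)
            cj = j * (j - 1) * (j - 2) * (j - 3)
            Q[i, j] = ci * cj * T ** (i + j - 7) / (i + j - 7)
    return Q

def deriv_row(t, d, order=5):
    """Row r such that r @ a gives the d-th derivative of the polynomial at time t."""
    row = np.zeros(order + 1)
    for k in range(d, order + 1):
        c = 1.0
        for m in range(d):
            c *= k - m
        row[k] = c * t ** (k - d)
    return row

def min_snap_1d(waypoints, seg_times, order=5):
    """Per-axis minimum-snap coefficients for quintic segments via a KKT solve."""
    M, n = len(seg_times), order + 1
    Q = np.zeros((M * n, M * n))
    rows, rhs = [], []

    def add(seg, row, value):
        full = np.zeros(M * n)
        full[seg * n:(seg + 1) * n] = row
        rows.append(full)
        rhs.append(value)

    for i, T in enumerate(seg_times):
        Q[i * n:(i + 1) * n, i * n:(i + 1) * n] = snap_cost_block(T, order)
        add(i, deriv_row(0.0, 0, order), waypoints[i])        # segment start position
        add(i, deriv_row(T, 0, order), waypoints[i + 1])      # segment end position
    for d in (1, 2):                                          # rest-to-rest endpoints
        add(0, deriv_row(0.0, d, order), 0.0)
        add(M - 1, deriv_row(seg_times[-1], d, order), 0.0)
    for i in range(M - 1):                                    # vel/acc/jerk continuity at knots
        for d in (1, 2, 3):
            full = np.zeros(M * n)
            full[i * n:(i + 1) * n] = deriv_row(seg_times[i], d, order)
            full[(i + 1) * n:(i + 2) * n] = -deriv_row(0.0, d, order)
            rows.append(full)
            rhs.append(0.0)

    A, b = np.array(rows), np.array(rhs)
    kkt = np.block([[2 * Q, A.T], [A, np.zeros((len(b), len(b)))]])
    sol = np.linalg.solve(kkt, np.concatenate([np.zeros(M * n), b]))
    return sol[:M * n].reshape(M, n)                          # one coefficient row per segment

# Example: three waypoints, two segments of 2 s each (run once per x/y/z axis).
coeffs = min_snap_1d(waypoints=[0.0, 5.0, 8.0], seg_times=[2.0, 2.0])
```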
3. Dataset Composition and Annotations
The Omni360-X dataset suite generated via AirSim360 consists of three principal benchmarks:
3.1 Omni360-Scene
| Scene | Area (m²) | #Images | #Semantic Categories |
|---|---|---|---|
| City Park | 800,000 | 25,600 | 25 |
| Downtown West | 60,000 | 6,800 | 29 |
| SF City | 250,000 | 22,000 | 20 |
| New York City | 44,800 | 6,600 | 25 |
Each panorama provides dense pixelwise depth, semantic segmentation (25–29 classes), and per-entity segmentation (over 1 million instances per scene).
3.2 Omni360-Human
| Scene | NPC Count | Area (m × m) | Frames |
|---|---|---|---|
| New York City | 15–45 | 12×12 to 30×30 | 29,000 |
| Lisbon Downtown | 10–45 | 12×12 to 30×50 | 9,000 |
| Downtown City | 8–30 | 12×12 to 30×30 | 27,000 |
Provides 3D monocular pedestrian keypoints and absolute agent positions; total of 100,700 annotated frames.
3.3 Omni360-WayPoint
| Scene | #Routes | Length range (m) | v_max (m/s) | a_max (m/s²) |
|---|---|---|---|---|
| City Park | 20,000 | [50,150] | 16 | 3 |
| Downtown W. | 5,000 | [20,50] | 16 | 3 |
Enables benchmarking of trajectory-following tasks with strict dynamic feasibility.
4. Experimental Benchmarks
AirSim360 data were used for benchmarking across several core tasks, evaluating transferability and downstream gains when integrated into existing models.
4.1 Monocular Pedestrian Distance Estimation
Using the MonoLoco++ baseline, networks trained on nuScenes + Omni360-Human produce lower angular errors compared to nuScenes alone. Specifically, angular error on the FreeMan test set was reduced from 17.0° to 11.6°, an approximately 32% relative improvement.
| Train Set | Test Set | Dist. Err (m) | Ang. Err (°) |
|---|---|---|---|
| nuScenes | KITTI | 0.822 | 31.5 |
| nuScenes+Omni360 | KITTI | 0.809 | 31.2 |
| nuScenes | FreeMan | 0.260 | 17.0 |
| nuScenes+Omni360 | FreeMan | 0.228 | 11.6 |
4.2 Panoramic Depth Estimation
Training UniK3D on Omni360 outperforms the Deep360 baseline in average relative error (5.437 vs. 8.257) and the ratio metric (0.399 vs. 0.349) when evaluated on SphereCraft (see the metric sketch after the table).
| Train Data | Test Data | AbsRel ↓ | Ratio ↑ |
|---|---|---|---|
| Deep360 | SphereCraft | 8.257 | 0.349 |
| Omni360 | SphereCraft | 5.437 | 0.399 |
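For reference, the two columns can be computed as below. AbsRel is the mean absolute relative error; the ratio column is assumed here to be the standard threshold accuracy δ (fraction of pixels with max(d̂/d, d/d̂) < 1.25), the usual companion metric; this interpretation is an assumption rather than something stated in the source.

```python
import numpy as np

def depth_metrics(pred, gt, thresh=1.25):
    """AbsRel (lower is better) and threshold accuracy (higher is better) over valid pixels."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta = np.mean(ratio < thresh)
    return abs_rel, delta
```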
4.3 Panoramic Segmentation
Augmenting WildPASS training data with Omni360-Scene increases semantic mIoU from 58.0 to 67.4 and entity mAP from 24.6 to 38.9.
| Data | Sem. mIoU ↑ | Ent. mAP ↑ |
|---|---|---|
| WildPASS only | 58.0 | 24.6 |
| +Omni360-Scene | 67.4 | 38.9 |
4.4 Vision-Language Navigation
Evaluation on panoramic VLN using qwen2.5-vl-72b vs. doubao-seed-1 models:
| Model | SR | SPL | NE |
|---|---|---|---|
| qwen2.5-vl-72b | 0.40 | 0.38 | 18099 |
| doubao-seed-1 | 0.50 | 0.48 | 10573 |
Here SR is the success rate, SPL is success weighted by path length, and NE is the navigation error.
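These follow the standard VLN definitions; a minimal sketch for computing them from per-episode records is given below, where the dictionary keys and the success radius are illustrative assumptions rather than values from the source.

```python
import numpy as np

def vln_metrics(episodes, success_dist=3.0):
    """episodes: list of dicts with illustrative keys
    'final_dist_to_goal', 'path_length', 'shortest_path_length' (same length unit)."""
    ne = np.array([e["final_dist_to_goal"] for e in episodes])
    success = (ne <= success_dist).astype(float)               # SR per episode
    p = np.array([e["path_length"] for e in episodes])
    l = np.array([e["shortest_path_length"] for e in episodes])
    spl = success * l / np.maximum(p, l)                       # success weighted by path length
    return {"SR": success.mean(), "SPL": spl.mean(), "NE": ne.mean()}
```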
5. Comparison with Legacy Simulators
Previous simulators such as AirSim (Shah et al., 2017) provide general-purpose visual simulation via a UE4 plugin encompassing modular vehicle models, a physics engine, and extensible sensor interfaces (IMU, barometer, GPS, RGB, depth). While AirSim supports synchronized capture of pinhole or cubemap camera arrays and can yield equirectangular projections through scripted post-processing, its native omnidirectional support is limited:
- Manual configuration of six orthogonal cameras via settings.json and post-run panorama stitching is required (an illustrative configuration sketch follows this list)
- Depth, segmentation, and associated ground truths require transformation from the perspective to equirectangular domain, often introducing alignment errors
- No built-in interactive human simulation or annotation at the keypoint level
- Trajectory planning is left to user scripts, without built-in minimum-snap optimization
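For comparison, the legacy workflow referenced above typically starts from a hand-written settings.json describing six orthogonal 90° cameras. The sketch below generates such a file programmatically; the field names follow the public AirSim settings schema, while the vehicle name, resolutions, and pitch/yaw sign conventions are assumptions to be checked against the AirSim documentation.

```python
import json

# Six 90° FOV cameras covering the cube faces: (pitch, yaw) in degrees.
# Positive pitch is assumed to tilt the camera upward; verify against your AirSim build.
FACES = {
    "front": (0, 0), "back": (0, 180), "left": (0, -90),
    "right": (0, 90), "up": (90, 0), "down": (-90, 0),
}

cameras = {
    name: {
        "CaptureSettings": [
            {"ImageType": 0, "Width": 1024, "Height": 1024, "FOV_Degrees": 90}
        ],
        "X": 0, "Y": 0, "Z": 0, "Pitch": pitch, "Roll": 0, "Yaw": yaw,
    }
    for name, (pitch, yaw) in FACES.items()
}

settings = {
    "SettingsVersion": 1.2,
    "SimMode": "Multirotor",
    "Vehicles": {"Drone1": {"VehicleType": "SimpleFlight", "Cameras": cameras}},
}

with open("settings.json", "w") as f:
    json.dump(settings, f, indent=2)
```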
AirSim360 directly addresses these limitations via native panoramic rendering, intrinsic render-aligned ground-truth, built-in agent simulation with annotation, and trajectory planners specifically tailored to UAV perception needs.
6. Limitations and Future Directions
The current AirSim360 release is scoped to urban/outdoor environments; it does not model weather or wind disturbances and does not provide LiDAR or thermal imaging sensors. Pedestrian simulation, while interactive and annotated at skeletal granularity, could benefit from more richly parameterized social force or interaction models. The roadmap prioritizes:
- Integration of environmental disturbances (weather, wind)
- Expansion toward multi-UAV coordinated operations and adversarial scenarios
- Release of plugins for additional sensors: LiDAR, event cameras, hyperspectral imaging
- Direct support for reinforcement-learning agents in closed-loop navigation and perception tasks
- Public dissemination of the entire toolchain and Omni360-X dataset to catalyze adoption and enable comparative evaluation across the community
AirSim360 constitutes the first UAV simulation suite to consistently realize render-aligned, large-scale, annotated, and interactive omnidirectional image generation, establishing a new baseline for aerial panoramic perception research (Ge et al., 1 Dec 2025).