
AirSim360: 360° UAV Simulation Platform

Updated 3 December 2025
  • AirSim360 is a simulation platform that generates native 360° panoramic images with pixel- and entity-level annotations for UAV perception and mapping tasks.
  • It leverages Unreal Engine 5 for synchronized multisensor capture, integrating automated minimum-snap trajectory planning and interactive pedestrian simulation.
  • The system addresses data scarcity and annotation challenges by providing scalable, render-aligned benchmarks for omnidirectional vision in robotics and navigation.

AirSim360 is a simulation platform incorporating native 360-degree omnidirectional image generation, ground-truth labeling, interactive agent simulation, and automated trajectory planning from drone viewpoints, specifically designed to address the data scarcity and annotation challenges in omnidirectional vision for robotics, mapping, and navigation tasks. The system builds on Unreal Engine 5 (UE5) and presents synchronized multisensor scene capture, pixel- and entity-level ground truth, and scalable flight data collection, facilitating aerial benchmarking for perception and autonomy research requiring true 4D (space-time) panoramic realism (Ge et al., 1 Dec 2025).

1. Motivation and Context

Emerging robotic and computer vision applications demand omnidirectional “spatial intelligence,” including tasks like panoramic mapping, agent-centric navigation, and vision-language grounding. Existing panoramic datasets are limited both in scale (typically only a few thousand annotated images) and in annotation density, due to the high overhead of pixel- and instance-level manual labeling. Predecessor UAV simulators—including AirSim (Shah et al., 2017), CARLA, and UnrealZoo—lack native support for panoramic (360°) image rendering, typically resorting to successive camera rotations and manual frame reassembly. These techniques introduce inefficiencies and misalignments, especially when transferring ground-truth modalities such as depth or semantic segmentations between perspective and equirectangular/cubemap domains.

AirSim360 addresses these gaps by introducing:

  • Render-aligned omnidirectional image and annotation generation (direct panoramic rendering with one-to-one modality correspondence)
  • Interactive simulation of pedestrian agents with dense skeletal keypoint annotation
  • Automated minimum-snap trajectory generation for scalable and diverse data acquisition

Together, these components enable systematic simulation of the real world at both the scene and event levels in a 4D omnidirectional setting.

2. Architecture and Core Modules

AirSim360’s architecture consists of three tightly coupled components:

  1. UE-based Rendering and Flight Control Loop: Implements synchronized multisensor capture (six monocular pinhole cameras for full 360° coverage), flight control, and recording.
  2. Offline Data Collection Toolkit: Handles GPU-side stitching of raw cubemap images into panoramic (equirectangular) images, automated association with pixel- and entity-level ground truth, and quality assurance through alignment metrics.
  3. Interactive Pedestrian-Aware System (IPAS): Deploys multiple non-player character (NPC) pedestrians controlled by behavior trees and state machines, with real-time streaming of 3D keypoints and option for social force field modeling.
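
To make the hand-off between these modules concrete, the sketch below outlines one synchronized per-frame record as it might flow from the capture loop into the offline toolkit; the field names, shapes, and types are illustrative assumptions, not AirSim360's actual data schema.

# Hypothetical per-frame record passed from the capture loop to the offline
# toolkit; field names and shapes are illustrative, not AirSim360's real schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class PanoFrame:
    timestamp: float          # simulation time of the synchronized capture
    cube_faces: dict          # six H_c x W_c x 3 face images keyed by axis
    equirect_rgb: np.ndarray  # H_e x W_e x 3 stitched panorama
    equirect_depth: np.ndarray  # H_e x W_e metric depth from the z-buffer
    semantic: np.ndarray      # H_e x W_e semantic label map
    entity_ids: np.ndarray    # H_e x W_e per-instance IDs
    npc_keypoints: dict       # agent id -> (J, 3) skeletal joint positions
    drone_pose: np.ndarray    # 4x4 world-from-camera transform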

Render-Aligned Data and Labeling

  • Uses six 90° FOV virtual pinhole cameras oriented along ±X, ±Y, and ±Z, yielding simultaneous outputs $\{I_c\}_{c=1}^{6}$ (each of size $H_c \times W_c$), GPU-stitched into an equirectangular image $I_e \in \mathbb{R}^{H_e \times W_e \times 3}$. Spherical projection and cubemap mapping ensure a bijective pixel correspondence (see the stitching sketch after this list).
  • Pixelwise depth is computed as $D(p) = \|\mathbf{X}(p) - \mathbf{C}\|_2$, where $\mathbf{C}$ is the camera center and $\mathbf{X}(p)$ is the 3D world point extracted from the UE z-buffer.
  • Semantic segmentation leverages the stencil buffer with mesh integer labels mapped to semantic categories.
  • Entity segmentation performs two-pass labeling to resolve the 256-label stencil buffer limit, providing per-instance IDs for all actors (static meshes, skeletal meshes, landscapes).
  • Render-ground-truth alignment is enforced for each pixel via the loss

L_{\mathrm{align}} = \frac{1}{N} \sum_{i=1}^{N} \| R(p_i) - G(p_i) \|^2,

where $R$ denotes the rendered modality and $G$ the saved ground truth, used in QA/calibration.
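
As a concrete illustration of the cubemap-to-equirectangular step, the following NumPy sketch resamples six cube faces into a panorama by mapping every equirectangular pixel to a viewing direction and then to the dominant cube face. The face naming, axis conventions, and nearest-neighbour sampling are assumptions chosen for clarity; AirSim360 performs this stitching on the GPU.

# Illustrative cubemap -> equirectangular resampling sketch (CPU/NumPy version
# of the GPU stitching step; axis conventions here are assumptions).
import numpy as np

def equirect_from_cubemap(faces, He, We):
    """faces: dict mapping '+x','-x','+y','-y','+z','-z' to square (F, F, 3) images."""
    F = next(iter(faces.values())).shape[0]
    # spherical viewing direction for every equirectangular pixel (z-up convention)
    u, v = np.meshgrid(np.arange(We), np.arange(He))
    lon = (u + 0.5) / We * 2.0 * np.pi - np.pi    # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / He * np.pi    # latitude in [-pi/2, pi/2]
    d = np.stack([np.cos(lat) * np.cos(lon),
                  np.cos(lat) * np.sin(lon),
                  np.sin(lat)], axis=-1)          # (He, We, 3) unit directions

    out = np.zeros((He, We, 3), dtype=next(iter(faces.values())).dtype)
    axis_of = {'+x': 0, '-x': 0, '+y': 1, '-y': 1, '+z': 2, '-z': 2}
    dominant = np.argmax(np.abs(d), axis=-1)      # which cube face each ray hits
    for name, img in faces.items():
        ax = axis_of[name]
        comp = d[..., ax]
        mask = (dominant == ax) & ((comp >= 0) if name[0] == '+' else (comp < 0))
        other = [a for a in range(3) if a != ax]
        # project onto the face plane: the two remaining axes divided by |dominant|
        s = d[..., other[0]][mask] / np.abs(comp[mask])
        t = d[..., other[1]][mask] / np.abs(comp[mask])
        px = np.clip(((s + 1.0) / 2.0 * F).astype(int), 0, F - 1)
        py = np.clip(((t + 1.0) / 2.0 * F).astype(int), 0, F - 1)
        out[mask] = img[py, px]
    return out

Because depth, semantic, and entity maps can be resampled with the same per-pixel index maps, every modality stays aligned with the rendered panorama.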

Interactive Pedestrian-Aware System

  • NPC behaviors (walk, chat, phone-call) are controlled via state machines and event-driven dispatchers (supporting events such as “meet,” “collide,” “idle”).
  • Real-time skeleton keypoint annotation is streamed for each agent by reading bone transformations, with custom joint definitions possible using UE add-socket mechanisms.
  • Optional Social Force Model overlays inter-agent repulsion and target-directed behaviors:

m_i \frac{d\mathbf{v}_i}{dt} = \frac{m_i (v_i^0 \mathbf{e}_i - \mathbf{v}_i)}{\tau_i} + \sum_{j \ne i} \mathbf{f}_{ij} + \sum_{\mathrm{obs}} \mathbf{f}_{i,\mathrm{obs}}

where $\mathbf{f}_{ij}$ is calibrated for physical plausibility within UE's physics engine (a minimal integration sketch follows below).
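
For intuition, the sketch below performs one Euler integration step of this social-force update; the repulsion gains, interaction radius, mass, and relaxation time are illustrative values rather than parameters reported for AirSim360, and the obstacle term is omitted.

# Minimal Euler-integration sketch of the social-force update above.
# Gains (A, B), radii, mass, and tau are illustrative, not AirSim360 parameters.
import numpy as np

def social_force_step(pos, vel, goal, dt=0.05, v0=1.4, tau=0.5, m=80.0,
                      A=2000.0, B=0.08, radii=0.6):
    """One step of m dv/dt = m(v0 e - v)/tau + sum_j f_ij for N planar agents."""
    e = goal - pos
    e /= np.linalg.norm(e, axis=1, keepdims=True) + 1e-9   # unit vectors toward goals
    drive = m * (v0 * e - vel) / tau                       # goal-directed driving force

    diff = pos[:, None, :] - pos[None, :, :]               # pairwise offsets (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                         # no self-repulsion
    # exponential repulsion f_ij pushing agent i away from neighbour j
    rep = (A * np.exp((radii - dist) / B))[..., None] * diff / dist[..., None]
    force = drive + rep.sum(axis=1)

    vel = vel + force / m * dt
    pos = pos + vel * dt
    return pos, vel

# example: five agents walking toward a shared goal
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(5, 2))
vel = np.zeros((5, 2))
goal = np.full((5, 2), 10.0)
pos, vel = social_force_step(pos, vel, goal)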

Automated Trajectory Generation

  • Implements the minimum-snap trajectory framework (Mellinger & Kumar, 2011), framing each path segment between waypoints $\{p_0, \dots, p_M\}$ as a 5th-order polynomial

p_i(t) = \sum_{k=0}^{5} a_{i,k} t^k, \quad t \in [0, T_i]

with a snap cost

J = \int_0^{T_{\mathrm{tot}}} \left\| \frac{d^4 p}{dt^4} \right\|^2 dt,

and subject to speed and acceleration constraints ($\|\dot{p}(t)\| \leq v_{\max}$, $\|\ddot{p}(t)\| \leq a_{\max}$). The resulting quadratic program is solved per trajectory, enabling efficient, diverse path sampling.

Pseudocode: Minimum-Snap Trajectory Generation

Algorithm 1: Minimum-Snap Trajectory Generation
Input: waypoints W = {w_0, …, w_M}, v_max, a_max, sampling step Δt
Output: time-parameterized S(t) = (p(t), v(t), a(t))
1: Partition the total flight time into segments [0, T_1], …, [T_{M-1}, T_M].
2: For each segment i, build the polynomial basis p_i(t) = a_{i,0} + … + a_{i,5} t^5.
3: Assemble Q from ∫ ‖d^4 p/dt^4‖² dt and the constraint matrices A, b.
4: Solve the QP for the coefficients a_i.
5: Sample t = 0 : Δt : T_M to obtain p(t), v(t) = p'(t), a(t) = p''(t).
6: Return S(t).
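
As a runnable counterpart to Algorithm 1, the sketch below solves the snap-minimizing QP for a single axis via its KKT system, under simplifying assumptions not taken from the paper: quintic segments as stated above, fixed per-segment durations, rest-to-rest boundary conditions, and the v_max / a_max inequality constraints omitted (in practice these are often enforced by rescaling segment times). Each spatial axis would be solved independently.

# Single-axis minimum-snap sketch with quintic segments; simplifying assumptions
# as described in the lead-in (not the paper's exact formulation).
import math
import numpy as np

NC = 6  # quintic segments: coefficients a_0 ... a_5

def deriv_row(t, r):
    """Row vector evaluating the r-th derivative of a quintic at time t."""
    row = np.zeros(NC)
    for k in range(r, NC):
        row[k] = math.factorial(k) // math.factorial(k - r) * t ** (k - r)
    return row

def snap_gramian(T):
    """Q_i with entries integral_0^T (d^4 t^k/dt^4)(d^4 t^l/dt^4) dt."""
    Q = np.zeros((NC, NC))
    for k in range(4, NC):
        for l in range(4, NC):
            ck = math.factorial(k) // math.factorial(k - 4)
            cl = math.factorial(l) // math.factorial(l - 4)
            Q[k, l] = ck * cl * T ** (k + l - 7) / (k + l - 7)
    return Q

def min_snap_1d(waypoints, durations):
    """Snap-minimal quintic spline through the waypoints; returns (M, 6) coefficients."""
    M = len(durations)
    n = NC * M
    Q = np.zeros((n, n))
    rows, rhs = [], []
    for i, T in enumerate(durations):
        s = slice(NC * i, NC * (i + 1))
        Q[s, s] = snap_gramian(T)
        # segment endpoints must pass through the waypoints
        for t, w in ((0.0, waypoints[i]), (T, waypoints[i + 1])):
            r = np.zeros(n)
            r[s] = deriv_row(t, 0)
            rows.append(r)
            rhs.append(w)
    # rest-to-rest boundary: zero velocity and acceleration at start and end
    for order in (1, 2):
        r = np.zeros(n); r[:NC] = deriv_row(0.0, order); rows.append(r); rhs.append(0.0)
        r = np.zeros(n); r[-NC:] = deriv_row(durations[-1], order); rows.append(r); rhs.append(0.0)
    # C1..C3 continuity at interior waypoints
    for i in range(M - 1):
        for order in (1, 2, 3):
            r = np.zeros(n)
            r[NC * i:NC * (i + 1)] = deriv_row(durations[i], order)
            r[NC * (i + 1):NC * (i + 2)] = -deriv_row(0.0, order)
            rows.append(r)
            rhs.append(0.0)
    A, b = np.array(rows), np.array(rhs)
    # equality-constrained QP  min a^T Q a  s.t.  A a = b, solved via its KKT system
    kkt = np.block([[2 * Q, A.T], [A, np.zeros((len(b), len(b)))]])
    sol = np.linalg.solve(kkt, np.concatenate([np.zeros(n), b]))
    return sol[:n].reshape(M, NC)

# example: one axis of a three-segment path; x, y, z would each be solved this way
coeffs = min_snap_1d(waypoints=[0.0, 10.0, 25.0, 40.0], durations=[4.0, 5.0, 4.0])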

3. Dataset Composition and Annotations

The Omni360-X dataset suite generated via AirSim360 consists of three principal benchmarks:

3.1 Omni360-Scene

Scene           Area (m²)   #Images   #Semantic Categories
City Park       800,000     25,600    25
Downtown West   60,000      6,800     29
SF City         250,000     22,000    20
New York City   44,800      6,600     25

Each panorama provides dense pixelwise depth, semantic segmentation ($25$–$29$ classes), and per-entity segmentation (over 1 million instances per scene).

3.2 Omni360-Human

Scene             NPC Count   Area (m × m)      Frames
New York City     15–45       12×12 to 30×30    29,000
Lisbon Downtown   10–45       12×12 to 30×50    9,000
Downtown City     8–30        12×12 to 30×30    27,000

Provides 3D monocular pedestrian keypoints and absolute agent positions; total of 100,700 annotated frames.

3.3 Omni360-WayPoint

Scene         #Routes   Length range (m)   v_max (m/s)   a_max (m/s²)
City Park     20,000    [50, 150]          16            3
Downtown W.   5,000     [20, 50]           16            3

Enables benchmarking of trajectory-following tasks with strict dynamic feasibility.
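
A simple finite-difference feasibility check in the spirit of these limits might look as follows; the benchmark's own validation procedure and sampling scheme are not specified here, so this is an assumption-laden sketch.

# Dynamic-feasibility check for a sampled trajectory against the v_max / a_max
# limits above; finite differences at a fixed step dt (illustrative, not the
# benchmark's official validation code).
import numpy as np

def is_feasible(positions, dt, v_max=16.0, a_max=3.0):
    """positions: (T, 3) array of trajectory samples at a fixed time step dt."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    return (np.linalg.norm(vel, axis=1).max() <= v_max and
            np.linalg.norm(acc, axis=1).max() <= a_max)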

4. Experimental Benchmarks

AirSim360 data were used to benchmark several core tasks, evaluating transferability and the gains obtained when the synthetic data are integrated into existing models.

4.1 Monocular Pedestrian Distance Estimation

Using the MonoLoco++ baseline, networks trained on nuScenes + Omni360-Human produce lower angular errors than networks trained on nuScenes alone. Specifically, angular error on the FreeMan test set was reduced from $17.0^\circ$ to $11.6^\circ$, a relative reduction of roughly 32%.

Train Set          Test Set   Dist. Err (m)   Ang. Err (°)
nuScenes           KITTI      0.822           31.5
nuScenes+Omni360   KITTI      0.809           31.2
nuScenes           FreeMan    0.260           17.0
nuScenes+Omni360   FreeMan    0.228           11.6

4.2 Panoramic Depth Estimation

Training UniK3D on Omni360 outperforms the Deep360-trained baseline in mean absolute relative error ($\mathrm{AbsRel} = 5.437$ vs. $8.257$) and $\delta_1$ ratio ($0.399$ vs. $0.349$) when evaluated on SphereCraft.

Train Data   Test Data     AbsRel ↓   δ1 ↑
Deep360      SphereCraft   8.257      0.349
Omni360      SphereCraft   5.437      0.399
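
For reference, the two metrics in the table follow their standard definitions, as in the sketch below; the valid-pixel masking, depth scaling, and averaging conventions behind the reported numbers are assumptions left out here.

# Standard-definition sketch of the depth metrics quoted above (evaluation
# masks and scaling conventions are assumptions, not the paper's protocol).
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """pred, gt: depth maps of identical shape; returns (AbsRel, delta_1)."""
    valid = gt > eps
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)               # AbsRel (lower is better)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)  # delta_1 (higher is better)
    return abs_rel, delta1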

4.3 Panoramic Segmentation

Augmenting WildPASS training data with Omni360-Scene increases semantic mIoU from $58.0$ to $67.4$ and entity mAP from $24.6$ to $38.9$.

Data             Sem. mIoU ↑   Ent. mAP ↑
WildPASS only    58.0          24.6
+Omni360-Scene   67.4          38.9

4.4 Vision-Language Navigation

Panoramic vision-language navigation (VLN) was evaluated by comparing the qwen2.5-vl-72b and doubao-seed-1 models:

Model            SR     SPL    NE
qwen2.5-vl-72b   0.40   0.38   18099
doubao-seed-1    0.50   0.48   10573

Here SR is the success rate, SPL is success weighted by path length, and NE is the navigation error.
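
These metrics can be computed from per-episode logs as in the sketch below, using their standard definitions; the success radius and distance units are assumptions, not values specified for this benchmark.

# Standard VLN metric sketch; success threshold and units are assumptions.
import numpy as np

def vln_metrics(final_dist, path_len, shortest_len, success_radius=3.0):
    """Per-episode arrays: final distance to goal, executed path length, shortest path length."""
    final_dist, path_len, shortest_len = map(np.asarray, (final_dist, path_len, shortest_len))
    success = final_dist <= success_radius                 # per-episode success flag
    sr = success.mean()                                    # SR: success rate
    spl = np.mean(success * shortest_len /
                  np.maximum(path_len, shortest_len))      # SPL: success weighted by path length
    ne = final_dist.mean()                                 # NE: mean navigation error
    return sr, spl, ne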

5. Comparison with Legacy Simulators

Previous simulators such as AirSim (Shah et al., 2017) provide UAV visual simulation via a UE4 plugin encompassing modular vehicle models, a physics engine, and extensible sensor interfaces (IMU, barometer, GPS, RGB, depth). While AirSim supports synchronized capture from pinhole or cubemap camera arrays and can yield equirectangular projections through scripted post-processing, its native omnidirectional support is limited:

  • Manual configuration of six orthogonal cameras via settings.json and post-run panorama stitching is required
  • Depth, segmentation, and associated ground truths require transformation from the perspective to equirectangular domain, often introducing alignment errors
  • No built-in interactive human simulation or annotation at the keypoint level
  • Trajectory planning is left to user scripts, without built-in minimum-snap optimization

AirSim360 directly addresses these limitations via native panoramic rendering, intrinsic render-aligned ground-truth, built-in agent simulation with annotation, and trajectory planners specifically tailored to UAV perception needs.

6. Limitations and Future Directions

The current AirSim360 release is scoped to urban outdoor environments and does not yet model weather or wind effects, LiDAR, or thermal imaging. Pedestrian simulation, while interactive and annotated at skeletal granularity, could benefit from more richly parameterized social-force or interaction models. The roadmap prioritizes:

  • Integration of environmental disturbances (weather, wind)
  • Expansion toward multi-UAV coordinated operations and adversarial scenarios
  • Release of plugins for additional sensors: LiDAR, event cameras, hyperspectral imaging
  • Direct support for reinforcement-learning agents in closed-loop navigation and perception tasks
  • Public dissemination of the entire toolchain and Omni360-X dataset to catalyze adoption and enable comparative evaluation across the community

AirSim360 constitutes the first UAV simulation suite to consistently realize render-aligned, large-scale, annotated, and interactive omnidirectional image generation, establishing a new baseline for aerial panoramic perception research (Ge et al., 1 Dec 2025).
