
AirSim360: 360° UAV Simulation Platform

Updated 3 December 2025
  • AirSim360 is a simulation platform that generates native 360° panoramic images with pixel- and entity-level annotations for UAV perception and mapping tasks.
  • It leverages Unreal Engine 5 for synchronized multisensor capture, integrating automated minimum-snap trajectory planning and interactive pedestrian simulation.
  • The system addresses data scarcity and annotation challenges by providing scalable, render-aligned benchmarks for omnidirectional vision in robotics and navigation.

AirSim360 is a simulation platform incorporating native 360-degree omnidirectional image generation, ground-truth labeling, interactive agent simulation, and automated trajectory planning from drone viewpoints, specifically designed to address the data scarcity and annotation challenges in omnidirectional vision for robotics, mapping, and navigation tasks. The system builds on Unreal Engine 5 (UE5) and presents synchronized multisensor scene capture, pixel- and entity-level ground truth, and scalable flight data collection, facilitating aerial benchmarking for perception and autonomy research requiring true 4D (space-time) panoramic realism (Ge et al., 1 Dec 2025).

1. Motivation and Context

Emerging robotic and computer vision applications demand omnidirectional “spatial intelligence,” including tasks like panoramic mapping, agent-centric navigation, and vision-language grounding. Existing panoramic datasets are limited both in scale (typically only a few thousand annotated images) and in annotation density, due to the high overhead of pixel- and instance-level manual labeling. Predecessor UAV simulators—including AirSim (Shah et al., 2017), CARLA, and UnrealZoo—lack native support for panoramic (360°) image rendering, typically resorting to successive camera rotations and manual frame reassembly. These techniques introduce inefficiencies and misalignments, especially when transferring ground-truth modalities such as depth or semantic segmentations between perspective and equirectangular/cubemap domains.

AirSim360 addresses these gaps by introducing:

  • Render-aligned omnidirectional image and annotation generation (direct panoramic rendering with one-to-one modality correspondence)
  • Interactive simulation of pedestrian agents with dense skeletal keypoint annotation
  • Automated minimum-snap trajectory generation for scalable and diverse data acquisition

Together, these components enable systematic simulation of the real world at both the scene and event levels in a 4D omnidirectional setting.

2. Architecture and Core Modules

AirSim360’s architecture consists of three tightly coupled components:

  1. UE-based Rendering and Flight Control Loop: Implements synchronized multisensor capture (six monocular pinhole cameras for full 360° coverage), flight control, and recording.
  2. Offline Data Collection Toolkit: Handles GPU-side stitching of raw cubemap images into panoramic (equirectangular) images, automated association with pixel- and entity-level ground truth, and quality assurance through alignment metrics.
  3. Interactive Pedestrian-Aware System (IPAS): Deploys multiple non-player character (NPC) pedestrians controlled by behavior trees and state machines, with real-time streaming of 3D keypoints and option for social force field modeling.
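
To make the hand-off between these modules concrete, the sketch below outlines one synchronized per-frame record as it might flow from the capture loop into the offline toolkit; the field names, shapes, and types are illustrative assumptions, not AirSim360's actual data schema.

# Hypothetical per-frame record passed from the capture loop to the offline
# toolkit; field names and shapes are illustrative, not AirSim360's real schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class PanoFrame:
    timestamp: float          # simulation time of the synchronized capture
    cube_faces: dict          # six H_c x W_c x 3 face images keyed by axis
    equirect_rgb: np.ndarray  # H_e x W_e x 3 stitched panorama
    equirect_depth: np.ndarray  # H_e x W_e metric depth from the z-buffer
    semantic: np.ndarray      # H_e x W_e semantic label map
    entity_ids: np.ndarray    # H_e x W_e per-instance IDs
    npc_keypoints: dict       # agent id -> (J, 3) skeletal joint positions
    drone_pose: np.ndarray    # 4x4 world-from-camera transform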

Render-Aligned Data and Labeling

  • Uses six 90° FOV virtual pinhole cameras oriented along ±X, ±Y, and ±Z, yielding simultaneous outputs $\{I_c\}_{c=1}^{6}$ (each of size $H_c \times W_c$), GPU-stitched into an equirectangular image $I_e \in \mathbb{R}^{H_e \times W_e \times 3}$. Spherical projection and cubemap mapping ensure a bijective pixel correspondence (see the stitching sketch after this list).
  • Pixelwise depth is computed as $D(p) = \|\mathbf{X}(p) - \mathbf{C}\|_2$, where $\mathbf{C}$ is the camera center and $\mathbf{X}(p)$ is the 3D world point extracted from the UE z-buffer.
  • Semantic segmentation leverages the stencil buffer with mesh integer labels mapped to semantic categories.
  • Entity segmentation performs two-pass labeling to resolve the 256-label stencil buffer limit, providing per-instance IDs for all actors (static meshes, skeletal meshes, landscapes).
  • Render-ground-truth alignment is enforced for each pixel via the loss

L_{\mathrm{align}} = \frac{1}{N} \sum_{i=1}^{N} \| R(p_i) - G(p_i) \|^2,

where $R$ denotes the rendered modality and $G$ the saved ground truth, used in QA/calibration.
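
As a concrete illustration of the cubemap-to-equirectangular step, the following NumPy sketch resamples six cube faces into a panorama by mapping every equirectangular pixel to a viewing direction and then to the dominant cube face. The face naming, axis conventions, and nearest-neighbour sampling are assumptions chosen for clarity; AirSim360 performs this stitching on the GPU.

# Illustrative cubemap -> equirectangular resampling sketch (CPU/NumPy version
# of the GPU stitching step; axis conventions here are assumptions).
import numpy as np

def equirect_from_cubemap(faces, He, We):
    """faces: dict mapping '+x','-x','+y','-y','+z','-z' to square (F, F, 3) images."""
    F = next(iter(faces.values())).shape[0]
    # spherical viewing direction for every equirectangular pixel (z-up convention)
    u, v = np.meshgrid(np.arange(We), np.arange(He))
    lon = (u + 0.5) / We * 2.0 * np.pi - np.pi    # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / He * np.pi    # latitude in [-pi/2, pi/2]
    d = np.stack([np.cos(lat) * np.cos(lon),
                  np.cos(lat) * np.sin(lon),
                  np.sin(lat)], axis=-1)          # (He, We, 3) unit directions

    out = np.zeros((He, We, 3), dtype=next(iter(faces.values())).dtype)
    axis_of = {'+x': 0, '-x': 0, '+y': 1, '-y': 1, '+z': 2, '-z': 2}
    dominant = np.argmax(np.abs(d), axis=-1)      # which cube face each ray hits
    for name, img in faces.items():
        ax = axis_of[name]
        comp = d[..., ax]
        mask = (dominant == ax) & ((comp >= 0) if name[0] == '+' else (comp < 0))
        other = [a for a in range(3) if a != ax]
        # project onto the face plane: the two remaining axes divided by |dominant|
        s = d[..., other[0]][mask] / np.abs(comp[mask])
        t = d[..., other[1]][mask] / np.abs(comp[mask])
        px = np.clip(((s + 1.0) / 2.0 * F).astype(int), 0, F - 1)
        py = np.clip(((t + 1.0) / 2.0 * F).astype(int), 0, F - 1)
        out[mask] = img[py, px]
    return out

Because depth, semantic, and entity maps can be resampled with the same per-pixel index maps, every modality stays aligned with the rendered panorama.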

Interactive Pedestrian-Aware System

  • NPC behaviors (walk, chat, phone-call) are controlled via state machines and event-driven dispatchers (supporting events such as “meet,” “collide,” “idle”).
  • Real-time skeleton keypoint annotation is streamed for each agent by reading bone transformations, with custom joint definitions possible using UE add-socket mechanisms.
  • Optional Social Force Model overlays inter-agent repulsion and target-directed behaviors:

m_i \frac{d\mathbf{v}_i}{dt} = \frac{m_i (v_i^0 \mathbf{e}_i - \mathbf{v}_i)}{\tau_i} + \sum_{j \ne i} \mathbf{f}_{ij} + \sum_{\mathrm{obs}} \mathbf{f}_{i,\mathrm{obs}}

where $\mathbf{f}_{ij}$ is calibrated for physical plausibility within UE's physics engine (a minimal integration sketch follows below).
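
For intuition, the sketch below performs one Euler integration step of this social-force update; the repulsion gains, interaction radius, mass, and relaxation time are illustrative values rather than parameters reported for AirSim360, and the obstacle term is omitted.

# Minimal Euler-integration sketch of the social-force update above.
# Gains (A, B), radii, mass, and tau are illustrative, not AirSim360 parameters.
import numpy as np

def social_force_step(pos, vel, goal, dt=0.05, v0=1.4, tau=0.5, m=80.0,
                      A=2000.0, B=0.08, radii=0.6):
    """One step of m dv/dt = m(v0 e - v)/tau + sum_j f_ij for N planar agents."""
    e = goal - pos
    e /= np.linalg.norm(e, axis=1, keepdims=True) + 1e-9   # unit vectors toward goals
    drive = m * (v0 * e - vel) / tau                       # goal-directed driving force

    diff = pos[:, None, :] - pos[None, :, :]               # pairwise offsets (N, N, 2)
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)                         # no self-repulsion
    # exponential repulsion f_ij pushing agent i away from neighbour j
    rep = (A * np.exp((radii - dist) / B))[..., None] * diff / dist[..., None]
    force = drive + rep.sum(axis=1)

    vel = vel + force / m * dt
    pos = pos + vel * dt
    return pos, vel

# example: five agents walking toward a shared goal
rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 10.0, size=(5, 2))
vel = np.zeros((5, 2))
goal = np.full((5, 2), 10.0)
pos, vel = social_force_step(pos, vel, goal)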

Automated Trajectory Generation

  • Implements the minimum-snap trajectory framework (Mellinger & Kumar, 2011), framing each path segment between waypoints $\{p_0, \dots, p_M\}$ as a 5th-order polynomial

p_i(t) = \sum_{k=0}^{5} a_{i,k} t^k, \quad t \in [0, T_i]

with a snap cost

J = \int_0^{T_{\mathrm{tot}}} \left\| \frac{d^4 p}{dt^4} \right\|^2 dt,

and subject to speed and acceleration constraints ($\|\dot{p}(t)\| \leq v_{\max}$, $\|\ddot{p}(t)\| \leq a_{\max}$). The resulting quadratic program is solved per trajectory, enabling efficient, diverse path sampling.

Pseudocode: Minimum-Snap Trajectory Generation

Algorithm 1: Minimum-Snap Trajectory Generation
Input: waypoints W = {w_0, …, w_M}, v_max, a_max, sampling step Δt
Output: time-parameterized S(t) = (p(t), v(t), a(t))
1: Partition the total flight time into segments [0, T_1], …, [T_{M-1}, T_M].
2: For each segment i, build the polynomial basis p_i(t) = a_{i,0} + … + a_{i,5} t^5.
3: Assemble Q from ∫ ‖d^4 p/dt^4‖² dt and the constraint matrices A, b.
4: Solve the QP for the coefficients a_i.
5: Sample t = 0 : Δt : T_M to obtain p(t), v(t) = p'(t), a(t) = p''(t).
6: Return S(t).
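
As a runnable counterpart to Algorithm 1, the sketch below solves the snap-minimizing QP for a single axis via its KKT system, under simplifying assumptions not taken from the paper: quintic segments as stated above, fixed per-segment durations, rest-to-rest boundary conditions, and the v_max / a_max inequality constraints omitted (in practice these are often enforced by rescaling segment times). Each spatial axis would be solved independently.

# Single-axis minimum-snap sketch with quintic segments; simplifying assumptions
# as described in the lead-in (not the paper's exact formulation).
import math
import numpy as np

NC = 6  # quintic segments: coefficients a_0 ... a_5

def deriv_row(t, r):
    """Row vector evaluating the r-th derivative of a quintic at time t."""
    row = np.zeros(NC)
    for k in range(r, NC):
        row[k] = math.factorial(k) // math.factorial(k - r) * t ** (k - r)
    return row

def snap_gramian(T):
    """Q_i with entries integral_0^T (d^4 t^k/dt^4)(d^4 t^l/dt^4) dt."""
    Q = np.zeros((NC, NC))
    for k in range(4, NC):
        for l in range(4, NC):
            ck = math.factorial(k) // math.factorial(k - 4)
            cl = math.factorial(l) // math.factorial(l - 4)
            Q[k, l] = ck * cl * T ** (k + l - 7) / (k + l - 7)
    return Q

def min_snap_1d(waypoints, durations):
    """Snap-minimal quintic spline through the waypoints; returns (M, 6) coefficients."""
    M = len(durations)
    n = NC * M
    Q = np.zeros((n, n))
    rows, rhs = [], []
    for i, T in enumerate(durations):
        s = slice(NC * i, NC * (i + 1))
        Q[s, s] = snap_gramian(T)
        # segment endpoints must pass through the waypoints
        for t, w in ((0.0, waypoints[i]), (T, waypoints[i + 1])):
            r = np.zeros(n)
            r[s] = deriv_row(t, 0)
            rows.append(r)
            rhs.append(w)
    # rest-to-rest boundary: zero velocity and acceleration at start and end
    for order in (1, 2):
        r = np.zeros(n); r[:NC] = deriv_row(0.0, order); rows.append(r); rhs.append(0.0)
        r = np.zeros(n); r[-NC:] = deriv_row(durations[-1], order); rows.append(r); rhs.append(0.0)
    # C1..C3 continuity at interior waypoints
    for i in range(M - 1):
        for order in (1, 2, 3):
            r = np.zeros(n)
            r[NC * i:NC * (i + 1)] = deriv_row(durations[i], order)
            r[NC * (i + 1):NC * (i + 2)] = -deriv_row(0.0, order)
            rows.append(r)
            rhs.append(0.0)
    A, b = np.array(rows), np.array(rhs)
    # equality-constrained QP  min a^T Q a  s.t.  A a = b, solved via its KKT system
    kkt = np.block([[2 * Q, A.T], [A, np.zeros((len(b), len(b)))]])
    sol = np.linalg.solve(kkt, np.concatenate([np.zeros(n), b]))
    return sol[:n].reshape(M, NC)

# example: one axis of a three-segment path; x, y, z would each be solved this way
coeffs = min_snap_1d(waypoints=[0.0, 10.0, 25.0, 40.0], durations=[4.0, 5.0, 4.0])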

3. Dataset Composition and Annotations

The Omni360-X dataset suite generated via AirSim360 consists of three principal benchmarks:

3.1 Omni360-Scene

Scene           Area (m²)   #Images   #Semantic Categories
City Park       800,000     25,600    25
Downtown West   60,000      6,800     29
SF City         250,000     22,000    20
New York City   44,800      6,600     25

Each panorama provides dense pixelwise depth, semantic segmentation ($25$–$29$ classes), and per-entity segmentation (over 1 million instances per scene).

3.2 Omni360-Human

Scene             NPC Count   Area (m × m)      Frames
New York City     15–45       12×12 to 30×30    29,000
Lisbon Downtown   10–45       12×12 to 30×50    9,000
Downtown City     8–30        12×12 to 30×30    27,000

Provides 3D monocular pedestrian keypoints and absolute agent positions; total of 100,700 annotated frames.

3.3 Omni360-WayPoint

Scene         #Routes   Length range (m)   v_max (m/s)   a_max (m/s²)
City Park     20,000    [50, 150]          16            3
Downtown W.   5,000     [20, 50]           16            3

Enables benchmarking of trajectory-following tasks with strict dynamic feasibility.
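
A simple finite-difference feasibility check in the spirit of these limits might look as follows; the benchmark's own validation procedure and sampling scheme are not specified here, so this is an assumption-laden sketch.

# Dynamic-feasibility check for a sampled trajectory against the v_max / a_max
# limits above; finite differences at a fixed step dt (illustrative, not the
# benchmark's official validation code).
import numpy as np

def is_feasible(positions, dt, v_max=16.0, a_max=3.0):
    """positions: (T, 3) array of trajectory samples at a fixed time step dt."""
    vel = np.diff(positions, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    return (np.linalg.norm(vel, axis=1).max() <= v_max and
            np.linalg.norm(acc, axis=1).max() <= a_max)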

4. Experimental Benchmarks

AirSim360 data were used to benchmark several core tasks, evaluating transferability and the gains obtained when the synthetic data are integrated into existing models.

4.1 Monocular Pedestrian Distance Estimation

Using the MonoLoco++ baseline, networks trained on nuScenes + Omni360-Human produce lower angular errors than networks trained on nuScenes alone. Specifically, angular error on the FreeMan test set was reduced from $17.0^\circ$ to $11.6^\circ$, a relative reduction of roughly 32%.

Train Set          Test Set   Dist. Err (m)   Ang. Err (°)
nuScenes           KITTI      0.822           31.5
nuScenes+Omni360   KITTI      0.809           31.2
nuScenes           FreeMan    0.260           17.0
nuScenes+Omni360   FreeMan    0.228           11.6

4.2 Panoramic Depth Estimation

Training UniK3D on Omni360 outperforms the Deep360-trained baseline in mean absolute relative error ($\mathrm{AbsRel} = 5.437$ vs. $8.257$) and $\delta_1$ ratio ($0.399$ vs. $0.349$) when evaluated on SphereCraft.

Train Data   Test Data     AbsRel ↓   δ1 ↑
Deep360      SphereCraft   8.257      0.349
Omni360      SphereCraft   5.437      0.399
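
For reference, the two metrics in the table follow their standard definitions, as in the sketch below; the valid-pixel masking, depth scaling, and averaging conventions behind the reported numbers are assumptions left out here.

# Standard-definition sketch of the depth metrics quoted above (evaluation
# masks and scaling conventions are assumptions, not the paper's protocol).
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """pred, gt: depth maps of identical shape; returns (AbsRel, delta_1)."""
    valid = gt > eps
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)               # AbsRel (lower is better)
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)  # delta_1 (higher is better)
    return abs_rel, delta1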

4.3 Panoramic Segmentation

Augmenting WildPASS training data with Omni360-Scene increases semantic mIoU from $58.0$ to $67.4$ and entity mAP from $24.6$ to $38.9$.

Data             Sem. mIoU ↑   Ent. mAP ↑
WildPASS only    58.0          24.6
+Omni360-Scene   67.4          38.9

4.4 Vision-Language Navigation

Panoramic vision-language navigation (VLN) was evaluated by comparing the qwen2.5-vl-72b and doubao-seed-1 models:

Model            SR     SPL    NE
qwen2.5-vl-72b   0.40   0.38   18099
doubao-seed-1    0.50   0.48   10573

Here SR is the success rate, SPL is success weighted by path length, and NE is the navigation error.
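
These metrics can be computed from per-episode logs as in the sketch below, using their standard definitions; the success radius and distance units are assumptions, not values specified for this benchmark.

# Standard VLN metric sketch; success threshold and units are assumptions.
import numpy as np

def vln_metrics(final_dist, path_len, shortest_len, success_radius=3.0):
    """Per-episode arrays: final distance to goal, executed path length, shortest path length."""
    final_dist, path_len, shortest_len = map(np.asarray, (final_dist, path_len, shortest_len))
    success = final_dist <= success_radius                 # per-episode success flag
    sr = success.mean()                                    # SR: success rate
    spl = np.mean(success * shortest_len /
                  np.maximum(path_len, shortest_len))      # SPL: success weighted by path length
    ne = final_dist.mean()                                 # NE: mean navigation error
    return sr, spl, ne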

5. Comparison with Legacy Simulators

Previous simulators such as AirSim (Shah et al., 2017) provide UAV visual simulation via a UE4 plugin encompassing modular vehicle models, a physics engine, and extensible sensor interfaces (IMU, barometer, GPS, RGB, depth). While AirSim supports synchronized capture from pinhole or cubemap camera arrays and can yield equirectangular projections through scripted post-processing, its native omnidirectional support is limited:

  • Manual configuration of six orthogonal cameras via settings.json and post-run panorama stitching is required
  • Depth, segmentation, and associated ground truths require transformation from the perspective to equirectangular domain, often introducing alignment errors
  • No built-in interactive human simulation or annotation at the keypoint level
  • Trajectory planning is left to user scripts, without built-in minimum-snap optimization

AirSim360 directly addresses these limitations via native panoramic rendering, intrinsic render-aligned ground-truth, built-in agent simulation with annotation, and trajectory planners specifically tailored to UAV perception needs.

6. Limitations and Future Directions

The current AirSim360 release is scoped to urban outdoor environments and does not yet model weather or wind effects, LiDAR, or thermal imaging. Pedestrian simulation, while interactive and annotated at skeletal granularity, could benefit from more richly parameterized social-force or interaction models. The roadmap prioritizes:

  • Integration of environmental disturbances (weather, wind)
  • Expansion toward multi-UAV coordinated operations and adversarial scenarios
  • Release of plugins for additional sensors: LiDAR, event cameras, hyperspectral imaging
  • Direct support for reinforcement-learning agents in closed-loop navigation and perception tasks
  • Public dissemination of the entire toolchain and Omni360-X dataset to catalyze adoption and enable comparative evaluation across the community

AirSim360 constitutes the first UAV simulation suite to consistently realize render-aligned, large-scale, annotated, and interactive omnidirectional image generation, establishing a new baseline for aerial panoramic perception research (Ge et al., 1 Dec 2025).
