Aerial ObjectNav Agent (AOA)

Updated 3 July 2026

AOA is an autonomous aerial system that uses vision and language to locate semantic objects in complex 3D environments.
It integrates modules for semantic perception, spatio-temporal memory, and hierarchical planning to address partial observability and safety constraints.
Performance metrics like Success Rate and Success-weighted Path Length in simulation and field trials drive improvements in AOA architecture.

An Aerial ObjectNav Agent (AOA) is an autonomous aerial robotic system designed to locate and approach arbitrary semantic objects in complex 3D environments using high-level perceptual input, typically vision and language modalities. The AOA paradigm is defined by the agent’s ability to interpret instance-level goal descriptions, perceive the environment from an egocentric or overhead perspective, and execute long-horizon search-and-approach behavior under real-world constraints such as partial observability, ambiguous goals, and safety-critical flight dynamics. State-of-the-art AOA systems incorporate modules for semantic perception, spatio-temporal memory, hierarchical planning, and robust visual grounding, and are benchmarked in simulation and real-world settings using standardized metrics including Success Rate (SR) and Success-weighted Path Length (SPL).

1. Formal Problem Formulation and AOA Taxonomy

Aerial Object Goal Navigation (ObjectNav) extends classical navigation to 3D search in unstructured, large-scale settings, using Unmanned Aerial Vehicles (UAVs) for their spatial efficiency and semantic flexibility. The formal task specifies an agent equipped with on-board sensors (RGB(-D) cameras, inertial sensors, optionally LiDAR) and an instance-level semantic goal $c$ comprising at minimum an object category, optionally with spatial attributes and/or a free-form description (e.g., “Object: Statue; Size: 2.75×2.50 units; Description: Oxidized bronze seated figure on dark stone pedestal.”) (Xiao et al., 1 Aug 2025). At each timestep $t$, the AOA receives its current observation $s_t$ (sensor frames, proprioception, history) and must select an action $a_t$ to maximize success probability:

State space: compound of egocentric RGB(-D) observations, pose $[x_t, y_t, z_t, \psi_t]$, and semantic goal $c$.
Action space: discrete or parameterized primitives (e.g., MoveForward, Ascend, RotateLeft, Stop, PickUp).
Observability: partial; no oracle maps, often unknown initial pose.
Evaluation:

$\mathrm{SR} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[d_i \leq \tau], \quad \mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[d_i \leq \tau] \cdot \left(\frac{l_i}{\max(p_i, l_i)}\right)$

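where $N$ is the number of episodes, $d_i$ is the final distance to the target in episode $i$, $\tau$ is the success distance threshold, $l_i$ is the shortest-path length, and $p_i$ is the agent’s path length (Xiao et al., 1 Aug 2025).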
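Both metrics can be computed directly from per-episode logs. The following is a minimal Python sketch, assuming each episode records $d_i$, $l_i$, and $p_i$; the `Episode` container, the function names, and the treatment of a zero-length episode are illustrative assumptions rather than part of any benchmark toolkit.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Episode:
    """Per-episode log (hypothetical container, not a benchmark API)."""
    final_distance: float  # d_i: distance to the target when the agent stops
    shortest_path: float   # l_i: shortest-path length from start to target
    agent_path: float      # p_i: length of the path the agent actually flew


def success_rate(episodes: Sequence[Episode], tau: float) -> float:
    """SR: fraction of episodes that end within tau of the target."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(ep.final_distance <= tau for ep in episodes) / len(episodes)


def spl(episodes: Sequence[Episode], tau: float) -> float:
    """SPL: success weighted by l_i / max(p_i, l_i), averaged over episodes."""
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        if ep.final_distance <= tau:
            longest = max(ep.agent_path, ep.shortest_path)
            # Assumption: a zero-length episode (start on the target) scores 1.
            total += ep.shortest_path / longest if longest > 0 else 1.0
    return total / len(episodes)


episodes = [
    Episode(final_distance=1.0, shortest_path=10.0, agent_path=20.0),  # success, inefficient
    Episode(final_distance=5.0, shortest_path=10.0, agent_path=12.0),  # failure
    Episode(final_distance=0.5, shortest_path=8.0, agent_path=8.0),    # success, optimal
]
print(success_rate(episodes, tau=2.0))
print(spl(episodes, tau=2.0))
```

In this toy log two of three episodes succeed, so SR is 2/3, while SPL is (0.5 + 0 + 1.0)/3 = 0.5: the first success is penalized for flying twice the shortest path.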

AOA is distinguished from ground-based ObjectNav by full 6-DoF motion, complex occlusion, and richer 3D spatial reasoning. AOA strategies fall into four families: zero-shot prompt-driven architectures (Xiao et al., 1 Aug 2025), classical pipeline and finite-state-machine (FSM) approaches (Bähnemann et al., 2017), modular policy RL (Zhang et al., 31 Jan 2026, Yan et al., 22 Jan 2026), and communication-centric frameworks (Dorbala et al., 2024).

2. Canonical Architectures and Core Modules

Several paradigms have emerged for AOA realization:

2.1. Zero-Shot Prompt Architectures

The modular AOA baseline (Xiao et al., 1 Aug 2025) combines frozen large vision-language models: Qwen-VL for image-to-text captioning and the multimodal LLM GPT-4o for planning. The pipeline consists of:

Perception Encoding: Multi-view synchronized RGB(-D) processing into textual captions and compact spatial grids.
Instance-Level Goal Encoding: Structured prompts concatenated with “You are a UAV…” task description.
Context Aggregation: Integration of view captions, depth grids, recent proprioceptive history, and goal into a single monolithic prompt for LLM planning.
LLM Planner: One-shot mapping from the aggregated prompt to an action, issuing parameterized motion commands.

No components are fine-tuned; all reasoning and perception are offloaded to pre-trained models.
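The context-aggregation and planning steps above can be sketched as plain string assembly plus a strict reply parser. This is a minimal illustration, not the UAV-ON implementation: the prompt layout, the reply format, the `Goal`, `build_prompt`, and `parse_action` names, and the task-description string are all assumptions, and the LLM call itself is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Action names come from the task definition; the reply format is an assumption.
ALLOWED_ACTIONS = {"MoveForward", "Ascend", "RotateLeft", "Stop"}


@dataclass
class Goal:
    category: str
    size: str
    description: str

    def to_text(self) -> str:
        return f"Object: {self.category}; Size: {self.size}; Description: {self.description}"


def build_prompt(task_description: str, goal: Goal, captions: Dict[str, str],
                 depth_grids: Dict[str, List[List[float]]], history: List[str]) -> str:
    """Concatenate task text, goal, per-view captions, depth grids, and history."""
    lines = [task_description, "Goal: " + goal.to_text(), "Views:"]
    for view in sorted(captions):
        lines.append(f"- {view}: {captions[view]}")
        for row in depth_grids.get(view, []):
            lines.append("  depth: " + " ".join(f"{d:.1f}" for d in row))
    lines.append("Recent actions: " + (", ".join(history) if history else "none"))
    lines.append("Reply with one action and a magnitude, e.g. 'MoveForward 2.0'.")
    return "\n".join(lines)


def parse_action(reply: str) -> Optional[Tuple[str, float]]:
    """Parse '<Action> [magnitude]'; return None for anything unrecognized."""
    parts = reply.strip().split()
    if not parts or parts[0] not in ALLOWED_ACTIONS:
        return None
    try:
        magnitude = float(parts[1]) if len(parts) > 1 else 0.0
    except ValueError:
        return None
    return parts[0], magnitude


goal = Goal("Statue", "2.75×2.50 units",
            "Oxidized bronze seated figure on dark stone pedestal.")
prompt = build_prompt(
    "You are a UAV searching for a target object.",  # placeholder task text
    goal,
    captions={"front": "A plaza with trees."},
    depth_grids={"front": [[3.0, 4.5], [2.0, 6.0]]},
    history=["Ascend 1.0"],
)
print(prompt)
print(parse_action("MoveForward 2.5"))
```

Rejecting malformed replies (returning `None`) rather than guessing an action is one way to keep a free-text planner from issuing unintended motion commands; what the caller does on rejection is left open here.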

2.2. Hierarchical RL-Memory Architectures

APEX (Zhang et al., 31 Jan 2026) exemplifies a more structured approach: asynchronous modules for spatial mapping, RL-based control, and target identification.

Dynamic Spatio-Semantic Mapping Memory:

High-resolution 3D grids: an Attraction Map (semantic goal proximity, via VLM + segmentation), an Exploration Map (tracking visited, known, and unexplored state), and an Obstacle Map (occupancy). Maps are indexed and updated via geometric back-projection and semantic VLM outputs.
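The map update can be illustrated with a small back-projection sketch. It assumes a pinhole camera aligned with the body x-axis, depth measured along the optical axis, a pose of the form $[x, y, z, \psi]$, and sparse dictionary- and set-based grids; the class and function names are hypothetical, and APEX's actual data structures are not described here.

```python
import math
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]


def backproject(u, v, depth, intrinsics, pose):
    """Map a depth pixel to world coordinates.

    Assumptions: pinhole camera looking along the body x-axis, depth measured
    along the optical axis, pose = (x, y, z, yaw) with yaw about the world z-axis.
    """
    fx, fy, cx, cy = intrinsics
    x, y, z, yaw = pose
    right = (u - cx) * depth / fx   # optical frame: +x right
    down = (v - cy) * depth / fy    # optical frame: +y down
    forward, left, up = depth, -right, -down
    wx = x + math.cos(yaw) * forward - math.sin(yaw) * left
    wy = y + math.sin(yaw) * forward + math.cos(yaw) * left
    wz = z + up
    return wx, wy, wz


def to_voxel(point, resolution) -> Voxel:
    """Quantize a world point to integer grid indices."""
    return tuple(math.floor(c / resolution) for c in point)


class SpatioSemanticMaps:
    """Sparse stand-ins for the attraction, exploration, and obstacle maps."""

    def __init__(self, resolution: float = 0.5):
        self.resolution = resolution
        self.attraction: Dict[Voxel, float] = {}
        self.explored: Set[Voxel] = set()
        self.obstacle: Set[Voxel] = set()

    def integrate(self, u, v, depth, semantic_score, intrinsics, pose) -> Voxel:
        voxel = to_voxel(backproject(u, v, depth, intrinsics, pose), self.resolution)
        self.obstacle.add(voxel)   # a depth return marks an occupied surface
        self.explored.add(voxel)   # simplification: only the hit voxel is marked
        # Keep the strongest semantic evidence seen so far for this voxel.
        self.attraction[voxel] = max(self.attraction.get(voxel, 0.0), semantic_score)
        return voxel


maps = SpatioSemanticMaps(resolution=0.5)
intrinsics = (100.0, 100.0, 32.0, 24.0)  # fx, fy, cx, cy (made-up values)
hit = maps.integrate(32, 24, 2.0, 0.3, intrinsics, pose=(0.0, 0.0, 1.0, 0.0))
print(hit, maps.attraction[hit])
```

A fuller implementation would also mark the free voxels along each ray as explored; that step is omitted to keep the sketch short.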

Action Decision Module:

Markov Decision Process: the state encapsulates dynamic map slices and the UAV pose. The policy is trained with PPO on a composite reward, with reward shaping for semantic “attraction” and exploration coverage.

Target Grounding Module:

Open-vocabulary visual detectors confirm the goal over the “last meter” of the approach, supporting robust final localization.

2.3. Dual-Policy RL

AION (Yan et al., 22 Jan 2026) employs separate RL policies for exploration (AION-e, which maximizes free-space coverage and safety using DINOv2 and depth features) and goal localization (AION-g, which performs goal-conditioned navigation using CLIP-guided semantic attention), with online policy switching triggered by object detection.
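The switching logic can be sketched as a thin controller that routes each observation to one of two policies. This is an illustrative assumption about the control flow only: the `DualPolicyController` name, the confidence threshold, and the per-step (non-latching) switch are hypothetical, and the two policies are stand-in callables rather than trained networks.

```python
from typing import Any, Callable

Policy = Callable[[Any], str]


class DualPolicyController:
    """Route observations to an exploration or a goal-reaching policy."""

    def __init__(self, explore_policy: Policy, goal_policy: Policy,
                 detection_threshold: float = 0.5):
        self.explore_policy = explore_policy
        self.goal_policy = goal_policy
        self.detection_threshold = detection_threshold
        self.mode = "explore"

    def act(self, observation: Any, detection_confidence: float) -> str:
        # Assumption: the switch is re-evaluated every step from the detector score.
        if detection_confidence >= self.detection_threshold:
            self.mode = "goal"
        else:
            self.mode = "explore"
        policy = self.goal_policy if self.mode == "goal" else self.explore_policy
        return policy(observation)


controller = DualPolicyController(
    explore_policy=lambda obs: "RotateLeft",   # stand-in for an exploration policy
    goal_policy=lambda obs: "MoveForward",     # stand-in for a goal-reaching policy
)
print(controller.act(observation=None, detection_confidence=0.1), controller.mode)
print(controller.act(observation=None, detection_confidence=0.9), controller.mode)
```

A practical variant might add hysteresis so that a single missed detection does not bounce the agent back into exploration.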

2.4. Classical FSM and Multirobot Systems

Decentralized ObjectNav (Bähnemann et al., 2017) leverages state estimation (VIO + RTK-GPS), sweep coverage, NMPC collision avoidance with limited communication, and closed-loop pose-based visual servoing—enabling coordinated multi-UAV object search and pickup in field environments.
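Sweep coverage is commonly realized as a back-and-forth ("lawnmower") pattern over the search area. The sketch below generates such waypoints for an axis-aligned rectangle; it is a generic illustration under that assumption, not the planner used by Bähnemann et al., and the function name and parameters are hypothetical.

```python
from typing import List, Tuple


def sweep_waypoints(width: float, height: float,
                    lane_spacing: float) -> List[Tuple[float, float]]:
    """Back-and-forth waypoints covering the rectangle [0, width] x [0, height]."""
    if width <= 0 or height < 0 or lane_spacing <= 0:
        raise ValueError("width and lane_spacing must be positive, height non-negative")
    waypoints = []
    lane = 0
    y = 0.0
    while y <= height + 1e-9:
        # Alternate the sweep direction on every lane.
        start_x, end_x = (0.0, width) if lane % 2 == 0 else (width, 0.0)
        waypoints.append((start_x, y))
        waypoints.append((end_x, y))
        lane += 1
        y = lane * lane_spacing
    return waypoints


print(sweep_waypoints(10.0, 4.0, 2.0))
```

The lane spacing would normally be chosen from the camera footprint at the flight altitude; if the height is not a multiple of the spacing, the last lane stops short of the upper edge and an extra lane may be needed.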

3. Semantic Goal Representation and Perceptual Fusion

Rich semantic goal encoding is central to AOA efficacy. Current best practices (Xiao et al., 1 Aug 2025, Zhang et al., 31 Jan 2026, Yan et al., 22 Jan 2026) include:

Structured goals: Categorical label, physical attributes (planimetric size), and free-form descriptors.
Goal-Prompt Integration: Prepending instance goal descriptions to sensory input summaries and agent state.
Vision-language model (VLM) integration: Zero-shot VLMs (e.g., Qwen-VL, CLIP, VLM + segmentation) generate either attention-weighted semantic similarity maps or textual captions, directly aligning perceptual streams with the semantic goal.
Multimodal fusion: RL and LLM planners fuse semantic and geometric elements; in APEX, VLM output populates the Attraction Map; in AION, CLIP similarity maps modulate the goal-reaching head (Zhang et al., 31 Jan 2026, Yan et al., 22 Jan 2026).
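A semantic similarity map of the kind described above can be sketched as the cosine similarity between per-patch image embeddings and a goal text embedding. The example assumes the embeddings have already been produced by some vision-language encoder; how each system extracts spatial features is not specified here, and the function names and toy vectors are illustrative.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_map(patch_embeddings: List[List[Sequence[float]]],
                   text_embedding: Sequence[float]) -> List[List[float]]:
    """Score every image patch against the goal embedding."""
    return [[cosine(patch, text_embedding) for patch in row]
            for row in patch_embeddings]


# Toy 2x2 grid of 2-D patch embeddings (real encoders use hundreds of dimensions).
patches = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [-1.0, 0.0]],
]
goal_embedding = [1.0, 0.0]
for row in similarity_map(patches, goal_embedding):
    print([round(score, 3) for score in row])
```

High-scoring cells indicate image regions that resemble the goal description; a downstream policy can use the map as an attention prior or write it into a spatial memory.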

4. Control, Memory, and Learning Paradigms

AOA control modules employ varying degrees of hierarchy, asynchrony, and learning algorithms.

Architecture	Memory Type	Control Paradigm	Learning / Inference
APEX (Zhang et al., 31 Jan 2026)	3D grid, semantic	PPO in MDP, asynchronous RL + detector	RL + pretrained VLM
AION (Yan et al., 22 Jan 2026)	Implicit (LSTM)	Dual-policy A3C RL	RL (A3C), CLIP, DINOv2
AOA (UAV-ON) (Xiao et al., 1 Aug 2025)	Prompt/LLM	Prompted LLM	Pretrained VLM + LLM
Decentralized FSM (Bähnemann et al., 2017)	None	FSM + NMPC + vision servo	Classical estimation

Concrete learning setups:

RL-based MDP/POMDP: Proximal Policy Optimization (PPO) (Zhang et al., 31 Jan 2026), A3C (Yan et al., 22 Jan 2026), value- and policy-head separation, history LSTMs, reward shaping for semantic, exploration, and safety signals.
LLM-based Prompting: Full sensorimotor context is reduced to a single prompt for zero-shot inference (Xiao et al., 1 Aug 2025).
Classical methods: EKF for 3D object tracking, trajectory optimization, and coverage path planning (Bähnemann et al., 2017).

Asynchrony is often exploited (e.g., mapping at 1 Hz versus decision-making at 5–10 Hz (Zhang et al., 31 Jan 2026)) to balance the computational cost of VLM processing against reactive safety and timely semantic state updates.
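The effect of running modules at different rates can be shown with a deterministic, single-threaded simulation in which a fast decision loop always acts on the most recent map snapshot published by a slow mapping loop. The 5 Hz and 1 Hz defaults follow the rates quoted above; the function name and the integer-tick scheduling are assumptions, and a real system would use separate threads or processes.

```python
from typing import List, Tuple


def simulate_multirate(duration_s: int, decision_hz: int = 5,
                       mapping_hz: int = 1) -> List[Tuple[int, int]]:
    """Return (decision_tick, map_version_used) pairs for a two-rate loop."""
    if decision_hz <= 0 or mapping_hz <= 0 or decision_hz % mapping_hz != 0:
        raise ValueError("decision_hz must be a positive multiple of mapping_hz")
    ticks_per_map_update = decision_hz // mapping_hz
    map_version = 0
    log = []
    for tick in range(duration_s * decision_hz):
        if tick % ticks_per_map_update == 0:
            map_version += 1              # slow module publishes a new snapshot
        log.append((tick, map_version))   # fast loop acts on the latest snapshot
    return log


for tick, version in simulate_multirate(duration_s=2):
    print(f"decision {tick}: using map v{version}")
```

Each map snapshot is reused for five consecutive decisions, which is the trade-off that asynchrony makes: the controller stays reactive while semantic information may be up to one mapping period stale.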

5. Evaluation Benchmarks and Empirical Performance

Evaluation employs both large-scale photo-realistic simulation and real-world robotics competitions:

UAV-ON Benchmark: 14 Unreal Engine environments, 1,270 annotated target objects, variable visibility and spatial complexity. Metrics: SR, Oracle Success Rate (OSR), SPL, and Distance to Success (DTS). Baselines (Random, CLIP-H, AOA variants) yield SR ≈ 3.7–7.3% and SPL ≈ 0.87–4.15%; collision rates often exceed 45% (Xiao et al., 1 Aug 2025).
AI2-THOR / IsaacSim / ProcTHOR: AION achieves SR/SPL up to 95.0%/55.2% (unseen) and a collision rate (CR) as low as 2.3–7.6%, outperforming 2D or monolithic baselines (Yan et al., 22 Jan 2026).
Field Trials (MBZIRC): The classical decentralized FSM-based AOA achieves ≈ 90% servo-pickup success with a state RMSE of ≈ 15 cm (Bähnemann et al., 2017).

Experimental insights:

Hierarchical and memory-rich designs (APEX) yield +4.2% SR over the prior state of the art, lower latency, fewer collisions, and a higher “safe distance to collision” (Zhang et al., 31 Jan 2026).
Explicit separation of exploration and goal-reaching (AION) is critical for strong generalization and safety.
Prompt-based AOA delivers high OSR but low SPL and poor collision performance.

6. Limitations, Failure Modes, and Prospective Directions

AOA systems face the following challenges across architectures:

Zero-Shot Generalization: LLM-based pipelines struggle to issue the stop action accurately and to follow efficient paths; SR and SPL remain in the single digits (Xiao et al., 1 Aug 2025).
Perceptual Fusion Bottlenecks: Overloading LLMs with multitask reasoning degrades both goal grounding and motion reliability.
Safety and Collision: Collision rates of 45–65% in LLM-driven systems are prohibitive for real UAV deployment without additional reactive safety modules (Xiao et al., 1 Aug 2025).
Dependence on Visual Detectors: Quality of semantic goals and object detectors (e.g., YOLO in AION) is a critical bottleneck, with poor illumination or occlusions directly undermining policy success (Yan et al., 22 Jan 2026).
Scalability and Real-Time Constraints: Heavy VLM inference, map-building, and parallelism must be engineered to maintain control-rate reaction, with explicit asynchrony in architecture (e.g., APEX) (Zhang et al., 31 Jan 2026).
Lack of Explicit Global Mapping: Most mapless or implicit-memory methods still fall short of full environment coverage or robust long-horizon planning (Yan et al., 22 Jan 2026).

Prospective directions substantiated in the literature include:

Hybrid Architectures: Decoupling semantic reasoning (LLM/VLM) from low-level geometric and safety modules, integrating learned geometric mapping or semantic occupancy grids (Zhang et al., 31 Jan 2026, Xiao et al., 1 Aug 2025).
End-to-End Fine-Tuning: Leveraging supervised path learning or RL to align LLM outputs with expert (e.g., A*) navigational trajectories.
Safety-Critical Layers: Modular layering of collision avoidance and semantic target decision (APEX, AION).
Prompt/Reward Shaping: Empirical reduction of hallucination and action ambiguity via systematic prompt tuning or reward ablation (Dorbala et al., 2024, Yan et al., 22 Jan 2026).

7. Comparison Across Representative Systems

System	Semantic Integration	Planning/Control	Reported SR / SPL	Distinctive Feature	Reference
APEX	VLM-based maps	Hierarchical asynchronous RL + detector	13.3% / 10.1%	Decoupled 3D grid memory, asynchronous modules	(Zhang et al., 31 Jan 2026)
AION	CLIP/DINOv2	Dual-policy A3C	95%* / 55.2%*	Decoupled explore/goal, real-time drone	(Yan et al., 22 Jan 2026)
UAV-ON AOA	Prompt + LLM	LLM function calls	7.3% / 4.06%	Zero-shot, multi-caption/depth fusion	(Xiao et al., 1 Aug 2025)
Classical Decentralized AOA	FSM	NMPC/servoing/auction	2nd in MBZIRC	Full onboard S&R multi-UAV architecture	(Bähnemann et al., 2017)

*On synthetic indoor split; open-world performance lower.

AOA research thus emphasizes the integration of multi-modal semantic encoding, dynamic spatial memory, hierarchical decision-making, robust control policies, and proactive safety in 3D aerial settings. Current leading solutions indicate that hybrid, asynchronous, and explicitly memory-rich designs are needed for scalable open-world autonomy.