CausalNav: Outdoor Semantic Navigation

Updated 4 July 2026

CausalNav is a long-term, language-guided outdoor navigation system that uses a continuously updated hierarchical semantic scene graph to integrate offline map data with live perception.
It employs open-vocabulary object tracking via YOLO-World and camera–LiDAR fusion for precise object localization and dynamic object filtering in real time.
The system features a hierarchical planning stack combining retrieval-augmented generation, Dijkstra-based routing, and NMPC-CBF for local collision-aware control.

CausalNav is a long-term, language-guided embodied navigation system for autonomous mobile robots in dynamic outdoor environments. It is presented as the first scene graph-based semantic navigation framework tailored for dynamic outdoor environments, and it centers on a continuously updated hierarchical semantic scene graph, the Embodied Graph, which integrates coarse-grained offline map data with fine-grained online object entities. The graph functions simultaneously as a scene representation, a retrievable knowledge base for Retrieval-Augmented Generation, and a planning substrate for open-vocabulary semantic navigation, long-range route generation, and local motion in the presence of dynamic objects (Duan et al., 5 Jan 2026).

1. Problem formulation and system scope

CausalNav is motivated by a gap between traditional outdoor navigation systems and the requirements of long-horizon semantic navigation in real environments. Traditional outdoor systems are described as point-to-point and map-dependent, with limited semantic reasoning, while most semantic or visual-language navigation systems are described as being developed for indoor, relatively static environments with limited environmental change. CausalNav addresses open-vocabulary, long-horizon, semantically grounded navigation in outdoor scenes that change over time, including moving vehicles and pedestrians, stale map priors, and the need to reconcile offline map data with live perception (Duan et al., 5 Jan 2026).

The paper organizes the system into three stages: open-vocabulary object tracking and ego-motion estimation, dynamic object filtering and Embodied Graph construction, and Embodied Graph updating and human language navigation. The intended operating context is large-scale outdoor space, where navigation must move across different semantic and spatial granularities, from building-level destinations to object-level targets such as “the fire hydrant beside the library.” The design premise is that robust navigation requires more than geometric localization or reactive control; it requires a persistent semantic memory that can be queried by language and continuously revised as the environment changes (Duan et al., 5 Jan 2026).

A central implication is that CausalNav’s use of “causal” is operational and architectural rather than a claim of Pearl-style causal identification. The system’s contribution lies in graph-based semantic memory, hierarchical retrieval, and dynamic updating for outdoor robotics, rather than in backdoor adjustment or structural intervention-effect estimation.

2. Perception, localization, and graph construction pipeline

CausalNav receives RGB images, LiDAR point clouds, IMU data, and, in the real system, RTK GNSS/INS. For semantic perception it uses YOLO-World for open-vocabulary object detection on each RGB frame $I_t$, with ByteTrack for temporal association. The tracked objects at time $t$ are defined as

$\mathcal{S}_t = \left\{ S_i = \left(c_i,\text{2DBBox}_i,\mathcal{B}_i\right)\mid i=1,2,\ldots,n \right\}.$

Here $c_i$ is the object description, $\text{2DBBox}_i$ is the 2D bounding box, and $\mathcal{B}_i$ is the segmentation mask (Duan et al., 5 Jan 2026).

Because depth cameras are described as unreliable outdoors, CausalNav localizes objects through camera–LiDAR fusion. The LiDAR point cloud is

$\mathcal{P}=\left\{\mathbf{P}_i=(x_i,y_i,z_i)\right\}_{i=1}^N,$

and each point is projected into the image plane by

${}^{c}\mathbf{p}_i=\mathbf{K}\cdot \mathbf{H}\cdot \mathbf{P}_i,$

where $\mathbf{K}$ is the camera intrinsic matrix and $\mathbf{H}$ is the calibrated LiDAR-camera extrinsic transform. Points whose image projections fall inside an object mask define the object-specific 3D cloud,

$\mathcal{P}_{S_i}=\left\{\mathbf{P}_j\in\mathcal{P}\mid {}^{c}\mathbf{p}_j\in\mathcal{B}_i\right\}.$

A minimum-volume 3D bounding box is fitted to this cloud, and its centroid gives the object position (Duan et al., 5 Jan 2026).
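The projection-and-masking step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `K` is taken to be a 3x3 intrinsic matrix, `H` a 4x4 homogeneous extrinsic, and the mask a boolean image indexed as `mask[row][col]`. The function names are hypothetical, and the axis-aligned box used for the centroid is a simpler stand-in for the minimum-volume box described above.

```python
import math

def project_point(K, H, P):
    """Project a LiDAR point to pixel coordinates; None if it lies behind the camera."""
    x, y, z = P
    # LiDAR frame -> camera frame using the 4x4 extrinsic (assumed homogeneous form).
    Xc = [H[r][0] * x + H[r][1] * y + H[r][2] * z + H[r][3] for r in range(3)]
    if Xc[2] <= 0.0:
        return None
    # Camera frame -> pixels using the 3x3 intrinsic matrix, normalised by depth.
    u = (K[0][0] * Xc[0] + K[0][1] * Xc[1] + K[0][2] * Xc[2]) / Xc[2]
    v = (K[1][0] * Xc[0] + K[1][1] * Xc[1] + K[1][2] * Xc[2]) / Xc[2]
    return u, v

def object_cloud(points, K, H, mask):
    """Keep the points whose projection lands on a True pixel of the mask."""
    height, width = len(mask), len(mask[0])
    selected = []
    for P in points:
        uv = project_point(K, H, P)
        if uv is None:
            continue
        col, row = math.floor(uv[0]), math.floor(uv[1])
        if 0 <= row < height and 0 <= col < width and mask[row][col]:
            selected.append(P)
    return selected

def box_centroid(cloud):
    """Centre of the axis-aligned box around the cloud (simplifying assumption)."""
    mins = [min(p[i] for p in cloud) for i in range(3)]
    maxs = [max(p[i] for p in cloud) for i in range(3)]
    return tuple((lo + hi) / 2.0 for lo, hi in zip(mins, maxs))
```

Calling `box_centroid(object_cloud(points, K, H, mask))` then gives a position estimate for one tracked object in the LiDAR frame.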

Robot localization is supplied by FAST-LIO2. If the ego pose in the world frame is ${}^{W}\mathbf{T}_{L}$ and the object pose in the LiDAR frame is ${}^{L}\mathbf{T}_{O}$, then the object pose in the world frame is

${}^{W}\mathbf{T}_{O}={}^{W}\mathbf{T}_{L}\,{}^{L}\mathbf{T}_{O}.$
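The frame change described above amounts to composing two rigid transforms. A minimal sketch with 4x4 homogeneous matrices follows. The helper names are hypothetical, and the planar pose constructor is an assumption made to keep the example short; FAST-LIO2 itself outputs full 6-DoF poses.

```python
import math

def matmul4(A, B):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def planar_pose(x, y, yaw, z=0.0):
    """Homogeneous transform for a pose with yaw-only rotation (assumed for brevity)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]

def object_in_world(T_world_lidar, T_lidar_object):
    """Object pose in the world frame: ego pose composed with the LiDAR-frame pose."""
    return matmul4(T_world_lidar, T_lidar_object)
```

For a robot at (10, 5) facing +y and an object 2 m straight ahead in the LiDAR frame, `object_in_world(planar_pose(10, 5, math.pi / 2), planar_pose(2, 0, 0))` places the object at approximately (10, 7) in the world frame.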

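Each localized object becomes an object node in the Embodied Graph, carrying its description $c_i$ and its world-frame position. Graph insertion and maintenance follow an update rule in which each newly localized object is either inserted as a new node or used to update an existing one. This yields a graph that is continuously corrected by new observations rather than treated as static (Duan et al., 5 Jan 2026).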
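One way to picture an insert-or-update step is the sketch below. It is not the paper's rule: the association test (same label within a hypothetical `match_radius`) and the running-mean position update are assumptions chosen for illustration, as are all names.

```python
import math
from dataclasses import dataclass

@dataclass
class ObjectNode:
    label: str            # open-vocabulary description
    position: tuple       # world-frame position (x, y, z)
    observations: int = 1

def upsert(nodes, label, position, match_radius=1.0):
    """Update a matching node if one exists, otherwise insert a new node.

    Matching by label plus distance is an assumed association rule.
    """
    for node in nodes:
        if node.label == label and math.dist(node.position, position) <= match_radius:
            n = node.observations
            # Running mean of all observed positions for this node.
            node.position = tuple((n * a + b) / (n + 1)
                                  for a, b in zip(node.position, position))
            node.observations = n + 1
            return node
    node = ObjectNode(label, tuple(position))
    nodes.append(node)
    return node
```

Repeated detections of the same bench therefore refine one node, while a bench seen several meters away, or a different label at the same place, creates a new node.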

The ego-vehicle state is stored as

$E_t=\left(\mathbf{R}_t,\mathbf{p}_t,\mathbf{v}_t\right),$

with orientation $\mathbf{R}_t$, position $\mathbf{p}_t$, and velocity $\mathbf{v}_t$. When the robot moves farther than a distance threshold from the most recent ego node, a new ego node is added and linked to the previous one by a trajectory edge. These ego nodes and edges preserve historical trajectory structure for later global planning (Duan et al., 5 Jan 2026).
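The thresholded insertion of ego nodes resembles keyframe selection. A minimal sketch, assuming the threshold is a Euclidean distance from the last stored ego node (the name `min_travel` and its default value are hypothetical):

```python
import math

def build_ego_chain(positions, min_travel=2.0):
    """Return (nodes, edges) for a stream of ego positions.

    A new ego node is stored only after the robot has moved more than
    min_travel from the last stored node; each new node is linked to its
    predecessor by a trajectory edge given as a pair of node indices.
    """
    nodes, edges = [], []
    for p in positions:
        if not nodes:
            nodes.append(p)
            continue
        if math.dist(nodes[-1], p) > min_travel:
            nodes.append(p)
            edges.append((len(nodes) - 2, len(nodes) - 1))
    return nodes, edges
```

The resulting chain is a sparse graph of visited places, which is what a later shortest-path search over the robot's own history would traverse.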

3. The Embodied Graph as hierarchical semantic memory

The Embodied Graph is the central representational object in CausalNav. It is explicitly described as both a scene operator and a memory tank, preserving scene, object, event, and temporal information for retrieval and planning. Its node types include object nodes, ego-vehicle nodes, building nodes, and hierarchical clustering nodes (Duan et al., 5 Jan 2026).

The graph is hierarchical. Coarse-grained map entities such as buildings occupy the highest semantic level, while fine-grained object entities occupy the lowest level. Building nodes are extracted from offline maps, and cluster nodes are defined by

$v_k=\left(\mathcal{C}_k,\,s_k,\,\bar{\mathbf{p}}_k\right),$

where $\mathcal{C}_k$ is a cluster of lower-level nodes, $s_k$ is the semantic summary generated by the LLM, and $\bar{\mathbf{p}}_k$ is the mean spatial position of the cluster. Parent–child edges link each cluster node to its member object nodes (Duan et al., 5 Jan 2026).
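As a small illustration of the three ingredients of a cluster node (member set, LLM summary, mean position), the sketch below builds one from already-known member positions. The class and function names are hypothetical, and the summary is passed in as a plain string because the LLM call itself is outside the scope of the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class ClusterNode:
    members: List[int]              # ids of the lower-level nodes
    summary: str                    # semantic summary (LLM-generated in the real system)
    centroid: Tuple[float, ...]     # mean spatial position of the members

def make_cluster_node(member_ids: List[int],
                      positions: Dict[int, Tuple[float, ...]],
                      summary: str) -> ClusterNode:
    """Create a cluster node and return it together with its parent-child edges."""
    pts = [positions[i] for i in member_ids]
    centroid = tuple(sum(axis) / len(pts) for axis in zip(*pts))
    return ClusterNode(list(member_ids), summary, centroid)

def parent_child_edges(cluster_id: int, node: ClusterNode) -> List[Tuple[int, int]]:
    """One (parent, child) edge per member of the cluster."""
    return [(cluster_id, m) for m in node.members]
```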

Hierarchical clustering is driven by a joint spatial-semantic similarity that combines a spatial term with a semantic term. The semantic term is the cosine similarity between node embeddings $\mathbf{e}_i$ and $\mathbf{e}_j$,

$\mathrm{sim}_{\mathrm{sem}}(v_i,v_j)=\frac{\mathbf{e}_i^{\top}\mathbf{e}_j}{\lVert\mathbf{e}_i\rVert\,\lVert\mathbf{e}_j\rVert}.$

This means that graph aggregation is not purely geometric: nearby semantically compatible objects are grouped together, and semantically similar labels can be associated even when their lexical forms differ (Duan et al., 5 Jan 2026).
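A joint similarity of this kind can be sketched as a weighted sum of a spatial affinity and the cosine similarity of embeddings. The weighting `alpha`, the Gaussian spatial kernel, and its width `sigma` are assumptions made for illustration; they are not taken from the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0

def spatial_affinity(p, q, sigma=5.0):
    """Gaussian kernel on Euclidean distance (assumed form): 1 when co-located."""
    return math.exp(-math.dist(p, q) ** 2 / (2.0 * sigma ** 2))

def joint_similarity(p_i, e_i, p_j, e_j, alpha=0.5, sigma=5.0):
    """Convex combination of spatial and semantic similarity (assumed form)."""
    return alpha * spatial_affinity(p_i, p_j, sigma) + (1.0 - alpha) * cosine(e_i, e_j)
```

With this form, two co-located nodes with identical embeddings score 1, co-located nodes with orthogonal embeddings score `alpha`, and distance lowers the score smoothly, so both proximity and meaning influence which nodes are merged.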

The resulting graph stores at least three classes of information. Spatial information includes world positions, trajectory connectivity, cluster centroids, and proximity relations. Semantic information includes object labels from open-vocabulary detection, building names or types from offline maps, and LLM-generated cluster summaries. Temporal information includes tracked dynamic-object trajectories, historical ego-vehicle paths, and online updates over time. This combination makes the Embodied Graph both a semantic world model and a long-term memory structure (Duan et al., 5 Jan 2026).

4. Language grounding, retrieval, and hierarchical planning

CausalNav uses the Embodied Graph as a retrievable knowledge base for Retrieval-Augmented Generation. Given a natural-language query $q$, retrieval proceeds hierarchically. At level $l$, the probability of selecting node $v$ takes a softmax form,

$P(v\mid q,l)=\frac{\exp\left(\mathrm{sim}(q,d_v)/\tau\right)}{\sum_{v'\in\mathcal{V}_l}\exp\left(\mathrm{sim}(q,d_{v'})/\tau\right)},$

where $d_v$ is the node description, $\mathcal{V}_l$ is the set of candidate nodes at level $l$, and $\tau$ controls the sharpness of the distribution. A hierarchical retrieval path, with one node per level, is then scored from these per-level probabilities. If the robot’s current location is known, candidates are reranked with a term that depends on their spatial relation to the robot. Retrieval therefore combines semantic compatibility with spatial plausibility (Duan et al., 5 Jan 2026).
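The interplay of semantic scoring and spatial reranking can be illustrated with a short sketch. It assumes that query-to-node similarities are already available as numbers, turns them into selection probabilities with a softmax whose sharpness parameter is called `tau` here, and then down-weights distant candidates with an exponential distance penalty. The penalty form and the rate `lam` are assumptions for illustration, not the paper's reranking formula.

```python
import math

def selection_probabilities(similarities, tau=0.1):
    """Softmax over similarity scores; smaller tau gives a sharper distribution."""
    m = max(similarities)  # subtract the maximum for numerical stability
    exps = [math.exp((s - m) / tau) for s in similarities]
    total = sum(exps)
    return [e / total for e in exps]

def rerank_by_distance(candidates, robot_position, lam=0.01):
    """Rerank (name, probability, position) triples by probability times a
    distance penalty exp(-lam * distance). The penalty form is an assumption."""
    scored = [(name, prob * math.exp(-lam * math.dist(pos, robot_position)))
              for name, prob, pos in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

For example, a candidate with probability 0.6 that lies 500 m away ranks below one with probability 0.4 that lies 10 m away when `lam=0.01`, which is the kind of trade-off between semantic fit and spatial plausibility described above.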

After target grounding, CausalNav separates global and local planning. If the target lies on or is connected to the robot’s historical trajectory in the graph, a Dijkstra-based shortest path is computed. Otherwise, the system generates a coarse route using offline road map data or external APIs such as Google Maps or Amap. The resulting waypoint sequence is

$\mathcal{W}=\left\{\mathbf{w}_1,\mathbf{w}_2,\ldots,\mathbf{w}_M\right\}.$

This stage handles long-range movement across large outdoor environments by combining graph memory with existing map priors (Duan et al., 5 Jan 2026).
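For the case where the target is connected to the robot's historical trajectory, the shortest-path step is a textbook Dijkstra search. The sketch below runs it on a weighted adjacency dictionary; the graph encoding and function name are hypothetical, and edge weights are assumed to be non-negative travel distances.

```python
import heapq

def dijkstra(adjacency, start, goal):
    """Return (path, cost) for the cheapest route, or (None, inf) if unreachable.

    adjacency maps a node to a list of (neighbour, non-negative weight) pairs.
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale queue entry
        visited.add(u)
        if u == goal:
            break
        for v, w in adjacency.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if goal not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the goal.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[goal]
```

The returned node sequence plays the role of the waypoint list that is handed to the local planner.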

Local motion uses RH-Map for real-time 3D local mapping with dynamic-object removal. Within the feasible region of the local map, the system computes an initial path using Informed-RRT*, then smooths it with B-spline interpolation and attaches a heading to each point to obtain the reference trajectory.
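The smoothing and heading-attachment steps can be sketched for a planar path as follows. This is an illustration only: it uses a uniform cubic B-spline with repeated end control points so that the curve starts and ends on the path's end points, and it takes the heading from the direction to the next sample. Both choices, and all names, are assumptions rather than the paper's implementation.

```python
import math

def cubic_bspline(control, samples_per_segment=5):
    """Sample a uniform cubic B-spline through the region of the control points."""
    # Repeat the end points so the curve begins and ends exactly on them.
    pts = [control[0]] * 2 + list(control) + [control[-1]] * 2
    curve = []
    for i in range(len(pts) - 3):
        p0, p1, p2, p3 = pts[i:i + 4]
        for s in range(samples_per_segment):
            t = s / samples_per_segment
            # Uniform cubic B-spline basis functions (they sum to 1).
            b0 = (1 - t) ** 3 / 6
            b1 = (3 * t ** 3 - 6 * t ** 2 + 4) / 6
            b2 = (-3 * t ** 3 + 3 * t ** 2 + 3 * t + 1) / 6
            b3 = t ** 3 / 6
            curve.append(tuple(b0 * a + b1 * b + b2 * c + b3 * d
                               for a, b, c, d in zip(p0, p1, p2, p3)))
    curve.append(tuple(float(v) for v in control[-1]))
    return curve

def attach_heading(path):
    """Turn (x, y) samples into (x, y, theta), with theta pointing to the next sample."""
    out = []
    for k, (x, y) in enumerate(path):
        if k + 1 < len(path):
            nx, ny = path[k + 1]
            theta = math.atan2(ny - y, nx - x)
        else:
            theta = out[-1][2] if out else 0.0  # last sample keeps the previous heading
        out.append((x, y, theta))
    return out
```

Applying `attach_heading(cubic_bspline(waypoints))` to the raw planner output yields a smooth sequence of poses that a tracking controller can follow.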

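Control is implemented by NMPC-CBF, which solves a finite-horizon optimal control problem for tracking the reference trajectory, subject to system dynamics, initial state, state and control constraints, and the control-barrier constraint, which in its standard discrete-time form reads

$h_j(\mathbf{x}_{k+1})-h_j(\mathbf{x}_k)\ge-\gamma\,h_j(\mathbf{x}_k),\quad 0<\gamma\le 1.$

The safe set around dynamic obstacle $j$ is typically defined by

$h_j(\mathbf{x})=\lVert\mathbf{p}-\mathbf{p}_j\rVert^2-d_{\mathrm{safe}}^2\ge 0,$

where $\mathbf{p}$ is the robot position, $\mathbf{p}_j$ is the obstacle position, and $d_{\mathrm{safe}}$ is a safety distance. The hierarchy therefore runs from language grounding, to graph retrieval, to route generation, to local collision-aware control (Duan et al., 5 Jan 2026).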
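To make the barrier condition concrete, the sketch below checks a single prediction step against a circular safe set around one obstacle. It assumes the commonly used discrete-time control-barrier condition, in which the barrier value may shrink by at most a fraction `gamma` per step. The function names, `d_safe`, and `gamma` are illustrative assumptions, and a real NMPC-CBF controller would impose this as a constraint inside the optimizer rather than check it afterwards.

```python
def barrier(p_robot, p_obstacle, d_safe):
    """Squared distance minus squared safety radius: non-negative means safe."""
    dx = p_robot[0] - p_obstacle[0]
    dy = p_robot[1] - p_obstacle[1]
    return dx * dx + dy * dy - d_safe ** 2

def cbf_step_ok(p_now, p_next, obs_now, obs_next, d_safe=1.0, gamma=0.5):
    """Discrete-time CBF condition: h(next) - h(now) >= -gamma * h(now)."""
    h_now = barrier(p_now, obs_now, d_safe)
    h_next = barrier(p_next, obs_next, d_safe)
    return h_next - h_now >= -gamma * h_now
```

With the robot at (0, 0) and a static obstacle at (3, 0), a cautious step to (0.5, 0) passes the check, while a jump to (1.5, 0) is rejected even though it is not yet a collision, because it consumes the safety margin too quickly.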

5. Dynamic-object handling and online graph maintenance

A distinguishing feature of CausalNav is that dynamic objects are treated explicitly during both graph construction and motion planning. Motion classification combines CenterPoint for real-time BEV point-cloud detection with LIOsegmot for velocity estimation, and objects are categorized as dynamic, static, or quasi-static (Duan et al., 5 Jan 2026).

The graph-maintenance mechanism is the spatio-temporal corridor, defined for each tracked object from its historical bounding boxes. This corridor records trajectory and shape across time. When an object's displacement exceeds a threshold, its corridor is excluded and the corresponding dynamic nodes are removed from graph construction. The system therefore does not simply threshold instantaneous velocity; it uses temporal persistence to decide whether an observed entity should be retained in the graph’s semantic memory (Duan et al., 5 Jan 2026).
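The difference between thresholding instantaneous velocity and using a history of boxes can be shown with a small sketch. Here each track is reduced to the chronological list of its bounding-box centers, and an object is flagged as dynamic when it has moved more than a threshold away from where it was first seen. The displacement measure, the threshold value, and the names are assumptions for illustration, not the paper's corridor definition.

```python
import math

def is_dynamic(box_centers, displacement_threshold=1.0):
    """True if any later box center lies farther than the threshold from the first."""
    if len(box_centers) < 2:
        return False
    start = box_centers[0]
    return max(math.dist(start, c) for c in box_centers[1:]) > displacement_threshold

def keep_static_tracks(tracks, displacement_threshold=1.0):
    """Drop tracks flagged as dynamic; the rest remain eligible for the graph."""
    return {track_id: history for track_id, history in tracks.items()
            if not is_dynamic(history, displacement_threshold)}
```

A parked car whose box jitters by a few centimeters stays in the graph, while a pedestrian who drifts slowly but steadily is removed once the accumulated displacement crosses the threshold.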

Algorithm 1, Online Embodied Graph Updating, describes the operational loop. The graph is initialized, scene observations are computed continuously, each object is localized in the world frame and inserted or updated, dynamic-object corridors are removed once motion exceeds the displacement threshold, ego nodes and edges are updated, hierarchical clustering is recomputed, and new cluster nodes and edges are added. The system repeats this loop continuously until shutdown. The abstract and method also state that the Embodied Graph is continuously updated within a temporal window, although no explicit window-length formula is provided (Duan et al., 5 Jan 2026).

The module frequencies are reported concretely. Open-vocabulary object tracking and ego-motion run at 30 Hz, spatio-temporal corridor filtering at 20 Hz, local dynamic mapping and planning at 10 Hz, and hierarchical clustering and Embodied Graph updates at 1 Hz. Mean per-cycle latency is 105 ms, compared with 95 ms for NoMaD, 150 ms for ViNT, 110 ms for GNM, and 180 ms for CityWalker. This places dynamic graph maintenance within a real-time outdoor robotics loop rather than as an offline mapping stage (Duan et al., 5 Jan 2026).
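The reported rates imply a multi-rate loop in which fast perception runs many times for every graph update. The sketch below shows one simple way to interleave the four modules on a common 60 Hz tick, which is the least common multiple of the reported rates. The single-threaded scheduler and the module names are assumptions for illustration; the source does not describe how the modules are actually scheduled.

```python
RATES_HZ = {
    "tracking_and_ego_motion": 30,
    "corridor_filtering": 20,
    "local_mapping_and_planning": 10,
    "clustering_and_graph_update": 1,
}
BASE_HZ = 60  # least common multiple of the rates above

def modules_due(tick):
    """Names of the modules that should run on this base tick."""
    return [name for name, hz in RATES_HZ.items() if tick % (BASE_HZ // hz) == 0]

def runs_per_second():
    """Count how often each module fires during one second of base ticks."""
    counts = {name: 0 for name in RATES_HZ}
    for tick in range(BASE_HZ):
        for name in modules_due(tick):
            counts[name] += 1
    return counts
```

Over one second the counts reproduce the reported rates, so the slow graph update is amortized across many fast perception and planning cycles.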

6. Empirical results, comparative position, and limitations

CausalNav is evaluated in both simulation and real-world outdoor settings. The simulation uses a Gazebo-based urban environment with a ground robot, RealSense D435i, 3D LiDAR, pedestrians, and vehicles. Evaluation uses 25 randomly sampled tasks with 10 trials each, divided into short-range, medium-range, and long-range categories. A task is successful if the robot reaches within 10 meters of the target. Metrics are Success Rate (SR), Success weighted by Path Length (SPL), Collision Count (CC), and Trajectory Length (TL) (Duan et al., 5 Jan 2026).
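The two headline metrics can be computed as in the sketch below. Success uses the 10 m criterion stated above. SPL follows the widely used definition in which each successful episode contributes the ratio of the shortest path length to the longer of the shortest and actual path lengths. Whether the paper uses exactly this normalization is an assumption, and the function names are hypothetical.

```python
import math

def reached(final_position, target_position, radius=10.0):
    """Success criterion: the robot ends within `radius` meters of the target."""
    return math.dist(final_position, target_position) <= radius

def success_rate(successes):
    """Percentage of successful episodes."""
    return 100.0 * sum(1 for s in successes if s) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, as a percentage."""
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            total += shortest / max(taken, shortest)
    return 100.0 * total / len(successes)
```

For four episodes with outcomes (success, success, failure, success), shortest lengths of 40, 80, 140, and 50 m, and traveled lengths of 50, 80, 200, and 100 m, SR is 75 and SPL is about 57.5, which shows how detours lower SPL even when the target is reached.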

On short-range tasks, CausalNav achieves SR 100, SPL 88.9, CC 0.2, and TL 40.66. CityWalker also achieves SR 100, but with SPL 82.4 and CC 1.2. On medium-range tasks, CausalNav records SR 92, SPL 82.2, CC 0.6, and TL 83.16, compared with CityWalker’s SR 85, SPL 73.6, and CC 3.4. On long-range tasks, CausalNav reaches SR 80, SPL 66.0, CC 1.2, and TL 141.82; CityWalker also reaches SR 80 but with CC 4.5, while ViNT falls to SR 48 and GNM to SR 0 (Duan et al., 5 Jan 2026).

The online-update ablation is one of the strongest reported effects. Without Embodied Graph updates, performance is SR 78, SPL 54.7, CC 1.8, TL 120.35. With Embodied Graph updates, performance improves to SR 90, SPL 80.1, CC 1.1, TL 98.25. The system also compares local and API-based LLMs: phi4-14B reaches SR 83, SPL 69.5, CC 1.1, TL 106.27; DeepSeek-R1-Distill-14B reaches SR 85, SPL 72.1, CC 1.1, TL 103.63; and GPT-4o reaches SR 88, SPL 75.3, CC 1.0, TL 103.24. The reported interpretation is that GPT-4o is best but only modestly better than local open-source models because hierarchical retrieval over the Embodied Graph reduces hallucination (Duan et al., 5 Jan 2026).

Real-world experiments use a wheeled mobile robot with an Intel Core i9-13900H CPU, an NVIDIA RTX 4070 GPU, a RealSense D435i, an RSHelios LiDAR, and RTK GNSS/INS at 10 Hz with 5 cm accuracy. In a short-range object-level instruction of approximately 130 m, only ViNT and CausalNav succeed. In a long-range building-level instruction of approximately 512 m, only CausalNav succeeds; all other systems fail, mainly due to collisions. The paper identifies sensitivity to lighting and environmental changes, and weak handling of dynamic obstacles, as key failure modes of other systems under real outdoor conditions (Duan et al., 5 Jan 2026).

Within the broader literature, CausalNav occupies a different niche from indoor causal-navigation methods and temporal causal modeling tools. CausalVLN formulates vision-and-language navigation as deconfounded multimodal representation learning using backdoor adjustment over visual and linguistic confounders in R2R, RxR, and REVERIE (Wang et al., 2024). Causality-Aware Navigation emphasizes one-step action-conditioned transition understanding as an auxiliary objective across RoboTHOR, Habitat, and R2R (Wang et al., 18 Jun 2025). By contrast, CausalNav is organized around a scene graph-based outdoor semantic memory and hierarchical planning stack. More generally, methods such as NAVAR learn Granger-causal influence graphs from multivariate time series (Bussmann et al., 2020), and graphical time-series models formalize intervention effects over delayed dynamical systems (Eichler et al., 2012), but neither class of work provides CausalNav’s outdoor semantic navigation architecture.

The limitations reported for CausalNav concern scalability of graph and memory as environments grow, robustness under extreme lighting or weather, and long-horizon consistency over very long-term deployment. The paper also identifies future directions in graph compression, improved memory recall, stronger multimodal fusion, and lifelong learning and exploration. A plausible implication is that CausalNav’s central contribution is architectural integration: it links open-vocabulary perception, semantic memory, graph retrieval, and dynamic local planning into a single long-term outdoor navigation system, while leaving formal uncertainty modeling and very-long-term memory management as open problems (Duan et al., 5 Jan 2026).