- The paper introduces CausalNav, a navigation system that integrates dynamic embodied graphs with LLM-based semantic reasoning for outdoor robotic autonomy.
- The system combines open-vocabulary object tracking, precise ego-motion estimation, and dynamic object filtering to improve real-time performance and safety.
- CausalNav achieves state-of-the-art results in simulation and real-world tests, with high success rates and fewer collisions in dynamic scenarios.
CausalNav: A Scene Graph-based Semantic Navigation System for Dynamic Outdoor Robot Autonomy
Introduction
CausalNav addresses the challenge of language-directed autonomous navigation in large, dynamic outdoor environments: settings where robots must maintain semantic awareness and plan robustly under open-vocabulary queries. Existing methods are limited by static maps, shallow semantic understanding, or poor real-world generalization, and most vision-and-language navigation (VLN) benchmarks focus on structured indoor tasks with stepwise instructions. CausalNav closes these gaps by fusing open-vocabulary perception, dynamic memory graphs, and hierarchical reasoning for robust, long-horizon navigation directed by natural language.
Figure 1: The overall workflow of CausalNav, integrating open-vocabulary semantic reasoning, dynamic environment adaptation, and embodied graph-based planning for robust navigation in complex outdoor environments.
Core Methodology
Embodied Graph Construction and Updating
CausalNav employs an Embodied Graph—a multi-level semantic scene graph constructed via LLMs and augmented with Retrieval-Augmented Generation (RAG). The graph hierarchically fuses coarse-grained environmental data (e.g., buildings) and fine-grained detections (e.g., hydrants), dynamically updated as the agent navigates. Key components of the architecture include:
- Open-Vocabulary Object Tracking: Uses YOLO-World for 2D detection, ByteTrack for temporal association, and RGB-LiDAR fusion for 3D bounding box estimation.
- Ego-Motion Estimation: Leverages LiDAR-inertial odometry for accurate trajectory and node updates.
- Dynamic Object Filtering: Incorporates BEV-based CenterPoint detection, motion estimation, and a spatial-temporal corridor representation to keep mobile entities from polluting the graph (see Figure 2; a minimal corridor check is sketched after this list).
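The paper's exact corridor construction is not reproduced in this summary; the sketch below illustrates the idea under simple assumptions: a tracked object's recent, time-stamped 3D boxes are unioned into a corridor, and candidate landmarks falling inside it are rejected as dynamic. All names (`Box3D`, `in_corridor`) and the margin/horizon values are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """One tracked 3D detection: axis-aligned box plus timestamp (illustrative)."""
    cx: float  # box center x (m)
    cy: float  # box center y (m)
    cz: float  # box center z (m)
    dx: float  # box extent x (m)
    dy: float  # box extent y (m)
    dz: float  # box extent z (m)
    t: float   # detection time (s)

def inside(b: Box3D, x: float, y: float, z: float, margin: float = 0.5) -> bool:
    """Point-in-box test with an inflation margin."""
    return (abs(x - b.cx) <= b.dx / 2 + margin
            and abs(y - b.cy) <= b.dy / 2 + margin
            and abs(z - b.cz) <= b.dz / 2 + margin)

def in_corridor(track: list[Box3D], x: float, y: float, z: float,
                now: float, horizon: float = 5.0) -> bool:
    """Treat the union of a track's recent boxes as its spatial-temporal
    corridor; a candidate landmark inside it is flagged as dynamic and
    excluded from the embodied graph."""
    return any(inside(b, x, y, z) for b in track if now - b.t <= horizon)
```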
Figure 2: Example of multi-timestep 3D bounding boxes for a dynamic vehicle within its spatial-temporal corridor, facilitating dynamic entity filtering.
Figure 3: The CausalNav framework: (1) open-vocabulary object tracking and ego-motion estimation, (2) dynamic object filtering and embodied graph construction, and (3) graph updating and language-guided planning.
Figure 4: Simulation environment and constructed Embodied Graph, illustrating the fusion of object-level and building-level nodes for multi-scale semantic representation.
Online updates to the graph are carried out in real time, capturing both structural (map-derived) and event-driven (perceptual) changes within a temporal window. This keeps the memory consistent and current for semantic retrieval and planning.
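The update rules themselves are not specified in this summary; the following is a minimal sketch of an event-driven node store with a temporal window, assuming each perceptual event carries a label, a position, an embedding, and a timestamp. The class and window length are hypothetical; structural (map-derived) layers such as buildings would sit above this store and would not be pruned.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                       # open-vocabulary class, e.g. "hydrant"
    pos: tuple[float, float, float]  # world position (m)
    embedding: list[float]           # text/vision embedding for retrieval
    last_seen: float = field(default_factory=time.time)

class EmbodiedGraph:
    """Flat perceptual node store with event-driven updates inside a temporal
    window; the hierarchy (clusters, buildings) is layered on top of it."""

    def __init__(self, window_s: float = 300.0):
        self.window_s = window_s
        self.nodes: dict[int, Node] = {}

    def upsert(self, node_id: int, node: Node) -> None:
        """Event-driven update: refresh an existing node or add a new one."""
        node.last_seen = time.time()
        self.nodes[node_id] = node

    def prune(self) -> None:
        """Drop nodes not re-observed within the temporal window, so stale
        memory cannot pollute retrieval or planning."""
        cutoff = time.time() - self.window_s
        self.nodes = {i: n for i, n in self.nodes.items()
                      if n.last_seen >= cutoff}
```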
Hierarchical Semantic Clustering and Retrieval
After dynamic filtering, the static graph supports hierarchical clustering: object nodes are aggregated into clusters based on spatial-semantic similarity, and clusters are further grouped under buildings. Node similarity blends spatial proximity with representation similarity (embedding cosine similarity), making the system resilient to LLM naming aliases and to variation in open-vocabulary instructions.
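As one concrete reading of this step, the sketch below blends a Gaussian kernel on distance with embedding cosine similarity and clusters greedily; the equal weighting, kernel width, threshold, and greedy scheme are all assumptions rather than the paper's algorithm.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two embeddings."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def similarity(p1, e1, p2, e2, sigma: float = 10.0) -> float:
    """Blend spatial proximity (Gaussian kernel on distance, sigma in meters)
    with embedding cosine similarity; the equal weighting is an assumption."""
    d = math.dist(p1, p2)
    return 0.5 * math.exp(-(d / sigma) ** 2) + 0.5 * cosine(e1, e2)

def greedy_cluster(items, thresh: float = 0.6):
    """Assign each (pos, embedding) item to the first cluster whose seed is
    similar enough, otherwise start a new cluster."""
    clusters: list[list] = []
    for pos, emb in items:
        for c in clusters:
            seed_pos, seed_emb = c[0]
            if similarity(pos, emb, seed_pos, seed_emb) >= thresh:
                c.append((pos, emb))
                break
        else:
            clusters.append([(pos, emb)])
    return clusters
```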
Hierarchical retrieval over the Embodied Graph is realized through an LLM-based probability model, allowing flexible and accurate query resolution that simultaneously considers semantic and spatial factors through tunable fusion parameters (α, β, γ).
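The exact probability model is not reproduced in this summary; one plausible sketch treats α and β as semantic/spatial fusion weights and γ as a softmax sharpness, which is consistent with the ablation values reported below but remains an assumption about how the parameters enter the model.

```python
import math

def retrieval_scores(sem_sims, spat_sims, alpha=0.5, beta=0.5, gamma=1.5):
    """Fuse per-node semantic and spatial similarities into a probability
    distribution over candidate nodes. gamma > 1 sharpens the softmax toward
    the best match (how the paper applies gamma exactly is an assumption)."""
    fused = [alpha * s + beta * p for s, p in zip(sem_sims, spat_sims)]
    exps = [math.exp(gamma * f) for f in fused]
    z = sum(exps)
    return [e / z for e in exps]

# e.g. three candidate nodes for "the red hydrant near the library"
probs = retrieval_scores(sem_sims=[0.9, 0.4, 0.2], spat_sims=[0.7, 0.8, 0.1])
```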
Language-Guided Navigation and Hierarchical Planning
Given a natural language instruction, CausalNav performs global semantic retrieval over the Embodied Graph to localize the target. Where needed, offline map data or historical trajectories supply a coarse global route; local planning then combines a real-time RH-Map with obstacle-aware Informed RRT*, followed by trajectory smoothing and NMPC-CBF tracking for safe, dynamically feasible execution under evolving real-world constraints.
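The NMPC-CBF formulation is not detailed in this summary; as a reminder of the mechanism, the sketch below checks the standard discrete-time control barrier condition h(x_{k+1}) ≥ (1 − η)·h(x_k) along a candidate trajectory, with h(x) taken as the signed distance to the nearest obstacle minus a safety margin. The barrier choice and decay rate η are assumptions, not the paper's exact constraint.

```python
import math

def barrier(pos, obstacles, margin: float = 0.5) -> float:
    """h(x) >= 0 means safe: distance to the nearest obstacle minus a margin (m)."""
    return min(math.dist(pos, o) for o in obstacles) - margin

def cbf_satisfied(traj, obstacles, eta: float = 0.3) -> bool:
    """Discrete CBF condition h(x_{k+1}) >= (1 - eta) * h(x_k): the barrier may
    decay only gradually, so a feasible trajectory cannot jump into collision."""
    h = [barrier(p, obstacles) for p in traj]
    return h[0] >= 0 and all(h[k + 1] >= (1 - eta) * h[k]
                             for k in range(len(h) - 1))

# True only if the barrier never decays faster than eta per step
ok = cbf_satisfied(traj=[(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
                   obstacles=[(2.0, 1.0)])
```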
Experimental Evaluation
Simulation Results
CausalNav is evaluated against NoMaD, ViNT, GNM, and CityWalker on urban-scale Gazebo environments with synthetic dynamics. Four main metrics are reported: Success Rate (SR), Success weighted by Path Length (SPL), Collision Count (CC), and Trajectory Length (TL). Results indicate:
- CausalNav achieves 100% SR in short-range and 92% in medium-range tasks, outperforming baselines, and exhibits significantly reduced collisions (CC) and path lengths (TL) in dynamic scenes.
- CityWalker achieves similar SR and SPL but suffers notably higher CC, mainly due to inferior handling of moving obstacles.
- The dynamic graph update mechanism is pivotal: removing online updates leads to a 12% drop in SR and marked increases in CC and TL.
Ablation studies on retrieval parameters confirm optimal performance at α=β=0.5, γ=1.5, suggesting balanced spatial-semantic fusion and LLM query sharpness (Figure 5).
Figure 5: Ablation study of semantic retrieval parameters (α, β, γ) on accuracy, recall, and trajectory length.
CausalNav attains real-time performance (10 Hz) with only modest computational overhead versus simpler learning-based policies.
Real-World Deployment
The system is validated on a mobile robot deployed across a large campus, equipped with a RealSense camera, 3D LiDAR, and RTK-GNSS/INS fusion (Figure 6). Evaluations cover both short-range (130 m) and long-range (512 m) tasks:
- Only CausalNav completes long-range, building-level navigation in dynamic real-world conditions—competing methods consistently fail due to collision or environmental mismatch.
- CityWalker, despite strong simulation results, underperforms outdoors due to poor generalization to changing lighting and dynamic obstacles.
Figure 6: Mobile robot platform used for real-world navigation experiments.
Figure 7: Real-world evaluation on different distance scales; only CausalNav achieves successful long-range navigation without collisions.
Additional Insights
- CausalNav operates robustly with edge-deployed, open-source LLMs, showing only a modest performance drop versus GPT-4o. This demonstrates practical viability without dependence on external APIs, with privacy and latency advantages.
- Dynamic, hierarchical updates mitigate error accumulation from outdated memory, which is critical for consistent, safe navigation.
Implications and Future Directions
CausalNav represents a transition from static, brittle navigation policies toward semantic memory-driven, adaptive autonomy. The explicit handling of open-vocabulary commands, multi-granularity environment abstraction, and continual memory updates bridge the gap between language-driven reasoning and real-world robotic deployment.
Practical implications include:
- Scalable deployment in large, unpredictable outdoor environments (e.g., urban logistics, campus shuttles) where semantic and linguistic flexibility is required.
- Edge-based semantic graph and retrieval paradigms provide efficient alternatives to costly cloud models.
- The architecture’s memory modularity enables principled lifelong learning extensions.
Theoretically, this work extends RAG paradigms and embodied memory, blending low-level perception, LLM reasoning, and hierarchical planning, and opening avenues toward robust, generalized robot cognition.
Planned extensions include enhanced graph compression/memory mechanisms, richer multimodal (e.g., weather, illumination) fusion for further robustness, and exploration of long-horizon lifelong autonomy frameworks.
Conclusion
CausalNav delivers a unified framework for semantic, language-driven navigation in large-scale, dynamic outdoor environments. By leveraging a continually updated, multi-level Embodied Graph and integrating LLM-based semantic reasoning with robust perception and planning, the system achieves state-of-the-art results in both simulation and challenging real-world deployments. This work provides a blueprint for next-generation embodied navigation methodologies, emphasizing semantic flexibility, adaptive memory, and robust execution in complex, dynamic scenes.