- The paper introduces DeCoNav, a decentralized, dialogue-triggered VLN system that enhances long-horizon multi-robot navigation.
- It employs event-driven dialogue and synchronized parallel execution to adaptively reassign roles and reduce combined path lengths.
- Empirical results demonstrate an 11-point success rate improvement and robust performance in both simulation and real-world deployments.
Dialog-Enhanced Long-Horizon Collaborative Vision-Language Navigation with DeCoNav
Introduction
Collaborative Vision-Language Navigation (VLN) for long-horizon tasks tests the ability of multi-robot systems to interpret natural language, synchronize under imperfect information, and adapt as new information is discovered in real-world environments. The paper "DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation" (2604.12486) introduces DeCoNav, a decentralized and event-driven collaborative VLN system, alongside DeCoNavBench, an evaluation protocol with rigorous semantic verification. The approach targets two limitations of existing benchmarks: (i) the absence of truly synchronized dual-robot rollouts evaluated on a shared timeline and (ii) over-reliance on static coordination protocols that preclude adaptive online role reassignment.
DeCoNavBench is positioned as an extension of CoNavBench. It scales up both the number of HM3D scenes (from 128 to 176) and the number of annotated tasks (from 992 to 1,213), and more than doubles the mean per-robot path length (20.6 m compared to 9.7 m). Each collaborative task involves two robots executing a relay-style navigation: Robot 1 follows a start→pickup→handoff trajectory, while Robot 2 performs start→handoff→delivery, with synchronized stepwise execution and event-triggered, dialogue-mediated information exchange.
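The relay structure and shared timeline described above can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: the `RobotPlan` class and `run_synchronized` function are hypothetical names, each "step" stands in for reaching one subgoal rather than one low-level action, and the sketch does not model waiting at the handoff point or any dialogue.

```python
from dataclasses import dataclass


@dataclass
class RobotPlan:
    """Hypothetical container for one robot's ordered subgoals."""
    name: str
    subgoals: list
    index: int = 0  # position of the next subgoal to reach

    def done(self):
        return self.index >= len(self.subgoals)

    def step(self):
        """Advance by one subgoal; returns the subgoal reached, or None if finished."""
        if self.done():
            return None
        reached = self.subgoals[self.index]
        self.index += 1
        return reached


def run_synchronized(plans, max_ticks=100):
    """Step every robot exactly once per tick so both share one timeline."""
    timeline = []
    for tick in range(max_ticks):
        if all(p.done() for p in plans):
            break
        timeline.append((tick, [p.step() for p in plans]))
    return timeline


robot1 = RobotPlan("robot1", ["start", "pickup", "handoff"])
robot2 = RobotPlan("robot2", ["start", "handoff", "delivery"])
timeline = run_synchronized([robot1, robot2])
for tick, reached in timeline:
    print(tick, reached)
```

Recording both robots' progress against the same tick counter is what allows a rollout to be evaluated on a shared timeline rather than as two independent episodes.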
Figure 1: Overview of DeCoNav's architecture, showing verified episode generation and decentralized event-driven coordination via the SVB, EDR, and SPE modules.
The ROVE pipeline underpins DeCoNavBench, providing high-fidelity semantic grounding through cascaded verification at the episode and subgoal level. Notably, RTSA (Room-Type Semantic Alignment) combines rule-driven algorithms, dual-VLM classification (GPT-5.2 and Qwen3-VL-235B), and human adjudication to label all rooms with 100% verified correctness, which eliminates the annotation ambiguity present in prior datasets. TriGate target verification ensures that navigational cues pass three checks: semantic visibility, region consistency, and visual recognizability (EVA-CLIP finetuned on HM3D).
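One way to picture how rule-based labels, two VLM labels, and human adjudication could be combined is the sketch below. The agreement rule used here (accept automatically only when all three sources agree, otherwise escalate to a human) is an assumption made for illustration; the source does not state how RTSA merges its signals, and `resolve_room_label` and `adjudicate` are hypothetical names.

```python
def resolve_room_label(rule_label, vlm_a_label, vlm_b_label, adjudicate):
    """Return (label, source) for one room.

    Assumed policy: unanimous agreement is accepted automatically;
    any disagreement or missing label is passed to a human adjudicator.
    """
    votes = [rule_label, vlm_a_label, vlm_b_label]
    if None not in votes and len(set(votes)) == 1:
        return votes[0], "auto"
    return adjudicate(votes), "human"


# Stand-in for a human reviewer: a fixed answer, for demonstration only.
def demo_adjudicator(votes):
    return "kitchen"


agreed = resolve_room_label("bedroom", "bedroom", "bedroom", demo_adjudicator)
disputed = resolve_room_label("kitchen", "dining room", "kitchen", demo_adjudicator)
print(agreed)    # ('bedroom', 'auto')
print(disputed)  # ('kitchen', 'human')
```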
Figure 2: TriGate verification pipeline with multi-stage semantic and visual gates for ensuring waypoint validity.
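The three TriGate checks can be read as a cascade in which a candidate target is rejected at the first gate it fails. The sketch below illustrates that reading only. The `Candidate` fields, the `tri_gate` function, and the 0.5 recognition threshold are hypothetical; in particular, `recognition_score` stands in for whatever score the visual model produces.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one candidate navigation target."""
    target_label: str
    visible_labels: set        # semantic labels visible from the waypoint
    waypoint_region: str       # room containing the waypoint
    target_region: str         # room containing the target
    recognition_score: float   # assumed image-text similarity in [0, 1]


def tri_gate(c, min_score=0.5):
    """Run the three gates in order; return (passed, name_of_failed_gate)."""
    gates = [
        ("semantic_visibility", c.target_label in c.visible_labels),
        ("region_consistency", c.waypoint_region == c.target_region),
        ("visual_recognizability", c.recognition_score >= min_score),
    ]
    for name, ok in gates:
        if not ok:
            return False, name
    return True, None


good = Candidate("sofa", {"sofa", "lamp"}, "living room", "living room", 0.82)
wrong_room = Candidate("sofa", {"sofa"}, "hallway", "living room", 0.82)
print(tri_gate(good))        # (True, None)
print(tri_gate(wrong_room))  # (False, 'region_consistency')
```

Returning the name of the failed gate makes it easy to audit why a waypoint was dropped during episode generation.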
Method: Decentralized, Dialogue-Triggered Collaboration
DeCoNav formalizes decentralized synchronized collaboration with three core modules: SVB, EDR, and SPE.
This closed-loop protocol obviates the need for a central coordinator, supports robust recovery, and adapts to emergent collaboration requirements in deployment.
Empirical Evaluation
Benchmark Analysis
Direct evaluation on DeCoNavBench yields two headline results:
- DeCoNav achieves an absolute Success Rate (SR) gain of 11 percentage points over the CoNavBench baseline (0.39 vs 0.28; relative +39.3%), and a Both-Success Rate (BSR) increase from 0.13 to 0.22 (+69.2%).
- Room annotation correctness is improved to 100% via RTSA, removing the confound of semantic mislabeling plaguing prior platforms.
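The absolute and relative gains quoted above follow directly from the reported rates. The short check below reproduces the arithmetic; the `gains` helper is a hypothetical name introduced only for this calculation.

```python
def gains(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = round((improved - baseline) * 100, 1)
    relative_percent = round((improved - baseline) / baseline * 100, 1)
    return absolute_points, relative_percent


sr = gains(0.28, 0.39)   # Success Rate
bsr = gains(0.13, 0.22)  # Both-Success Rate
print(sr)   # (11.0, 39.3)
print(bsr)  # (9.0, 69.2)
```

The distinction matters when reading the results: an 11-point absolute gain on a 0.28 baseline is a 39.3% relative gain, while a 9-point absolute gain on the lower 0.13 BSR baseline is a much larger 69.2% relative gain.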
Ablation reveals that dynamic dialogue-semantic updates (SVB+EDR enabled) yield a 7-point relative SR improvement versus static protocols, with both coordination quality and sample efficiency benefiting from online adaptive role reallocations.
Real-Robot Deployment
Deployment on dual Unitree humanoids in unstructured offices demonstrates transferability beyond simulation. Robots execute decomposed relay tasks under mapless exploration and ROS2-mediated semantic communication. When one robot encounters a dynamic obstacle (locked corridor), EDR enables negotiation for subtask reassignment, yielding reduced path costs and robust task completion without human intervention.
Figure 4: Real-robot demonstration where DeCoNav’s EDR enables subtask swapping upon environment change to minimize combined path length.
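The subtask swap can be illustrated as a small assignment problem: given each robot's estimated path length to each subtask, choose the pairing with the smallest combined length. The sketch below is an assumption about the decision rule, not the paper's EDR implementation, and all path lengths in it are made-up values chosen for illustration. `best_assignment` is a hypothetical name.

```python
from itertools import permutations


def best_assignment(costs):
    """Pick the robot-to-subtask pairing with the lowest combined path length.

    costs[robot][subtask] is an estimated path length in metres.
    Brute force is fine here because there are only two robots.
    """
    robots = list(costs)
    subtasks = list(next(iter(costs.values())))
    best, best_total = None, float("inf")
    for perm in permutations(subtasks, len(robots)):
        total = sum(costs[r][s] for r, s in zip(robots, perm))
        if total < best_total:
            best, best_total = dict(zip(robots, perm)), total
    return best, best_total


# Hypothetical path lengths before the corridor is found to be locked.
before = {
    "robot1": {"pickup": 8.0, "delivery": 15.0},
    "robot2": {"pickup": 14.0, "delivery": 9.0},
}
# Hypothetical lengths after robot1 must detour around the locked corridor.
after = {
    "robot1": {"pickup": 30.0, "delivery": 15.0},
    "robot2": {"pickup": 14.0, "delivery": 9.0},
}
print(best_assignment(before))  # ({'robot1': 'pickup', 'robot2': 'delivery'}, 17.0)
print(best_assignment(after))   # ({'robot1': 'delivery', 'robot2': 'pickup'}, 29.0)
```

In a decentralized setting each robot would run this comparison on costs exchanged through dialogue, so both arrive at the same pairing without a central coordinator.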
Implications and Future Directions
DeCoNav demonstrates that event-triggered communication based on semantic abstraction enables scalable, adaptive, and robust multi-agent VLN coordination. The system exposes the limitations of prior static policies and unsynchronized evaluations: truly collaborative multi-robot navigation requires both closed-loop real-time adaptation and rigorous benchmark protocols free from semantic and temporal confounds.
The combination of strict synchronization, robust room/object annotation, and dialogue-driven coordination suggests several directions for future work:
- Evaluating the scalability and compositionality of decentralized multi-agent VLN systems (e.g., scaling to n>2 robots).
- Extending the SVB protocol for hierarchical or memory-augmented state exchange.
- Integrating with hierarchical VLMs and navigation foundation models that support lifelong adaptation.
- Investigating the trade-off between communication cost and coordination utility under stricter bandwidth or adversarial constraints.
Conclusion
DeCoNav and DeCoNavBench together provide a formal methodology and a validated protocol for event-driven, decentralized, and rigorously evaluated collaborative VLN. Through architectural innovation (SVB, EDR, SPE), strict benchmark verification (RTSA, TriGate), and strong empirical performance both in simulation and on hardware, DeCoNav sets a higher standard for evaluating and deploying resilient multi-robot navigation in semantically complex environments.