
DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation

Published 14 Apr 2026 in cs.RO | (2604.12486v1)

Abstract: Long-horizon collaborative vision-language navigation (VLN) is critical for multi-robot systems to accomplish complex tasks beyond the capability of a single agent. CoNavBench takes a first step by introducing the first collaborative long-horizon VLN benchmark with relay-style multi-robot tasks, a collaboration taxonomy, along with graph-grounded generation and evaluation to model handoffs and rendezvous in shared environments. However, existing benchmarks and evaluations often do not enforce strictly synchronized dual-robot rollout on a shared world timeline, and they typically rely on static coordination policies that cannot adapt when new cross-agent evidence emerges. We present Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation (DeCoNav), a decentralized framework that couples event-triggered dialogue with dynamic task allocation and replanning for real-time, adaptive coordination. In DeCoNav, robots exchange compact semantic states via dialogue without a central controller. When informative events such as new evidence, uncertainty, or conflicts arise, dialogue is triggered to dynamically reassign subgoals and replan under synchronized execution. Implemented in DeCoNavBench with 1,213 tasks across 176 HM3D scenes, DeCoNav improves the both-success rate (BSR) by 69.2%, demonstrating the effectiveness of dialogue-driven, dynamically reallocated planning for multi-robot collaboration.

Summary

  • The paper introduces DeCoNav, a decentralized, dialogue-triggered VLN system that enhances long-horizon multi-robot navigation.
  • It employs event-driven dialogue and synchronized parallel execution to adaptively reassign roles and reduce combined path lengths.
  • Empirical results demonstrate an 11-point success rate improvement and robust performance in both simulation and real-world deployments.

Dialog-Enhanced Long-Horizon Collaborative Vision-Language Navigation with DeCoNav

Introduction

Long-horizon collaborative Vision-Language Navigation (VLN) tests whether multi-robot systems can interpret natural language, synchronize under imperfect information, and adapt to discoveries made during execution in real-world environments. The paper "DeCoNav: Dialog enhanced Long-Horizon Collaborative Vision-Language Navigation" (2604.12486) introduces DeCoNav, a decentralized, event-driven collaborative VLN system, alongside DeCoNavBench, an evaluation protocol with rigorous semantic verification. The work targets two gaps in existing benchmarks: (i) the absence of truly synchronized dual-robot rollouts evaluated on a shared timeline and (ii) over-reliance on static coordination protocols that preclude adaptive online role reassignment.

Problem Formulation and Benchmark Construction

DeCoNavBench extends CoNavBench, increasing the number of HM3D scenes from 128 to 176 and annotated tasks from 992 to 1,213, and more than doubling the mean per-robot path length (20.6 m versus 9.7 m). Each collaborative task is a two-robot relay: Robot 1 follows a start→pickup→handoff trajectory, while Robot 2 follows start→handoff→delivery. Both robots execute in synchronized steps and exchange information through event-triggered dialogue.

Figure 1: Overview of DeCoNav's architecture, showing verified episode generation and decentralized event-driven coordination via the SVB, EDR, and SPE modules.
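
The relay structure described above can be sketched as a small data structure. This is only an illustration: the `RelayTask` class, its field names, and the straight-line `path_length` helper are hypothetical and not taken from the paper, which grounds its tasks in scene graphs rather than in Euclidean distance.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # 2D position in metres (illustrative simplification)


@dataclass
class RelayTask:
    """Hypothetical container for one two-robot relay episode."""
    r1_start: Point
    r2_start: Point
    pickup: Point
    handoff: Point
    delivery: Point

    def robot1_route(self) -> List[Point]:
        # Robot 1: start -> pickup -> handoff
        return [self.r1_start, self.pickup, self.handoff]

    def robot2_route(self) -> List[Point]:
        # Robot 2: start -> handoff -> delivery
        return [self.r2_start, self.handoff, self.delivery]


def path_length(route: List[Point]) -> float:
    """Sum of straight-line segment lengths (an assumption, not geodesic distance)."""
    return sum(math.dist(a, b) for a, b in zip(route, route[1:]))


task = RelayTask(r1_start=(0, 0), r2_start=(6, 11),
                 pickup=(3, 4), handoff=(6, 8), delivery=(6, 0))
per_robot = [path_length(task.robot1_route()), path_length(task.robot2_route())]
mean_per_robot = sum(per_robot) / len(per_robot)
```

The handoff waypoint appears in both routes, which is what makes the two trajectories a relay rather than two independent navigation episodes.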

The ROVE pipeline underpins DeCoNavBench, providing high-fidelity semantic grounding through cascaded verification at the episode and subgoal levels. Notably, RTSA (Room-Type Semantic Alignment) combines rule-driven algorithms, dual-VLM classification (GPT-5.2 and Qwen3-VL-235B), and human adjudication to label all rooms with 100% verified correctness, eliminating the annotation ambiguity present in prior datasets. TriGate target verification ensures that navigational cues pass three checks: semantic visibility, region consistency, and visual recognizability (EVA-CLIP finetuned on HM3D).

Figure 2: TriGate verification pipeline with multi-stage semantic and visual gates for ensuring waypoint validity.
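
A minimal sketch of how such cascaded checks might be composed is shown below. The summary does not say how RTSA combines its three label sources or how TriGate scores recognizability, so the agreement rule, the function names, and the `min_recognizability` threshold are all assumptions made for illustration.

```python
from typing import Callable, Optional


def rtsa_label(rule_label: Optional[str],
               vlm_a_label: str,
               vlm_b_label: str,
               human_adjudicate: Callable[[], str]) -> str:
    """Accept a room label only when all automatic sources agree (assumed rule).

    Any disagreement is escalated to a human adjudicator callback.
    """
    votes = {vlm_a_label, vlm_b_label}
    if rule_label is not None:  # the rule-based labeler may abstain
        votes.add(rule_label)
    if len(votes) == 1:
        return votes.pop()
    return human_adjudicate()


def trigate_pass(semantically_visible: bool,
                 region_consistent: bool,
                 recognizability: float,
                 min_recognizability: float = 0.5) -> bool:
    """A target is kept only if all three gates pass.

    The 0.5 threshold is a placeholder, not a value from the paper.
    """
    return (semantically_visible
            and region_consistent
            and recognizability >= min_recognizability)


agreed = rtsa_label("kitchen", "kitchen", "kitchen", lambda: "unused")
escalated = rtsa_label(None, "kitchen", "dining room", lambda: "kitchen")
```

Treating the gates as a conjunction means a target that is visually recognizable but located in the wrong region is still rejected, which matches the summary's description of cues having to pass every check.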

Method: Decentralized, Dialogue-Triggered Collaboration

DeCoNav formalizes decentralized synchronized collaboration with three core modules:

  • Semantic Visual Bus (SVB): Each robot encodes local semantic context (room type, objects, task stage, timestamp) into compact state packets. Only semantic packets, not high-dimensional raw observations or policy features, are transmitted asynchronously, conforming to realistic bandwidth and deployment constraints.
  • Event-driven Dialogue Replanning (EDR): Instead of continuous communication, EDR triggers dialogue only under semantically informative events: new evidence (e.g., early goal discovery), state conflicts, stagnation, or subtask milestones (e.g., handoff opportunity). On trigger, robots exchange their semantic memories, update goal allocation, and potentially reassign subtasks online.
  • Synchronous Parallel Execution (SPE): Dual-robot rollout proceeds under a strict shared world clock, ensuring causal consistency in both simulation and real deployment and eliminating the asynchronous evaluation artifacts typical of prior work.

Figure 3: Dynamic subtask reassignment via EDR, reducing combined path lengths by leveraging event-triggered dialogue.
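
To make the SVB and EDR descriptions more concrete, the following sketch shows one possible shape for a semantic state packet and a trigger check. The packet fields follow the list above (room type, objects, task stage, timestamp), but the class name, the JSON encoding, the stage strings, the conflict rule, and the stagnation limit are hypothetical choices rather than the paper's design.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SemanticPacket:
    """Compact semantic state; no raw images or policy features."""
    robot_id: str
    room_type: str
    objects: Tuple[str, ...]
    task_stage: str
    timestamp: int  # tick of the shared world clock

    def encode(self) -> bytes:
        # JSON is an illustrative wire format, not the one used in the paper.
        return json.dumps(asdict(self)).encode("utf-8")


def edr_events(mine: SemanticPacket,
               peer: Optional[SemanticPacket],
               goal_object: str,
               steps_without_progress: int,
               stagnation_limit: int = 20) -> List[str]:
    """Return the events that would trigger a dialogue round (assumed rules)."""
    events = []
    if goal_object in mine.objects:
        events.append("new_evidence")    # e.g. early goal discovery
    if peer is not None and peer.task_stage == mine.task_stage:
        events.append("state_conflict")  # both robots claim the same stage
    if steps_without_progress >= stagnation_limit:
        events.append("stagnation")
    if mine.task_stage == "at_handoff":
        events.append("milestone")       # handoff opportunity
    return events


packet = SemanticPacket("r1", "kitchen", ("mug", "sink"), "searching_pickup", 42)
events = edr_events(packet, None, goal_object="mug", steps_without_progress=0)
```

An empty event list means the robots stay silent for that step, which is the point of event-triggered rather than continuous communication.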

This closed-loop protocol obviates the need for a central coordinator, supports robust recovery, and adapts to emergent collaboration requirements in deployment.
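
The shared world clock in SPE can be illustrated with a lockstep loop in which every active robot chooses its action for a tick before any action is applied. This is a simplified sketch under assumed interfaces: the `run_synchronized` function, the policy signature, and the `"stop"` action are hypothetical and stand in for the real simulator and navigation policies.

```python
from typing import Callable, Dict, List, Tuple

Policy = Callable[[int], str]  # maps the current tick to an action name


def run_synchronized(policies: Dict[str, Policy],
                     max_steps: int) -> List[Tuple[int, str, str]]:
    """Advance all robots one action per tick of a single shared clock."""
    log: List[Tuple[int, str, str]] = []
    done = {name: False for name in policies}
    for tick in range(max_steps):
        # Phase 1: every active robot decides from the state at the start of the tick.
        actions = {name: policy(tick)
                   for name, policy in policies.items() if not done[name]}
        # Phase 2: all chosen actions are applied before the clock advances.
        for name, action in actions.items():
            log.append((tick, name, action))
            if action == "stop":
                done[name] = True
        if all(done.values()):
            break
    return log


rollout = run_synchronized(
    {"r1": lambda t: "stop" if t >= 2 else "forward",
     "r2": lambda t: "stop" if t >= 3 else "forward"},
    max_steps=10,
)
```

Because both robots are logged against the same tick counter, events such as a handoff can be ordered unambiguously, which is what rules out the asynchronous evaluation artifacts mentioned above.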

Empirical Evaluation

Benchmark Analysis

Evaluation on DeCoNavBench supports two headline results:

  • DeCoNav achieves an absolute Success Rate (SR) gain of 11 percentage points over the CoNavBench baseline (0.39 vs 0.28; relative +39.3%), and a Both-Success Rate (BSR) increase from 0.13 to 0.22 (+69.2%).
  • Room annotation correctness is improved to 100% via RTSA, removing the confound of semantic mislabeling plaguing prior platforms.
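
The absolute and relative gains quoted above follow directly from the reported rates. The short check below reproduces the arithmetic; the `gains` helper is introduced here purely for illustration.

```python
def gains(baseline: float, new: float):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (new - baseline) * 100
    relative_percent = (new - baseline) / baseline * 100
    return round(absolute_points, 1), round(relative_percent, 1)


sr_gain = gains(0.28, 0.39)   # Success Rate: CoNavBench baseline vs DeCoNav
bsr_gain = gains(0.13, 0.22)  # Both-Success Rate
print(sr_gain, bsr_gain)      # (11.0, 39.3) (9.0, 69.2)
```

The 69.2% figure is therefore a relative improvement on a low baseline, corresponding to 9 percentage points of BSR in absolute terms.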

Ablation reveals that dynamic dialogue-semantic updates (SVB+EDR enabled) yield a 7-point relative SR improvement versus static protocols, with both coordination quality and sample efficiency benefiting from online adaptive role reallocations.

Real-Robot Deployment

Deployment on dual Unitree humanoids in unstructured offices demonstrates transferability beyond simulation. Robots execute decomposed relay tasks under mapless exploration and ROS2-mediated semantic communication. When one robot encounters a dynamic obstacle (a locked corridor), EDR enables the robots to negotiate a subtask reassignment, yielding reduced path costs and robust task completion without human intervention.

Figure 4: Real-robot demonstration where DeCoNav’s EDR enables subtask swapping upon environment change to minimize combined path length.
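
The subtask swap in this demonstration can be read as choosing the assignment with the smaller combined path length once new cost estimates are available. The sketch below is a two-robot toy version: the `best_assignment` function, the subtask names, and all cost values are invented for illustration and are not measurements from the paper.

```python
from typing import Dict, Tuple

Costs = Dict[str, Dict[str, float]]  # costs[robot][subtask] = estimated path length


def best_assignment(costs: Costs) -> Tuple[str, float]:
    """Compare keeping the current roles against swapping them."""
    keep = costs["r1"]["pickup"] + costs["r2"]["delivery"]
    swap = costs["r1"]["delivery"] + costs["r2"]["pickup"]
    return ("swap", swap) if swap < keep else ("keep", keep)


# Hypothetical estimates before any obstacle is observed.
before = {"r1": {"pickup": 5.0, "delivery": 9.0},
          "r2": {"pickup": 8.0, "delivery": 6.0}}

# Robot 1 finds its corridor locked, so its detour to the pickup grows.
after = {"r1": {"pickup": 30.0, "delivery": 9.0},
         "r2": {"pickup": 8.0, "delivery": 6.0}}

decision_before = best_assignment(before)
decision_after = best_assignment(after)
```

In this toy case the roles are kept while the corridor is open and swapped once the detour makes the original assignment more expensive, mirroring the behavior described for EDR.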

Implications and Future Directions

DeCoNav demonstrates that event-triggered communication based on semantic abstraction enables scalable, adaptive, and robust multi-agent VLN coordination. The system exposes the limitations of prior static policies and unsynchronized evaluations, highlighting that truly collaborative multi-robot navigation requires both closed-loop real-time adaptation and rigorous benchmark protocols free from semantic and temporal confounds.

The confluence of strict synchronization, robust room/object annotation, and dialogue-driven coordination opens new frontiers for:

  • Evaluating the scalability and compositionality of decentralized multi-agent VLN systems (e.g., scaling to n > 2 robots).
  • Extending the SVB protocol for hierarchical or memory-augmented state exchange.
  • Integration with hierarchical VLMs and navigation foundation models supporting lifelong adaptation.
  • Investigating the open problem of minimizing communication cost versus coordination utility under stricter bandwidth or adversarial constraints.

Conclusion

DeCoNav and DeCoNavBench together instantiate a formal methodology and validated protocol for event-driven, decentralized, and rigorously evaluated collaborative VLN. Through architectural innovation (SVB, EDR, SPE), strict benchmark verification (RTSA, TriGate), and strong empirical performance under both simulation and hardware, DeCoNav establishes new standards for evaluating and deploying resilient multi-robot navigation in semantically complex environments.
