Nav-AdaCoT-2.9M: Adaptive Navigation Dataset

Updated 4 July 2026

The paper introduces Nav-AdaCoT-2.9M as the largest embodied navigation dataset, featuring 2.9M steps and 472K selective adaptive CoT annotations.
It employs automated VLM-based reasoning where explicit CoT is triggered only when needed, resulting in improved success rates across multiple navigation tasks.
The dataset supports multi-task co-training and a hybrid SFT+RL pipeline, enabling robust transfer across Object-Goal, Visual Tracking, and Image-Goal Navigation.

Nav-AdaCoT-2.9M is a large-scale embodied-navigation dataset introduced with VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. It contains $N_{\mathrm{step}}=2.9$ million step-level samples, $N_{\mathrm{cot}}=472{,}000$ adaptive Chain-of-Thought annotations, and approximately $100{,}000$ navigation episodes, a figure consistent with an average episode length of approximately $30$ steps. Within the VLingNav framework, the dataset is described as the largest embodied navigation dataset with reasoning annotations to date. Its purpose is to support training that couples continuous trajectories with adaptive reasoning and linguistic memory, rather than relying only on reactive mappings from observations to actions (Wang et al., 13 Jan 2026).

1. Definition and dataset scope

Nav-AdaCoT-2.9M is organized around three embodied-navigation task categories: Object-Goal Navigation, Embodied Visual Tracking, and Image-Goal Navigation. Object-Goal Navigation is drawn from HM3D ObjNav, MP3D ObjNav, and HM3D OVON; Embodied Visual Tracking is drawn from EVT-Bench; and Image-Goal Navigation is drawn from HM3D Instance ImageNav. The dataset spans HM3D with $718$ scenes, MP3D with $56$ scenes for ObjectNav, EVT-Bench with $703$ scenes, and HM3D Instance with $145$ scenes (Wang et al., 13 Jan 2026).

The task-wise breakdown is approximate: ObjNav accounts for $1.35$M steps, or $46\%$ of the total, while VisualTrack and ImageNav together account for the remaining $\approx 1.55$M steps, slightly more than half of the total. This distribution places the dataset across multiple embodied-navigation regimes rather than a single benchmark family, and a plausible implication is that the dataset was designed to support both multi-task co-training and transfer across task formats (Wang et al., 13 Jan 2026).

Task category	Source benchmarks	Approximate scale
Object-Goal Navigation	HM3D ObjNav, MP3D ObjNav, HM3D OVON	$1.35$M steps ($46\%$)
Embodied Visual Tracking	EVT-Bench	part of the remaining $\approx 1.55$M steps
Image-Goal Navigation	HM3D Instance ImageNav	part of the remaining $\approx 1.55$M steps
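The split between ObjNav and the other two task categories follows from the totals quoted in the text. The short sketch below only redoes that arithmetic; the constant and function names are illustrative, and the remainder is a derived quantity rather than a figure reported by the paper.

```python
# Totals quoted in the text; the remainder is derived here, not reported.
TOTAL_STEPS_M = 2.9    # total step-level samples, in millions
OBJNAV_STEPS_M = 1.35  # Object-Goal Navigation steps, in millions


def remaining_share(total_m: float, part_m: float) -> tuple[float, float]:
    """Return (remaining steps in millions, remaining fraction of the total)."""
    rest = total_m - part_m
    return rest, rest / total_m


objnav_frac = OBJNAV_STEPS_M / TOTAL_STEPS_M
rest_m, rest_frac = remaining_share(TOTAL_STEPS_M, OBJNAV_STEPS_M)

print(f"ObjNav share of steps: {objnav_frac:.1%}")
print(f"Tracking + ImageNav: {rest_m:.2f}M steps ({rest_frac:.1%})")
```

The computed ObjNav share lands between $46\%$ and $47\%$, which agrees with the rounded $46\%$ given in the text.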

Where applicable, standard benchmark splits are retained. HM3D OVON includes val_seen, val_seen_synonyms, and val_unseen; HM3D ObjNav v1/v2, MP3D, and HM3D Instance all have standard seen/unseen splits. This preserves compatibility with established evaluation protocols instead of redefining task partitions for the dataset (Wang et al., 13 Jan 2026).

2. Data sources and collection process

The simulation platform is Habitat-Simulator on HM3D and MP3D reconstructions, EVT-Bench for visual tracking, and HM3D Instance for ImageNav. Expert trajectories are heterogeneous by construction. HM3D ObjNav uses human-recorded demos from Habitat-Web; all other ObjNav and ImageNav episodes use shortest-path planners; and EVT uses annotated multi-person tracking sequences (Wang et al., 13 Jan 2026).

Annotation is fully automated at the reasoning level via a VLM, specifically Qwen2.5-VL. No manual step-by-step CoT labeling was done; human involvement was limited to prompt design and final quality verification. This distinction is important because the scale of $2.9$ million steps and $472{,}000$ adaptive CoT annotations would be difficult to obtain with manual, stepwise reasoning supervision (Wang et al., 13 Jan 2026).

The dataset therefore combines multiple supervision modalities: human demonstrations, planner-generated trajectories, and annotated tracking sequences, with reasoning labels added by an automated VLM pipeline. This suggests that Nav-AdaCoT-2.9M is not merely a replay buffer of trajectories but a structured training resource intended to align action prediction, explicit reasoning, and memory formation within a common data format (Wang et al., 13 Jan 2026).

3. Adaptive Chain-of-Thought annotation methodology

The central annotation principle is Adaptive Chain-of-Thought, or AdaCoT. AdaCoT teaches the model “when to think” and “what to think.” In the labeling pipeline, Qwen2.5-VL-72B is prompted with five inputs: the natural-language instruction, a bounded window of the most recent video frames (to limit computation), the agent’s prior memory represented as <summary> tokens, expert trajectory steps, and formatting constraints consisting of <think_on>/<think_off>, <think> ... </think>, and <summary> ... </summary> (Wang et al., 13 Jan 2026).

For each step $t$, the pipeline first yields an indicator token. It produces <think_on> if explicit reasoning is required at $t$, and <think_off> otherwise. When <think_on> is produced, the model emits a structured reasoning chain inside <think> ... </think> and a concise linguistic memory inside <summary> ... </summary>. The reasoning chain covers spatial perception, task decomposition, loop avoidance, and next-action reasoning; the summary is appended to future context as memory (Wang et al., 13 Jan 2026).
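To make the per-step format concrete, the following is a minimal parsing sketch. It assumes each step's annotation is a single string that starts with the indicator token; that layout, the `StepAnnotation` class, and the `parse_step` and `roll_memory` helpers are all hypothetical and are not taken from the paper or its code.

```python
import re
from dataclasses import dataclass
from typing import Optional

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
SUMMARY_RE = re.compile(r"<summary>(.*?)</summary>", re.DOTALL)


@dataclass
class StepAnnotation:
    think_on: bool
    reasoning: Optional[str]
    summary: Optional[str]


def parse_step(text: str) -> StepAnnotation:
    """Parse one step's annotation string (the string layout is an assumption)."""
    stripped = text.strip()
    if stripped.startswith("<think_off>"):
        return StepAnnotation(False, None, None)
    if not stripped.startswith("<think_on>"):
        raise ValueError("missing <think_on>/<think_off> indicator")
    think = THINK_RE.search(stripped)
    summary = SUMMARY_RE.search(stripped)
    if think is None or summary is None:
        raise ValueError("<think_on> step lacks a <think> or <summary> block")
    return StepAnnotation(True, think.group(1).strip(), summary.group(1).strip())


def roll_memory(annotations: list[StepAnnotation]) -> list[str]:
    """Collect summaries in order, as they would be appended to later context."""
    return [a.summary for a in annotations if a.think_on and a.summary]


example = [
    "<think_off>",
    "<think_on><think>Hallway splits; the kitchen is likely left.</think>"
    "<summary>Checked hallway; heading left.</summary>",
]
parsed = [parse_step(s) for s in example]
memory = roll_memory(parsed)
```

Only steps marked <think_on> contribute a summary, so the memory list grows sparsely along a trajectory, which mirrors the selective reasoning described above.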

The filtering procedure has two stages. First, rule-based validation removes incomplete or ill-formed CoTs. Second, cross-checking ensures that the CoT’s “decision” matches the expert trajectory. In effect, the dataset does not treat any generated reasoning trace as valid supervision; it constrains reasoning to be structurally well formed and behaviorally consistent with the trajectory labels (Wang et al., 13 Jan 2026).
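The two filtering stages can be sketched as a pair of predicates. This is only an illustration under stated assumptions: the source does not specify how a CoT encodes its decision, so the trailing "Decision: <action>" convention and the discrete action labels below are hypothetical simplifications (the dataset itself uses continuous trajectories), as are the function names.

```python
import re

# Stage 1 pattern: indicator, then a non-empty <think> block, then a non-empty <summary>.
REQUIRED = re.compile(
    r"^<think_on>\s*<think>(?P<think>.+?)</think>\s*"
    r"<summary>(?P<summary>.+?)</summary>\s*$",
    re.DOTALL,
)
# Hypothetical convention: the reasoning chain ends with "Decision: <action>".
DECISION = re.compile(r"Decision:\s*(\w+)\s*$")


def is_well_formed(cot: str) -> bool:
    """Stage 1: reject incomplete or ill-formed CoT strings."""
    return REQUIRED.match(cot.strip()) is not None


def matches_expert(cot: str, expert_action: str) -> bool:
    """Stage 2: the stated decision must equal the expert's action label."""
    m = REQUIRED.match(cot.strip())
    if m is None:
        return False
    d = DECISION.search(m.group("think").strip())
    return d is not None and d.group(1).lower() == expert_action.lower()


def filter_cots(samples):
    """Keep (cot, expert_action) pairs that pass both stages."""
    return [(c, a) for c, a in samples if is_well_formed(c) and matches_expert(c, a)]


good = ("<think_on><think>Door ahead is open. Decision: forward</think>"
        "<summary>Passed the door.</summary>")
truncated = "<think_on><think>Door ahead is open. Decision: forward"
mismatch = ("<think_on><think>Wall ahead. Decision: left</think>"
            "<summary>Turned at the wall.</summary>")

kept = filter_cots([(good, "forward"), (truncated, "forward"), (mismatch, "right")])
```

In this toy input only the first sample survives: the second fails the structural check and the third disagrees with the expert label.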

This annotation design differentiates adaptive reasoning from dense reasoning. Rather than requiring a chain of thought at every step, the dataset explicitly represents sparsity in reasoning activation. That design is central to the later empirical claim that over-reasoning hurts performance (Wang et al., 13 Jan 2026).

4. Statistical profile and reasoning sparsity

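The annotation density is defined as

$$\rho_{\mathrm{cot}} = \frac{N_{\mathrm{cot}}}{N_{\mathrm{step}}} = \frac{472{,}000}{2{,}900{,}000} \approx 0.163,$$

that is, approximately $16.3\%$. During inference, the model triggers reasoning at only a small fraction of steps (Wang et al., 13 Jan 2026).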
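The density and the episode estimate can be checked directly from the headline counts given in the introduction. The variable names below are illustrative; the inputs are the rounded figures quoted in the text, so the outputs are approximate.

```python
N_STEP = 2_900_000    # step-level samples
N_COT = 472_000       # adaptive CoT annotations
AVG_EPISODE_LEN = 30  # approximate steps per episode

cot_density = N_COT / N_STEP               # fraction of steps carrying a CoT
steps_per_cot = N_STEP / N_COT             # average spacing between annotations
approx_episodes = N_STEP / AVG_EPISODE_LEN # implied number of episodes

print(f"CoT density: {cot_density:.1%}")           # 16.3%
print(f"Steps per CoT: {steps_per_cot:.1f}")       # 6.1
print(f"Implied episodes: {approx_episodes:,.0f}") # 96,667
```

The implied episode count of roughly 97K is consistent with the "approximately $100{,}000$ episodes" stated for the dataset.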

These numbers matter because the dataset couples relatively sparse reasoning supervision with much denser trajectory supervision. Nav-AdaCoT-2.9M therefore does not frame explicit reasoning as the default mode of operation. Instead, explicit reasoning is a selectively invoked mechanism attached to specific states in a trajectory, while the remainder of navigation can proceed without a generated reasoning trace (Wang et al., 13 Jan 2026).

A frequent misconception in embodied-reasoning settings is that adding more reasoning tokens at more steps should monotonically improve performance. The reported ablations directly contradict that view. On ObjNav, the paper compares success rate (SR) under four settings: no CoT, adaptive CoT, dense CoT at every step, and fixed-interval CoT. Fixed-interval CoT attains its best SR only at a fixed reasoning cost, whereas adaptive CoT attains its SR while invoking reasoning at only a small fraction of steps (Wang et al., 13 Jan 2026).

This pattern suggests that the dataset’s key contribution is not only scale but the representation of conditional reasoning. In other words, the supervision signal encodes a policy over reasoning itself, not just the content of reasoning once invoked (Wang et al., 13 Jan 2026).

5. Integration with supervised fine-tuning and post-training

Within the VLingNav training recipe, Nav-AdaCoT-2.9M is used in Supervised Fine-Tuning with both continuous-trajectory labels and textual labels consisting of CoT and summaries. The SFT loss combines an MSE term on predicted trajectories and a CE term on predicted textual outputs, balanced by a weighting coefficient $\lambda$:

$$\mathcal{L}_{\mathrm{SFT}} = \mathcal{L}_{\mathrm{MSE}} + \lambda\,\mathcal{L}_{\mathrm{CE}}$$

This makes the dataset simultaneously a control-learning resource and a reasoning-supervision resource (Wang et al., 13 Jan 2026).
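As a rough illustration of how a trajectory term and a text term combine into one scalar, the sketch below uses plain Python on toy values. It is not the paper's implementation: the weight `lam`, its placement on the CE term, and the function names are assumptions made for this example.

```python
import math


def mse(pred, target):
    """Mean squared error over flattened trajectory coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cross_entropy(ref_token_probs):
    """Mean negative log-likelihood the model assigns to the reference tokens."""
    return -sum(math.log(p) for p in ref_token_probs) / len(ref_token_probs)


def sft_loss(pred_traj, expert_traj, ref_token_probs, lam=1.0):
    """Weighted sum of trajectory and text terms (weight placement is assumed)."""
    return mse(pred_traj, expert_traj) + lam * cross_entropy(ref_token_probs)


# Toy values: two (x, y) waypoints flattened, and three reference-token probabilities.
pred_traj = [0.5, 0.0, 1.0, 0.2]
expert_traj = [0.4, 0.0, 1.1, 0.0]
token_probs = [0.9, 0.8, 0.95]

traj_term = mse(pred_traj, expert_traj)
text_term = cross_entropy(token_probs)
total = sft_loss(pred_traj, expert_traj, token_probs, lam=1.0)
```

On <think_off> steps the textual target is short, so the trajectory term carries most of the supervision there, while <think_on> steps add a longer text target to the same sample.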

The training recipe then adds an online expert-guided reinforcement learning stage. A hybrid buffer collects experience from both on-policy rollouts and expert-guided recoveries, and the post-training objective is optimized over samples drawn from this buffer.

The accompanying code sketch indicates that unsuccessful rollouts are supplemented by expert.recover(rollout) before policy updates (Wang et al., 13 Jan 2026).
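A toy version of that buffer logic is sketched below. Only the `expert.recover(rollout)` call shape comes from the text; the `Rollout`, `ToyExpert`, and `HybridBuffer` classes and their fields are hypothetical stand-ins, and the expert here simply appends a marker step rather than planning anything.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Rollout:
    steps: list
    success: bool
    source: str = "policy"


class ToyExpert:
    def recover(self, rollout: Rollout) -> Rollout:
        """Hypothetical recovery: return a corrected copy of a failed rollout."""
        return Rollout(steps=rollout.steps + ["expert_fix"], success=True, source="expert")


@dataclass
class HybridBuffer:
    items: list = field(default_factory=list)

    def add(self, rollout: Rollout, expert: ToyExpert) -> None:
        """Store the on-policy rollout; supplement failures with an expert recovery."""
        self.items.append(rollout)
        if not rollout.success:
            self.items.append(expert.recover(rollout))

    def sample(self, k: int, rng: random.Random) -> list:
        """Draw up to k stored rollouts without replacement for a policy update."""
        return rng.sample(self.items, min(k, len(self.items)))


expert = ToyExpert()
buffer = HybridBuffer()
buffer.add(Rollout(steps=["fwd", "fwd"], success=True), expert)
buffer.add(Rollout(steps=["fwd", "left"], success=False), expert)

batch = buffer.sample(2, random.Random(0))
sources = [r.source for r in buffer.items]
```

Keeping the failed rollout alongside its recovery is one possible reading of "supplemented"; a variant that replaces the failure instead would change only the `add` method.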

The dataset is therefore embedded in a hybrid SFT+RL pipeline rather than being restricted to offline imitation learning. This is consequential for interpretation of results: the paper explicitly attributes part of the final performance to an online expert-guided RL stage that enables the model to surpass pure imitation learning and acquire more robust, self-explored navigation behaviors (Wang et al., 13 Jan 2026).

6. Benchmarking, comparisons, and reported impact

In Table 2, Nav-AdaCoT-2.9M is contrasted with Nav-CoT-110K. Nav-CoT-110K has $110$K steps and $110$K CoT annotations, corresponding to $100\%$ CoT density, and uses discrete actions. Nav-AdaCoT-2.9M has $2.9$M steps and $472$K adaptive CoT annotations, corresponding to roughly $16\%$ CoT density, and uses continuous trajectories. The comparison highlights two axes of difference: scale and annotation policy, with the latter shifting from always-on reasoning to adaptive reasoning (Wang et al., 13 Jan 2026).

The end-to-end performance gains reported for VLingNav after moving from SFT to SFT plus RL are described as substantial on the benchmarks named in the paper: SR improves on HM3D v1 ObjNav and on the HM3D OVON unseen split, and SPL improves on ImageNav (Wang et al., 13 Jan 2026).

The paper also reports cross-task synergy: multi-task co-training on Nav-AdaCoT-2.9M outperforms single-task variants and yields emergent zero-shot Image-to-Track and Track-to-ObjNav behaviors. This suggests that the shared data format across ObjNav, EVT, and ImageNav is not only a matter of convenience but a mechanism for transfer between embodied-navigation tasks with different goal specifications (Wang et al., 13 Jan 2026).

At the system level, the dataset is described as directly enabling the state-of-the-art results reported for VLingNav across a wide range of benchmarks, as well as robust zero-shot transfer to real quadrupeds. More generally, Nav-AdaCoT-2.9M occupies a specific position in the embodied-navigation literature: it is a large-scale dataset in which trajectory supervision, selective reasoning supervision, and linguistic memory supervision are co-designed rather than assembled independently.


References (1)

VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory (2026)
