
LangNavBench: Language Navigation Benchmark

Updated 3 July 2026
  • LangNavBench is a language-centric benchmark that assesses free-form, open-vocabulary instructions to guide embodied agents in semantic navigation tasks.
  • It incorporates multi-goal episodes, rigorous manual language verification, and detailed annotations across attributes like color, material, and spatial relations.
  • The introduced MLFM baseline reaches 43.6% success rate and 16.9% SPL on the test episodes, ahead of the prior mapping-based methods it is compared against.

LangNavBench is a comprehensive language-centric benchmark that evaluates the natural language grounding capabilities of embodied agents in semantic navigation settings. Unlike prior benchmarks, which typically constrain goal specification to fixed category names or rely on image-based instance references, LangNavBench supports open-vocabulary, free-form natural language instructions—ranging from simple categories (“candle”) to multi-attribute, spatially complex specifications (“the red short pillar candle on the night-stand”). LangNavBench is built upon the LangNav dataset, which features rigorously verified goal descriptions and fine-grained linguistic annotations, supporting the systematic assessment of language understanding in navigation. The benchmark also introduces the Multi-Layered Feature Map (MLFM) baseline, a zero-shot mapping-based approach that establishes new performance standards on language-driven navigation tasks (Raychaudhuri et al., 9 Jul 2025).

1. Motivation and Conceptual Foundation

The emergence of large-scale vision-language models (VLMs) has improved language-based semantic navigation, but prior datasets are limited by weakly grounded or error-prone linguistic goals. Existing benchmarks such as ObjectNav focus strictly on category recognition, while InstanceNav often relies on image-based goal selection, bypassing the challenge of understanding natural language queries. Furthermore, many prior corpora use VLM auto-captioning (e.g., BLIP-2), which introduces errors such as hallucinated or missing attributes, mesh artifacts, and incorrect spatial references. In contrast, LangNavBench provides:

  • Open-vocabulary, free-form goal descriptions with varying specificity.
  • Multi-goal sequential navigation episodes.
  • Tagging of goal descriptions by linguistic feature, enabling attributional and relational diagnostic analysis.
  • Manual audit and curation of all language passages, minimizing error rates relative to previous datasets (e.g., only 33% of goals in Goat-Bench were error-free) (Raychaudhuri et al., 9 Jul 2025).

2. Dataset Construction and Linguistic Annotation

LangNav, the core dataset for LangNavBench, is derived from the Habitat Synthetic Scenes Dataset (HSSD). Construction details include:

  • Scenes and Objects: 35 synthetic indoor scenes (20 for validation, 15 for test), encompassing 31 object categories (e.g., couch, night-stand, potted plant) and 251 unique instance meshes.
  • Linguistic Description Spectrum: Each goal object receives a natural-language note that may include:
    • Simple category mentions (“lamp”)
    • Adjectival attributes (colour, size, texture, material, state, number, lexical modifiers)
    • Support/spatial relationships (“vase on the coffee table”)
  • Dataset Size: 4,746 goal descriptions (2,496 validation; 2,250 test) and 5,642 tagged linguistic features.
  • Verification and Quality Control: Unlike corpora that extract attributes from vision-language models, all LangNav descriptions are created by prompting GPT-4 with strict reference to ground-truth attributes and then manually corrected, eliminating significant description errors.

3. Benchmark Structure and Evaluation Protocol

LangNavBench is framed around sequential, multi-goal navigation: each episode contains three goals, each specified by a natural-language instruction whose complexity and attribute structure can vary. The main elements of the evaluation protocol are:

  • Sequential Task Structure: At episode initiation, the agent receives the first goal description; subsequent goals are revealed only upon reaching the previous target or exhausting the step budget (episode continues regardless of single failures).
  • Success Criteria: A goal counts as reached if the agent declares success within 1.5 m of any valid viewpoint for the described object.
  • Metrics:

    • Success Rate (SR): Fraction of goals reached within a 500-step horizon.
    • Success weighted by Path Length (SPL):

    $$\text{SPL} = \frac{1}{N}\sum_{i=1}^{N} S_i\,\frac{\ell_i}{\max(\ell_i, p_i)},$$

    where $S_i$ is the success indicator for trial $i$, $\ell_i$ is the geodesic shortest-path length, and $p_i$ is the length of the path actually traversed.

  • Linguistic Feature Breakdown: SR and SPL are reported for each of eight linguistic feature classes (colour, size, texture, state, number, material, modifier, and support relations), enabling targeted evaluation of specific language constructs.
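
As an illustration of how these metrics combine, the following Python sketch computes SR, SPL, and a per-feature breakdown from a list of per-goal records. The record fields (`success`, `shortest`, `actual`, `tags`) and the sample values are hypothetical and are not taken from the benchmark's code or data.

```python
from collections import defaultdict


def success_rate(goals):
    """Fraction of goals the agent reached."""
    return sum(g["success"] for g in goals) / len(goals)


def spl(goals):
    """Success weighted by Path Length, averaged over all goals."""
    total = 0.0
    for g in goals:
        # Assumes shortest > 0, so the denominator is never zero.
        total += g["success"] * g["shortest"] / max(g["shortest"], g["actual"])
    return total / len(goals)


def breakdown(goals):
    """SR and SPL restricted to the goals carrying each linguistic tag."""
    by_tag = defaultdict(list)
    for g in goals:
        for tag in g["tags"]:
            by_tag[tag].append(g)
    return {tag: (success_rate(gs), spl(gs)) for tag, gs in by_tag.items()}


# Hypothetical per-goal records (path lengths in metres).
goals = [
    {"success": 1, "shortest": 4.0, "actual": 5.0, "tags": ["colour"]},
    {"success": 0, "shortest": 6.0, "actual": 9.0, "tags": ["colour", "material"]},
    {"success": 1, "shortest": 3.0, "actual": 3.0, "tags": ["material"]},
]

print(f"SR  = {success_rate(goals):.3f}")
print(f"SPL = {spl(goals):.3f}")
for tag, (sr, s) in sorted(breakdown(goals).items()):
    print(f"{tag}: SR = {sr:.3f}, SPL = {s:.3f}")
```

A failed goal contributes zero to SPL regardless of how far the agent travelled, and a successful goal is discounted by the ratio of the shortest path to the path actually taken.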

4. Multi-Layered Feature Map (MLFM): Approach and Algorithm

The MLFM baseline implements a map memory that explicitly stores and queries joint visual and linguistic features across multiple vertical slices of the scene:

  • Map Representation:

    $$\mathcal{M} \in \mathbb{R}^{L\times h\times w\times f_d}$$

    stores an $f_d$-dimensional CLIP-aligned feature vector per cell, stratified by $L$ height layers.

  • Update Mechanism:

    1. Patch-level SED-CLIP features are extracted from each RGB observation.
    2. Features are back-projected into a 3D point cloud using depth information; the vertical coordinate determines the height-band index.
    3. Temporal feature aggregation via exponential moving averages preserves accumulated semantic evidence.

  • Querying for Goals: A goal’s text embedding $\mathcal{Z}=f_{\rm text}(g)$ is convolved across the layered map via cosine similarity:

    $$S(l,x,y) = \frac{\mathcal{F}(l,x,y)\cdot \mathcal{Z}}{\|\mathcal{F}(l,x,y)\|\;\|\mathcal{Z}\|}$$

    where $\mathcal{F}(l,x,y)$ is the feature vector stored at layer $l$ and cell $(x,y)$. For support-relation queries, compositional matching is performed across adjacent height layers (layer $l$ for the supported object and an adjacent layer for the supporting object).

  • Exploration Policy: Exploration proceeds in two phases. In the first, only locations where high similarity and detection consensus (YOLO-World) overlap are selected; later, the policy exploits the similarity peak in the feature map.
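
The update and query steps above can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: it assumes patch features have already been back-projected to world-frame points (x, y, z) with z pointing up, stores the map sparsely as a dictionary instead of a dense tensor, and uses hypothetical values for the layer height, cell size, and moving-average weight `alpha`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


class LayeredFeatureMap:
    """Sparse stand-in for the L x h x w x f_d map: only observed cells are stored."""

    def __init__(self, num_layers, layer_height=0.5, cell_size=0.1, alpha=0.5):
        # layer_height, cell_size (metres) and alpha are illustrative assumptions.
        self.num_layers = num_layers
        self.layer_height = layer_height
        self.cell_size = cell_size
        self.alpha = alpha
        self.cells = {}  # (layer, row, col) -> feature vector

    def _index(self, x, y, z):
        # The vertical coordinate selects the height band; x and y select the cell.
        layer = int(math.floor(z / self.layer_height))
        layer = min(max(layer, 0), self.num_layers - 1)
        row = int(math.floor(y / self.cell_size))
        col = int(math.floor(x / self.cell_size))
        return layer, row, col

    def update(self, points, features):
        """Fuse new per-point features into the map with an exponential moving average."""
        for (x, y, z), feat in zip(points, features):
            key = self._index(x, y, z)
            old = self.cells.get(key)
            if old is None:
                self.cells[key] = list(feat)
            else:
                a = self.alpha
                self.cells[key] = [(1 - a) * o + a * f for o, f in zip(old, feat)]

    def query(self, text_embedding):
        """Cosine similarity between the text embedding and every stored cell."""
        return {key: cosine(feat, text_embedding) for key, feat in self.cells.items()}


# Toy example with 3-dimensional features standing in for CLIP embeddings.
fmap = LayeredFeatureMap(num_layers=2)
fmap.update(
    points=[(0.15, 0.25, 0.1), (0.15, 0.25, 0.7)],
    features=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
scores = fmap.query([1.0, 0.0, 0.0])
best = max(scores, key=scores.get)
print(best, scores[best])
```

In this toy example the two points share the same (x, y) cell but fall into different height bands, so the query can distinguish them; a single-layer map would have blended their features into one cell.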

5. Empirical Performance and Comparative Results

MLFM establishes a new performance baseline. On LangNavBench test episodes:

Method       SR (%)   SPL (%)
VLMaps         2.0      0.5
VLFM           3.0      1.0
OneMap-v2     39.4     15.1
MLFM          43.6     16.9

MLFM outperforms the other mapping-based navigation models in the table. Relative to the strongest competitor (OneMap-v2), SR improves by 4.2 pp (43.6 vs 39.4) and SPL by 1.8 pp (16.9 vs 15.1), while VLMaps and VLFM trail MLFM by more than 40 pp in SR. Detailed analysis by linguistic attribute shows MLFM excels particularly in:

  • Colour (38.5% vs 26.9% baseline SR)
  • Number (14.3% vs 7.1%)
  • Material (64.4% vs 57.1%)
  • Modifier and support-relational reasoning

This suggests that patch-level CLIP features, distributed across multiple vertical layers and combined with a text-convolutional kernel, provide superior grounding of fine-scale and spatial object attributes relative to flat or single-layer map memories.
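
The margins implied by the reported figures can be recomputed directly. The snippet below is only arithmetic on the numbers quoted in this section; the dictionary layout and the function name `margin_pp` are illustrative.

```python
# (SR %, SPL %) per method, copied from the results table above.
results = {
    "VLMaps": (2.0, 0.5),
    "VLFM": (3.0, 1.0),
    "OneMap-v2": (39.4, 15.1),
    "MLFM": (43.6, 16.9),
}

# Per-attribute SR (%) as (MLFM, baseline), copied from the list above.
attribute_sr = {
    "colour": (38.5, 26.9),
    "number": (14.3, 7.1),
    "material": (64.4, 57.1),
}


def margin_pp(method, baseline):
    """Percentage-point gap in (SR, SPL) between two methods, rounded to 0.1."""
    return tuple(round(m - b, 1) for m, b in zip(results[method], results[baseline]))


for other in ("VLMaps", "VLFM", "OneMap-v2"):
    sr_gap, spl_gap = margin_pp("MLFM", other)
    print(f"MLFM vs {other}: SR +{sr_gap} pp, SPL +{spl_gap} pp")

for name, (mlfm, base) in attribute_sr.items():
    print(f"{name}: +{round(mlfm - base, 1)} pp SR")
```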

6. Analysis, Strengths, and Areas for Development

LangNavBench advances semantic navigation evaluation in several respects:

  • Rich Linguistic Annotation: Systematic tagging of goal descriptions supports diagnostic isolation of comprehension challenges.
  • Error-Free Goal Corpus: Manual correction achieves a higher standard of linguistic-label quality than prior datasets.
  • Strong MLFM Baseline: MLFM’s explicit, multi-layered map memory and consensus-based exploration policy yield significant performance gains on fine-attribute and spatial-relational queries.

Identified limitations include:

  • The LangNav dataset’s coverage of language phenomena is incomplete: it excludes coreference, negation, rare spatial prepositions, pick-up and manipulation instructions, and multi-step verbs.
  • MLFM exhibits failures on texture-centric goals (0% SR) and has deficiencies regarding state recognition, suggesting opportunities for enhanced visual feature extractors and object detectors.

7. Future Directions

LangNavBench establishes a foundation for further research. Anticipated expansions and research trajectories include:

  • Incorporation of more sophisticated linguistic structures—negation, fine-grained counting, relative spatial terms (e.g., “left of”), and sequential/multi-step instructions.
  • Integration of object affordance reasoning or 3D geometric modules to better model texture and temporally dynamic states.
  • Hybridization of MLFM’s explicit, vectorized map with learned stateful policies, including reinforcement learning and modular skill chaining, to merge parametric and non-parametric reasoning.
  • Benchmarking of LLMs as zero-shot planners within LangNavBench, to probe emergent compositions in language grounding and instruction-following (Raychaudhuri et al., 9 Jul 2025).

LangNavBench and LangNav thus fill a critical gap in evaluation methodology for language-based embodied navigation, offering both a resource and a testbed for advancing perceptual, memory, and grounding capabilities in the field.
