VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation (2510.20818v1)

Published 23 Oct 2025 in cs.RO, cs.AI, and cs.LG

Abstract: A fundamental challenge in robot navigation lies in learning policies that generalize across diverse environments while conforming to the unique physical constraints and capabilities of a specific embodiment (e.g., quadrupeds can walk up stairs, but rovers cannot). We propose VAMOS, a hierarchical VLA that decouples semantic planning from embodiment grounding: a generalist planner learns from diverse, open-world data, while a specialist affordance model learns the robot's physical constraints and capabilities in safe, low-cost simulation. We enabled this separation by carefully designing an interface that lets a high-level planner propose candidate paths directly in image space that the affordance model then evaluates and re-ranks. Our real-world experiments show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation than state-of-the-art model-based and end-to-end learning methods. We also show that our hierarchical design enables cross-embodied navigation across legged and wheeled robots and is easily steerable using natural language. Real-world ablations confirm that the specialist model is key to embodiment grounding, enabling a single high-level planner to be deployed across physically distinct wheeled and legged robots. Finally, this model significantly enhances single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans. Website: https://vamos-vla.github.io/

Summary

  • The paper introduces a hierarchical framework that decouples high-level semantic planning from embodiment-specific affordance evaluation for versatile navigation.
  • It combines a fine-tuned vision-language model with a lightweight, simulation-trained affordance model to generate steerable, physically feasible paths that are predicted in 2D image space and lifted to the 3D ground plane.
  • The modular design enables rapid adaptation across different robot embodiments, achieving high success rates in both indoor and complex outdoor scenarios.

Hierarchical Vision-Language-Action Models for Capability-Modulated Navigation: An Analysis of VAMOS

Introduction

The VAMOS framework introduces a hierarchical vision-language-action (VLA) model for general-purpose, capability-aware, and steerable robot navigation. The central contribution is the explicit decoupling of high-level semantic planning from embodiment-specific grounding, achieved by combining a generalist vision-language model (VLM) planner with a lightweight, per-embodiment affordance model. This design enables robust navigation across diverse environments and robot embodiments, while supporting natural language steerability and efficient adaptation to new platforms. The following analysis details the architecture, training methodology, experimental results, and implications for future research in generalist robot navigation.

System Architecture

VAMOS is structured as a two-level hierarchy:

  1. High-Level Planner (VLM): A vision-language model, fine-tuned on heterogeneous real-world navigation datasets, predicts candidate 2D paths in image space given an input image and a goal coordinate (optionally with appended natural language preferences).
  2. Affordance Model: A lightweight, embodiment-specific function trained in simulation evaluates and re-ranks the candidate paths based on the robot's physical capabilities and local terrain, ensuring only feasible trajectories are executed.

This architecture is illustrated in Figure 1.

Figure 1: The high-level planner is a VLM trained to take an image and a goal coordinate (encoded as text) as input, optionally appending natural language preferences, and to output a set of candidate paths in pixel space. These paths are encoded as strings of location token pairs, then decoded and projected from 2D pixel space to the 3D ground plane. Finally, a capability-aware affordance function evaluates and re-ranks the 3D candidate paths to determine which path the robot should execute in the real world based on low-level policy capabilities.

The interface between the planner and the affordance model is a predicted 2D path, which is projected to the 3D ground plane for affordance evaluation. This design enables the planner to leverage large, heterogeneous datasets while allowing the affordance model to enforce embodiment-specific constraints.
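As a concrete illustration of this interface, the sketch below back-projects predicted pixel waypoints onto a flat ground plane using the camera intrinsics and pose. The function names and calibration values are ours, and the actual system may instead intersect rays with the local elevation map rather than assuming flat ground.

```python
import numpy as np

def pixel_to_ground(u, v, K, R_wc, t_wc, ground_z=0.0):
    """Back-project pixel (u, v) onto the plane z = ground_z in the world frame.

    K    : 3x3 camera intrinsics
    R_wc : 3x3 rotation mapping camera coordinates to world coordinates
    t_wc : camera position in the world frame
    Returns the 3D intersection point, or None if the ray misses the plane.
    """
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])  # viewing ray, camera frame
    ray_world = R_wc @ ray_cam                           # viewing ray, world frame
    if abs(ray_world[2]) < 1e-9:
        return None                                      # ray parallel to the ground
    s = (ground_z - t_wc[2]) / ray_world[2]              # solve t_wc + s*ray on the plane
    return t_wc + s * ray_world if s > 0 else None       # discard points behind the camera

# Illustrative calibration: camera 0.6 m above the ground, pitched 20 deg downward,
# world frame with x forward, y left, z up.
pitch = np.deg2rad(20.0)
R_wc = np.column_stack([
    [0.0, -1.0, 0.0],                                    # camera x-axis in world coords
    [-np.sin(pitch), 0.0, -np.cos(pitch)],               # camera y-axis (points down)
    [np.cos(pitch), 0.0, -np.sin(pitch)],                # optical axis (forward and down)
])
t_wc = np.array([0.0, 0.0, 0.6])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])

path_2d = [(320, 400), (330, 360), (345, 330)]           # predicted pixel waypoints
path_3d = [pixel_to_ground(u, v, K, R_wc, t_wc) for u, v in path_2d]
```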

High-Level Planner: Data, Training, and Steerability

The VLM planner is fine-tuned from a pre-trained PaliGemma 2 3B model using a curated mix of four real-world navigation datasets (SCAND, TartanDrive 2, CODa, and Spot), spanning 29.8 hours and multiple robot embodiments. The training data is processed to balance short- and long-horizon trajectories, filter for high-curvature (non-trivial) paths, and align all trajectories to a unified 2D pixel-point representation. This enables the planner to generalize across variable action spaces and sensor modalities.
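The summary does not specify the exact token vocabulary, so the sketch below illustrates one plausible encoding of a pixel path as a string of location-token pairs, loosely following the PaliGemma convention of <locNNNN> tokens over a normalized grid; the helper names and the prompt format are assumptions, not the paper's exact scheme.

```python
import re

def encode_path(path_xy, img_w, img_h, bins=1024):
    """Encode pixel waypoints as a string of <locNNNN> token pairs (y first, then x)."""
    tokens = []
    for x, y in path_xy:
        tx = min(int(x / img_w * bins), bins - 1)   # quantize x into [0, bins - 1]
        ty = min(int(y / img_h * bins), bins - 1)   # quantize y into [0, bins - 1]
        tokens.append(f"<loc{ty:04d}><loc{tx:04d}>")
    return "".join(tokens)

def decode_path(token_str, img_w, img_h, bins=1024):
    """Invert encode_path, returning pixel-space (x, y) waypoints."""
    vals = [int(v) for v in re.findall(r"<loc(\d{4})>", token_str)]
    pairs = zip(vals[0::2], vals[1::2])             # (y, x) pairs, mirroring the encoder
    return [((tx + 0.5) / bins * img_w, (ty + 0.5) / bins * img_h) for ty, tx in pairs]

# Illustrative prompt and target string for one training example.
prompt = "goal: (3.2, 1.5); preference: prefer ramps over stairs"
target = encode_path([(320, 400), (330, 360), (345, 330)], img_w=640, img_h=480)
```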

Steerability is achieved by augmenting 10% of the data with VLM-generated textual annotations and co-training with visual question-answering datasets. At inference, natural language preferences can be appended to the goal specification, allowing the planner to generate paths that respect user-specified constraints (e.g., "stay to the right of people" or "prefer ramps over stairs"), as demonstrated in Figure 2.

Figure 2: Vamos is steerable through natural language preferences appended to its goal coordinate specification. Different preferences are indicated by the shown natural language prompts and depicted using different colors.

Affordance Model: Simulation-Based Capability Modulation

The affordance model F_π is trained in simulation for each robot embodiment. It predicts the traversability of a given (x, y, a) tuple (position and heading) in a local elevation map M, outputting a probability of successful traversal under the robot's low-level policy. Training data is generated by rolling out the policy over procedurally generated terrains (including stairs, ramps, and irregular surfaces) and labeling each attempt as success or failure. The model is implemented as an MLP trained with binary cross-entropy loss.
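The exact network size and input features are not given in this summary, so the following minimal sketch only illustrates the general recipe: a small MLP mapping a local elevation-map window and a queried (x, y, heading) state to a traversal probability, fit with binary cross-entropy on simulated success/failure labels. All dimensions and the placeholder tensors are illustrative.

```python
import torch
import torch.nn as nn

class AffordanceMLP(nn.Module):
    """Minimal sketch of an embodiment-specific affordance model F_pi.

    Input: a flattened local elevation-map window plus a queried (x, y, heading).
    Output: a logit for the probability that the low-level policy traverses that state.
    """
    def __init__(self, map_cells=32 * 32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(map_cells + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, elevation_patch, query_xya):
        feats = torch.cat([elevation_patch.flatten(1), query_xya], dim=-1)
        return self.net(feats).squeeze(-1)  # logits; apply sigmoid for probabilities

model = AffordanceMLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

# One training step on a batch of simulated rollouts:
# elevation windows (B, 32, 32), queried states (B, 3), success labels in {0, 1}.
elev = torch.randn(64, 32, 32)            # placeholder for simulator elevation maps
query = torch.rand(64, 3)                 # placeholder (x, y, heading) queries
labels = torch.randint(0, 2, (64,)).float()

loss = bce(model(elev, query), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```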

This approach enables rapid, safe, and low-cost adaptation to new embodiments by retraining only the affordance model, while reusing the generalist high-level planner.

Deployment and Control Loop

At deployment, the system operates as follows:

  1. The VLM planner receives the current image and goal coordinate, generating K candidate 2D paths.
  2. Each path is projected to the 3D ground plane.
  3. The affordance model evaluates each path, computing a cumulative score (minimum affordance along the path).
  4. The path with the highest affordance is selected and executed by the low-level controller in a receding horizon fashion, replanning after a fixed number of waypoints or upon timeout.
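A minimal sketch of the re-ranking step is given below, assuming hypothetical helpers for the planner, projection, and controller. Each path is scored by its minimum affordance along the waypoints, and an optional softmax temperature allows soft sampling instead of a strict argmax.

```python
import numpy as np

def select_path(candidate_paths_3d, affordance_fn, temperature=None, rng=None):
    """Re-rank candidate 3D paths by their worst-case (minimum) affordance.

    candidate_paths_3d : list of arrays of shape (num_waypoints, 3)
    affordance_fn      : maps one waypoint (heading handling elided here) to a
                         traversal probability in [0, 1]
    temperature        : if given, sample softly over path scores instead of
                         taking the argmax, adding controlled stochasticity
    """
    scores = np.array([
        min(affordance_fn(wp) for wp in path) for path in candidate_paths_3d
    ])
    if temperature is None:
        return int(np.argmax(scores)), scores
    rng = rng or np.random.default_rng()
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(scores), p=probs)), scores

# Receding-horizon loop (planner / projection / controller calls are hypothetical):
# while not at_goal():
#     paths_2d = vlm_planner(image, goal_text)             # K candidate image-space paths
#     paths_3d = [project_to_ground(p) for p in paths_2d]   # see projection sketch above
#     best, _ = select_path(paths_3d, affordance_fn)
#     execute_waypoints(paths_3d[best][:m])                 # follow m waypoints, then replan
```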

This modularity allows for efficient cross-embodiment transfer and robust operation in diverse environments.

Experimental Results

Real-World Navigation Performance

VAMOS was evaluated on both legged (Boston Dynamics Spot) and wheeled (UW Hound) robots across challenging indoor and outdoor courses, including narrow hallways, cluttered atria, occluded labs, urban campuses with stairs, vegetated forests, and down-ramps. The system was compared against a geometric modular stack, ViPlanner, NoMaD, and NaVILA.

VAMOS achieved the highest average success rate (90%) across all courses, outperforming both model-based and end-to-end learning baselines. Notably, it matched or exceeded the modular stack in structured indoor environments and significantly outperformed all baselines in complex outdoor and long-horizon tasks. The use of 2D trajectory prediction preserved the VLM's generalization capabilities, while the affordance model prevented execution of infeasible plans.

Figure 3: Top-down map showing paths taken by different methods from start (red) to goal (green) through waypoints (yellow). Vamos achieves long-horizon, precise navigation. Right: predicted and selected paths when replanning after reaching a waypoint. Dotted lines show returns to the last completed waypoint after interventions; X's mark baseline failures or timeouts.

Cross-Embodiment Generalization

A key result is the ability to deploy the same high-level planner across different robot embodiments by swapping only the affordance model. In a scenario requiring selection between stairs and a ramp, the affordance model enabled the wheeled robot to select only the ramp, while the legged robot could traverse both. Without affordance modulation, the wheeled robot failed on stairs in 40% of trials; with modulation, its success rate increased to 90%.

Figure 4: We evaluate the cross-embodiment capabilities of Vamos on a wheeled robot, Hound, where the goal (red X) is to reach an elevated floor through either a ramp or stairs. We show 10 candidates predicted by the VLM and their corresponding affordance scores. Re-ranking with the affordance function enables higher success rates in cross-embodied navigation, as reported in the paper's table of robot navigation choices.

Steerability via Natural Language

The system demonstrated robust steerability, with VLM-generated paths aligning with user-specified preferences in all tested cases. This supports flexible, on-the-fly adaptation to user intent without retraining.

Data Pooling and Generalist vs. Specialist Models

Pooling data from multiple robot datasets improved the VLM's offline path prediction accuracy across all metrics (mean/max L2 error, Fréchet distance, DTW), as shown in Figures 5 and 6. Statistical significance tests confirmed consistent improvements over robot-specific models.

Figure 5: Pooling data across all robot datasets (red) improves model performance compared to training specialist navigation models on individual robot datasets (teal). We evaluate over the entire validation set. Error bars represent 95% CI.

Figure 6: Max L2 Error.
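For reference, the sketch below implements two of the reported offline metrics: mean/max L2 error (assuming the predicted and ground-truth paths have matching waypoint counts) and a plain dynamic time warping (DTW) distance. The Fréchet distance is omitted for brevity, and this is a generic implementation rather than the paper's evaluation code.

```python
import numpy as np

def l2_errors(pred, gt):
    """Mean and max pointwise L2 error between paths with matching waypoint counts."""
    d = np.linalg.norm(np.asarray(pred, float) - np.asarray(gt, float), axis=-1)
    return d.mean(), d.max()

def dtw_distance(pred, gt):
    """Plain dynamic-time-warping distance between two waypoint sequences."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    n, m = len(pred), len(gt)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(pred[i - 1] - gt[j - 1])
            acc[i, j] = cost + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[n, m]

pred = [(0, 0), (1, 0.2), (2, 0.1)]   # toy predicted path
gt = [(0, 0), (1, 0.0), (2, 0.0)]     # toy ground-truth path
print(l2_errors(pred, gt), dtw_distance(pred, gt))
```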

Affordance Modulation for Single-Robot Robustness

Affordance modulation also improved reliability in single-embodiment, out-of-distribution settings by filtering out VLM-predicted paths that violated physical constraints (e.g., paths through obstacles), increasing success rates from 20% to 60% in challenging scenarios.

Figure 7: The affordance function also helps filter out noisy VLM predictions in single-embodiment OOD settings. In this example, it filters out VLM-predicted paths that go through obstacles to reach the goal (red X), leading to a higher success rate, as reported in the paper's value-function comparison table.

Implementation Considerations

  • Training: The VLM can be fine-tuned with LoRA adapters on consumer GPUs (e.g., RTX 4090), with full training completing on 8x Nvidia L40s in ~5 hours (see the sketch after this list). Overfitting is mitigated by limiting training to a single epoch and by careful data curation.
  • Affordance Model: Training is performed entirely in simulation, requiring only a representative low-level policy and procedurally generated terrains. The model is a small MLP, enabling rapid retraining for new embodiments.
  • Deployment: High-level inference runs at 0.5–1 Hz on an RTX 3080 laptop; the affordance model runs onboard a Jetson Orin AGX. Sensor fusion (LiDAR, depth cameras) is used to construct local elevation maps.
  • Failure Modes: The system can struggle with dynamic obstacles and may overshoot/undershoot turns behind occlusions, primarily due to static training data and uniform trajectory subsampling.
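As a rough illustration of the LoRA setup mentioned in the Training bullet above, the snippet below attaches low-rank adapters to a PaliGemma 2 checkpoint with Hugging Face PEFT. The checkpoint id, adapter rank, and target modules are assumed defaults, not values reported by the authors.

```python
import torch
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor
from peft import LoraConfig, get_peft_model

# Checkpoint id, adapter rank, and target modules below are illustrative choices.
base_id = "google/paligemma2-3b-pt-224"
model = PaliGemmaForConditionalGeneration.from_pretrained(base_id, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(base_id)

lora_cfg = LoraConfig(
    r=16,                      # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices are trainable

# Each training example pairs an image and a goal/preference prompt with the target
# path string (see the token-encoding sketch earlier in this article), optimized
# with the usual next-token cross-entropy loss for a single epoch.
```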

Theoretical and Practical Implications

VAMOS demonstrates that explicit hierarchical decomposition—separating semantic planning from embodiment grounding—enables scalable, generalist navigation policies that are robust to both environmental and embodiment heterogeneity. The use of 2D path prediction as an interface preserves the generalization capacity of large VLMs while supporting efficient, simulation-based adaptation to new robots. The architecture supports natural language steerability, facilitating human-robot interaction in open-world settings.

The results challenge the notion that monolithic end-to-end models or traditional modular stacks are sufficient for general-purpose navigation, highlighting the importance of structured interfaces and embodiment-aware modulation.

Future Directions

  • Dynamic Environments: Incorporating dynamic obstacle modeling and online adaptation to moving agents.
  • Data Selection: Automated data mixture optimization for further improving generalization and sample efficiency.
  • Affordance Learning: Extending affordance models to handle more complex, multi-modal traversability and integrating uncertainty estimation.
  • Broader Embodiment Transfer: Scaling to a wider range of robot morphologies and control policies, including manipulation and aerial platforms.
  • End-to-End Differentiability: Exploring joint training of planner and affordance models for improved coordination.

Conclusion

VAMOS establishes a new state-of-the-art in general-purpose, cross-embodiment, and steerable robot navigation by hierarchically combining a generalist VLM planner with a lightweight, simulation-trained affordance model. The explicit separation of semantic planning and embodiment grounding enables robust transfer, high reliability, and flexible user interaction, providing a scalable foundation for open-world navigation agents. The framework's modularity, data efficiency, and strong empirical performance suggest promising directions for future research in generalist robot autonomy.


Explain it Like I'm 14

Explaining “VAMOS: A Hierarchical Vision-Language-Action Model for Capability‑Modulated and Steerable Navigation”

1) Overview: What is this paper about?

This paper introduces VAMOS, a two-part “brain” that helps robots find and follow safe paths in the real world. It separates big-picture planning (where to go) from body-specific abilities (what this robot can actually do). A vision-language model (an AI that understands pictures and words) draws several possible paths on the camera image, and a second model checks which of those paths the robot’s body can safely handle. This makes navigation more reliable, works across different kinds of robots (like ones with legs or wheels), and lets people steer the robot with simple text instructions.

Key idea in simple terms: one model plans like a map scout, another model acts like a safety inspector who knows the robot’s strengths and limits.

2) What questions did the researchers ask?

They set out to answer a few practical questions:

  • Can one navigation system work well in many places, indoors and outdoors?
  • Can the same planner work across different robot “bodies” (embodiments), like legged robots that can climb stairs versus wheeled robots that can’t?
  • How can we use large mixed datasets (from many robots and terrains) without confusing a robot with moves it physically can’t do?
  • Can we steer the robot’s choices with simple text, like “prefer stairs” or “keep left”?
  • Does splitting planning (general) and physical capabilities (specific) make robots more reliable?

3) How does VAMOS work? (Methods in everyday language)

Think of VAMOS as a two-layer system:

  • High-level planner (vision-language model, VLM):
    • Inputs: a live camera image plus a goal location (written as text), and optionally a preference (like “take the ramp” or “keep to the right”).
    • Output: several candidate paths drawn on the image (like tracing lines on a photo of what the robot sees).
    • Why draw in 2D on the image? It’s a simple, consistent way to learn from many different datasets and robots. Later, those path points are converted into real-world ground positions the robot can follow.
  • Low-level affordance model (capability-aware “safety inspector”):
    • “Affordance” means: can the robot safely go here, given the terrain and its abilities?
    • This model scores each candidate path based on whether the robot can traverse it: for instance, legs can handle stairs, but wheels usually need ramps.
    • It’s trained cheaply and safely in simulation by trying lots of terrains and recording success or failure. That way, it learns, “This slope is okay,” or “That obstacle is too high.”

Workflow:

  1. The planner proposes multiple paths on the image.
  2. The system converts those paths to ground positions.
  3. The affordance model scores each point along each path; a path is rejected if any point is unsafe.
  4. The robot picks the safest high-scoring path to follow and replans frequently as it moves.

Training approach:

  • The planner is a large AI that understands images and text, fine-tuned on diverse, real-world navigation data from several robot datasets.
  • The affordance model is trained in simulation (virtual environments with varied terrain) so it learns physical limits without risking real hardware.
  • Steerability is taught by adding text descriptions and preferences during training, so the planner learns to respect instructions.

4) What did they find, and why is it important?

Here are the main results, explained simply:

  • Strong real-world performance: VAMOS had the highest average success rate across tough indoor and outdoor courses (narrow hallways, cluttered labs, low light areas, long campus routes, forests, ramps, and stairs).
  • Works across robot types: The same high-level planner worked on a legged robot (Spot) and a wheeled robot (Hound). By swapping only the affordance model, the system chose stairs for Spot (which can handle them) and ramps for Hound (which can’t do stairs), improving Hound’s success from 60% to 90%.
  • Language steerability: Adding simple text preferences (like “take the ramp” or “stay to the right”) changed the chosen path in sensible ways. A separate model judging the outputs confirmed the preferences matched the resulting paths.
  • Better from pooled data: Training the planner on a mix of different robots and terrains helped it generalize better than training on a single dataset.
  • More reliable plans: The affordance model filtered out unrealistic or unsafe paths the planner sometimes suggested (like going through obstacles), which significantly boosted success rates in challenging, unfamiliar scenarios. In certain settings, that meant up to roughly triple the success rate compared to not filtering.

Why this matters:

  • Robots get more dependable: The safety filter catches bad ideas before the robot tries them.
  • One planner, many robots: You don’t need to rebuild a whole system for every new robot—just retune the affordance model.
  • Human-friendly control: You can steer the robot’s style of movement using plain language.

5) What is the impact of this research?

VAMOS shows a practical way to build general-purpose robot navigation that:

  • Scales with diverse data without getting confused by mixed abilities in the dataset.
  • Transfers across different robot bodies with minimal changes.
  • Accepts human guidance through natural language.
  • Increases real-world reliability by checking physical feasibility before acting.

Big picture: This is a step toward “open-world” robots that can understand scenes, plan smart paths, respect their physical limits, and respond to simple instructions—making them safer and easier to deploy in everyday environments.


Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concise list of unresolved issues and missing analyses that future work could address.

  • Monocular-only perception: robustness and gains from adding depth/LiDAR, stereo, or multi-sensor fusion to both planner and affordance model are unexplored.
  • Elevation map source and quality: the paper does not specify how elevation maps are built on real robots (sensor modality, mapping pipeline, update rate, failure cases), nor quantify how map errors impact Fπ and overall success.
  • Sim-to-real transfer of Fπ: the affordance model is trained entirely in simulation with a proxy Spot policy; there is no quantitative analysis of domain mismatch, calibration, or how diversity/randomization choices affect real-world reliability.
  • Limited embodiment coverage: cross-embodiment claims are demonstrated on only two platforms (Spot, Hound); generalization to broader embodiments (tracked vehicles, micro-rovers, aerial, amphibious) and varied kinematics is untested.
  • Heading discretization in Fπ: using eight discrete headings may miss feasibility constraints for non-holonomic vehicles with tight turning radii; continuous heading/action conditioning and kinematic constraints are not considered.
  • Path-level feasibility modeling: Fπ scores points independently and aggregates via min; modeling cumulative risk, recovery likelihood, stability margins, slip, and energy/comfort costs along trajectories remains open.
  • Candidate diversity and recall: the VLM’s K candidate paths may omit the truly feasible/optimal route; how to ensure diverse, high-recall proposals (beam search, coverage metrics, diversity regularizers) is not studied.
  • Selection objective: re-ranking uses only affordance; trade-offs with goal progress, path length, smoothness, and language preferences are not formulated, tuned, or ablated.
  • Occlusions and long-range goals: missions require the goal to be in the image; rotating in place to re-acquire the goal is brittle under occlusion or complex layouts; map/memory-based goal reasoning and multi-view planning are not addressed.
  • Ground-plane projection limits: projecting pixel paths to a “ground plane” can be inaccurate on stairs, ramps, multi-level structures, and steep slopes; a 3D-aware projection and elevation-aware path lifting is not presented or validated.
  • Calibration sensitivity: the approach assumes known intrinsics/extrinsics; sensitivity to calibration errors, camera height/tilt changes, and lens distortions is not evaluated.
  • Runtime and resource footprint: inference latency, throughput, onboard compute requirements, and energy use for the VLM and Fπ are not reported, leaving real-time viability uncertain.
  • Parameter sensitivity: key hyperparameters (path horizon H, K candidates, waypoint count k and executed m, softmax temperature β) lack ablation, tuning guidance, and sensitivity analysis.
  • Failure mode analysis: the paper reports SR/timeouts/interventions but does not categorize failures (perception errors vs. bad proposals vs. affordance miscalibration) or provide corrective strategies.
  • Dynamic obstacles and social compliance: handling moving agents, human-robot interaction, and social norms (despite training on SCAND) is not tested or evaluated.
  • Safety guarantees: no formal safety constraints or guarantees (collision avoidance bounds, certified filtering) are integrated; how language steerability interacts with safety is not specified.
  • Language steerability robustness: quantitative evaluation is limited (single image, VLM-as-a-judge); robustness to ambiguous, contradictory, or long-horizon instructions and human-in-the-loop trials is missing.
  • Preference-affordance integration: when language preferences conflict with feasibility, the decision logic (weights between preference satisfaction and Fπ) is undefined; principled multi-objective fusion is absent.
  • Data curation transparency: the “empirically determined” data mix and filtering (curvature, horizons) are not detailed; contributions of each dataset and potential negative transfer are not ablated.
  • Goal encoding as text: the numeric coordinate tokenization format, its errors, and generalization across scales/units are not analyzed; alternative spatial goal encodings could be explored.
  • Map scale and footprint: Fπ’s map window (W×H), metric scale, and alignment across robots are unspecified; sensitivity to map resolution and footprint size is unknown.
  • Affordance training labels: short-horizon binary success may miss longer-horizon feasibility (e.g., dead-ends, turning feasibility later); richer labels (distance-to-failure, success probability over horizons) are unexplored.
  • Online adaptation: the system does not learn from deployment (no self-calibration, no online updates to VLM/Fπ) or exploit feedback from π to improve proposals and rankings over time.
  • Evaluation breadth and statistics: only five trials per course and some missing baseline metrics (— entries); statistical significance, environmental diversity (weather, nighttime, mud, snow), and broader stress testing are lacking.

Practical Applications

Practical Applications of VAMOS

Below, applications are grouped as Immediate Applications (deployable now with typical robotics infrastructure) and Long-Term Applications (requiring further research, scaling, or development). Each item notes sectors, actionable use cases, potential tools/products/workflows, and key assumptions or dependencies affecting feasibility.

Immediate Applications

  • Robotics and Logistics — Drop‑in, steerable navigation for mobile robots
    • Use cases: Warehouse/campus delivery robots that must avoid stairs, prefer ramps or sidewalks, and adapt routes via natural language (e.g., “keep to the right,” “avoid grass,” “take the ramp”).
    • Tools/products/workflows: “VAMOS Navigator SDK” integrating the VLM path planner with per‑robot affordance adapters; ROS node exposing a path‑in‑image API; operator UI overlaying candidate paths and affordance heatmaps; soft re‑ranking to inject safe stochasticity.
    • Assumptions/dependencies: Monocular RGB camera; reliable elevation map or depth source; calibrated intrinsics/extrinsics; a low‑level locomotion controller (velocity/position tracking); local/global localization (e.g., GPS/SLAM); compute for LoRA‑finetuned VLM; safety monitors.
  • Construction and Mining — Safer autonomy assistance on uneven terrain
    • Use cases: Site inspection carts/UGVs that must traverse ramps and avoid steps, ruts, or loose aggregate; choose traversable detours around temporary obstacles.
    • Tools/products/workflows: Simulation‑trained affordance models tailored to site vehicles; “Embodiment Adapter” training loop to update affordance with new terrain catalogs; deployment in a receding horizon scheme for cautious long‑range goals.
    • Assumptions/dependencies: High‑fidelity terrain models and elevation maps; sim‑to‑real transfer across season/weather; integration with existing safety interlocks; operator language prompts.
  • Public Safety/Defense — Cross‑embodiment mission planning and safety gating
    • Use cases: Search-and-rescue teams fielding both legged and wheeled platforms; swap affordance modules while reusing the same planner so each robot picks feasible routes (e.g., legged up stairs, wheeled via ramp).
    • Tools/products/workflows: “Cross‑Embodiment Deployment Kit” with pre‑trained planners and affordance modules; mission steerability via constrained language prompts (“avoid debris,” “stay on concrete”); route filtering that rejects physically infeasible plans (3× reliability observed).
    • Assumptions/dependencies: Robust comms and localization; validated affordance scores per embodiment; scenario‑appropriate risk thresholds; human-in-the-loop oversight.
  • Healthcare (Hospitals) — Service robot navigation with accessibility preferences
    • Use cases: Hospital delivery robots that avoid stairs, prefer ramps/elevators, and navigate cluttered corridors under low light; quick operator steering (“avoid waiting area,” “take corridor B”).
    • Tools/products/workflows: VLM goal specification from hospital map waypoints; text preference profiles (e.g., “accessibility mode”); affordance re‑ranking that enforces non‑stair policies for wheeled robots.
    • Assumptions/dependencies: Reliable floor‑level mapping; elevator integration; staff training for language steerability; HIPAA/privacy compliance (camera use).
  • Agriculture — Row navigation and terrain preference control
    • Use cases: Orchard/field robots that follow rows, avoid crop beds, and prefer compacted soil tracks; steer with natural language (“stay in row,” “avoid soft soil”).
    • Tools/products/workflows: Affordance training in simulated field terrains; waypointing along GPS/RTK rows; VLM overlay of candidate paths for operator validation.
    • Assumptions/dependencies: Robust elevation/terrain sensing in foliage; localization under canopy; seasonal sim‑to‑real transfer; weather‑hardening.
  • Infrastructure Inspection — Reliable mobility for refineries, plants, and campuses
    • Use cases: Routine inspection robots navigating complex facilities; obey route constraints (e.g., “no stairs,” “stay out of high‑traffic zones”).
    • Tools/products/workflows: Affordance‑gated mission planner; “Route Compliance” language prompts; receding horizon control with periodic re‑planning around occlusions.
    • Assumptions/dependencies: Accurate, up‑to‑date facility maps; safe speed limits; hazard detectors; calibration in GPS‑denied environments.
  • Software (Autonomy Stacks) — Path‑in‑image interface and adapter layer
    • Use cases: Integrate a VLM that predicts continuous 2D image paths into existing stacks; re‑rank via affordance; feed selected waypoints to existing controllers.
    • Tools/products/workflows: “Path‑Planning VLM API” and “Affordance Heatmap Service”; LoRA‑based finetuning recipes; CI pipelines to validate planner reliability against unit terrains.
    • Assumptions/dependencies: Standardized elevation map format (windowed around robot); heading discretization alignment; performance budgets for real‑time inference.
  • Academia/Education — Teaching hierarchical VLA and cross‑embodiment methods
    • Use cases: Courses and labs demonstrating decoupled semantic planning and embodiment grounding; assignments on simulation‑trained affordances and data pooling from heterogeneous robots.
    • Tools/products/workflows: Reproducible finetuning scripts (e.g., PaliGemma 2 3B with LoRA); dataset curation filters (curvature, horizons, noise); steerability augmentation via VLM annotations.
    • Assumptions/dependencies: Access to datasets (SCAND, TartanDrive 2, CODa, in‑domain); simulator (Isaac Lab) and procedural terrain generators; GPU resources.
  • Daily Life (Consumer Robotics) — Voice‑steered home navigation
    • Use cases: Home service robots (fetch, delivery) obeying user preferences (“avoid carpet,” “go around the coffee table”); rejecting infeasible paths to reduce accidents.
    • Tools/products/workflows: Lightweight affordance adapters per model of home robot; user‑level prompts; AR overlay of candidate paths for transparency.
    • Assumptions/dependencies: On‑device or edge compute; privacy‑preserving camera use; depth/elevation estimation indoors; household variability and clutter.

Long‑Term Applications

  • City‑scale general‑purpose navigation across fleets
    • Use cases: Mixed fleets (delivery carts, legged robots) operating outdoors with shared planners and embodiment‑specific affordance modules; mission steerability by city operations staff.
    • Tools/products/workflows: Fleet orchestration layer that distributes goals and preference policies; centralized “Adapter Store” for affordance models by robot type.
    • Assumptions/dependencies: Robust localization without pre‑built maps; large‑scale dataset pooling; urban safety certification.
  • Mapless navigation at scale with robust open‑world generalization
    • Use cases: Reduce dependence on traditional modular stacks and maps; rely on image‑space planning plus embodiment gating in previously unseen environments.
    • Tools/products/workflows: Continuous learning pipelines; failure‑aware replanning; self‑supervised data capture for coverage expansion.
    • Assumptions/dependencies: Strong domain generalization; reliable elevation estimation from monocular cues; on‑device adaptation mechanisms.
  • Standardized affordance libraries and certification frameworks
    • Use cases: Industry‑wide affordance benchmarks for different embodiments; certification that a planner/affordance pairing reliably rejects infeasible actions (safety gating).
    • Tools/products/workflows: Public affordance datasets and testing suites; regulatory guidelines referencing embodiment‑aware gating.
    • Assumptions/dependencies: Agreement on elevation map and heading standards; policy/regulatory adoption; third‑party audits.
  • Cross‑domain extension to aerial, underwater, and manipulators
    • Use cases: Drones with 3D path‑in‑image planning (e.g., avoidance of no‑fly zones); underwater ROVs respecting hull clearances; mobile manipulators grounding reachability.
    • Tools/products/workflows: New 3D/SE(3) path interfaces; domain‑specific affordance models trained in high‑fidelity simulators.
    • Assumptions/dependencies: Appropriate sensing modalities (sonar/lidar); re‑designed interface to capture altitude/depth; domain constraints (e.g., currents, wind).
  • Multi‑robot teaming with language‑steered coordination
    • Use cases: Teams of heterogeneous robots assigned complementary routes (“wheeled take ramps, legged clear stairs”); dynamic re‑planning from operator intent.
    • Tools/products/workflows: Coordination policies that compose multiple candidate paths with team‑level affordance aggregation; language interfaces for task allocation.
    • Assumptions/dependencies: Reliable inter‑robot comms; conflict resolution; scalable inference.
  • Accessibility‑aware routing for public spaces
    • Use cases: Municipal planning tools to simulate accessible routes; cross‑embodiment guidelines for service robots (e.g., ramp coverage vs stairs).
    • Tools/products/workflows: Accessibility policy layers expressed in natural language (preferences), enforced by affordance gating; public maps with terrain affordance tags.
    • Assumptions/dependencies: High‑quality urban terrain data; integration with city GIS; policy alignment with ADA/regional accessibility standards.
  • Planetary exploration and hazardous environments
    • Use cases: Legged/wheeled rovers selecting safe traversals over regolith, rocks, craters; operator steerability under communication delays.
    • Tools/products/workflows: High‑fidelity planetary simulators for affordance training; offline VLM finetuning with synthetic vistas; conservative gating to minimize mission risk.
    • Assumptions/dependencies: Radiation‑hardened compute; limited sensing; uncertain elevation maps; autonomous recovery behaviors.
  • Energy sector inspection (refineries, power plants, offshore)
    • Use cases: Robots navigating complex industrial sites while respecting embodiment constraints and risk‑based preferences (e.g., avoid slick surfaces).
    • Tools/products/workflows: Domain‑specific affordance models validated against industrial hazards; operator prompt templates for safety policies.
    • Assumptions/dependencies: Industrial certification; robust sensing under extreme conditions; incident reporting integration.
  • Consumer AR navigation assistance
    • Use cases: Smartphone apps that visualize candidate pedestrian paths in camera view, modulated by user preferences (“avoid stairs,” “well‑lit routes”).
    • Tools/products/workflows: On‑device VLM path prediction; crowdsourced elevation/obstacle maps; accessibility preferences.
    • Assumptions/dependencies: Reliable pose estimation; privacy safeguards; varied device capabilities.
  • Adaptive compliance and workforce training
    • Use cases: Tools to encode safety and accessibility policies as language preferences that constrain route selection; training operators to steer missions safely.
    • Tools/products/workflows: “Mission Steer” prompt libraries; simulation‑based training that shows how affordance gating avoids accidents; audit logs of rejected plans.
    • Assumptions/dependencies: Organizational adoption; clear risk thresholds; standardized reporting.

Notes on global assumptions and dependencies across applications:

  • Sensor stack: A monocular RGB camera plus a reliable source of local elevation maps (from depth, stereo, lidar, or learned estimation) and calibrated camera intrinsics/extrinsics.
  • Control stack: A low‑level locomotion policy able to track waypoints (velocity or position control) and a receding horizon planner interface.
  • Localization: Goal coordinates must be convertible to image‑frame inputs; fallback behaviors (e.g., rotate to make the goal visible) are assumed.
  • Compute and models: LoRA‑finetuned VLM (e.g., PaliGemma 2 3B) with sufficient real‑time throughput; simulation environments for affordance training with procedurally generated terrains.
  • Safety and policy: Embodiment‑aware affordance gating is crucial for reliability; human oversight, risk thresholds, and domain‑specific safety standards impact deployability.
  • Data and generalization: Performance depends on diverse, heterogeneous training data and the sim‑to‑real fidelity of the affordance models; domain shifts (weather, lighting, clutter) must be addressed.

Glossary

  • 2.5D costmap: A height-annotated grid representation used in navigation that approximates 3D structure while remaining a 2D map. "Second, the intermediate representations, such as 2.5D costmaps, can abstract away valuable information and create performance bottlenecks between modules."
  • Affordance function: A model that predicts whether specific states or path elements are traversable by a given robot policy. "a capability-aware affordance function evaluates and re-ranks the 3D candidate paths to determine which path the robot should execute in the real world based on low-level policy capabilities."
  • Affordance modulation: Using an affordance model to filter or re-rank high-level plan proposals so they align with robot capabilities. "corrected by the affordance function modulation."
  • Binary cross-entropy loss: A loss function for binary classification tasks, here used to train the affordance function from success/failure labels. "by minimizing a standard binary cross-entropy loss"
  • Cellular automata: A procedural generation technique using simple local rules on grids; here used to synthesize irregular terrains in simulation. "we used cellular automata to generate smooth, uneven terrains."
  • Chain-of-thought prompting: A prompting strategy that elicits step-by-step reasoning from an LLM to improve labeling or judgment quality. "use chain-of-thought prompting to ask GPT-5-mini"
  • Cross-embodiment navigation: Navigation methods that transfer across different robot bodies and capability profiles. "general-purpose cross-embodiment and steerable navigation policies."
  • Cumulative affordance: An aggregate traversability score for an entire candidate path, often computed as the minimum per-step affordance along that path. "a cumulative affordance is computed as the minimum affordance score along each path"
  • Embodiment grounding: Aligning high-level plans with the physical capabilities and constraints of a specific robot platform. "decouples semantic planning from embodiment grounding"
  • Elevation map: A grid map encoding terrain height values around the robot, used to assess traversability. "a random elevation map M is spawned"
  • Extrinsic matrix: The camera pose parameters mapping coordinates from world to camera frames. "We use known or estimated intrinsic and extrinsic matrices to project the 3D poses recorded in the datasets into 2D image trajectories."
  • Foundation model: A large pre-trained model whose capabilities scale with diverse data and can be adapted to downstream tasks. "The success of foundation models in other domains has inspired similar efforts in robotics"
  • Hindsight labeling: Creating supervision by labeling trajectories or targets from future states after data collection. "we label trajectories in hindsight using camera poses at a horizon H into the future."
  • Intrinsic matrix: The camera calibration parameters that map 3D camera coordinates to 2D image pixels. "We use known or estimated intrinsic and extrinsic matrices to project the 3D poses recorded in the datasets into 2D image trajectories."
  • Low-rank adapters (LoRA): Parameter-efficient fine-tuning modules inserted into a pre-trained model to adapt it without full retraining. "We use low-rank adapters (LoRAs) since training our models using full-parameter fine-tuning vs LoRA yields similar performance."
  • Monocular RGB image: A single standard color image (as opposed to stereo or depth) used as visual input. "to go from a monocular RGB image I"
  • Odometry: Pose estimates of a robot over time used for labeling trajectories and training navigation models. "contains odometry-labeled data"
  • Out-of-distribution (OOD): Inputs or conditions that differ significantly from the training distribution, often causing model errors. "in OOD settings"
  • Procedurally generated terrains: Synthetic environments created algorithmically to provide diverse training conditions in simulation. "over a large variety of procedurally generated terrains"
  • Receding horizon control: A control scheme that repeatedly plans over a short time horizon and replans as new observations arrive. "in a receding horizon control fashion"
  • Sim-to-real transfer: The process of ensuring policies or models trained in simulation perform well on real hardware. "for proper sim-to-real transfer"
  • Soft sampling: Selecting among candidate plans probabilistically (e.g., via Softmax) to inject controlled stochasticity. "we can sample with soft sampling to allow for some stochasticity in path selection"
  • Traversability estimation: Predicting whether terrain or paths are drivable/walkable for a given robot. "traversability estimation literature"
  • Vision-Language-Action (VLA) model: A model that connects visual input and language to action outputs for embodied tasks. "a hierarchical VLA that decouples semantic planning from embodiment grounding"
  • Vision-language model (VLM): A model that jointly processes images and text for reasoning and prediction. "a high-capacity vision-language model (VLM)"
  • VLM-as-a-judge: Using a vision-language model to evaluate or score outputs (e.g., preference alignment) rather than to act directly. "Using VLM-as-a-judge (ChatGPT 5)"
  • Wave function collapse: A constraint-based algorithm for procedural content generation that assembles patterns consistent with local rules. "using wave function collapse."