IndustryNav Benchmark
- IndustryNav is a benchmark that assesses embodied spatial reasoning in dynamic, realistic industrial environments with high-fidelity warehouse simulations.
- It employs Unity3D-modeled scenarios featuring moving obstacles and human actors, using metrics like success ratio, collision ratio, and distance ratio for performance evaluation.
- The benchmark exposes current VLLM limitations in proactive planning and safety awareness, urging research into more robust, anticipatory navigation methods.
IndustryNav is a benchmark for evaluating the spatial reasoning and navigation capabilities of embodied agents—specifically Visual LLMs (VLLMs)—in dynamic industrial environments. Designed to fill a gap in embodied AI research, IndustryNav departs from household-centric, static scene benchmarks by introducing realistic, high-complexity warehouse scenarios with dynamic objects and human actors. The benchmark’s core focus is on holistic assessment of perception, planning, and action, with explicit evaluation of safe behavior and robust navigation under dynamic, uncertain conditions (Li et al., 21 Nov 2025).
1. Benchmark Motivation and Design Principles
IndustryNav was developed to address the inadequacies of existing embodied benchmarks, which predominantly target static environments with passive reasoning tasks and isolated capability measurement. In intralogistics and warehouse settings, continuous movement of assets (forklifts, robots) and personnel necessitates persistent risk assessment, real-time scene understanding, and dynamic path re-planning. The benchmark aims to:
- Facilitate evaluation of active, embodied spatial reasoning (perception + planning + action)
- Introduce safety-centric performance metrics—collision and warning rates—to reflect real-world operational requirements
- Bridge the current gap between VLLM passive scene understanding and the demands of robust, interactive navigation in realistic, dynamic industrial domains (Li et al., 21 Nov 2025)
2. Industrial Environment Construction
IndustryNav consists of twelve distinct, high-fidelity warehouse scenarios modeled by domain experts in Unity3D using the High Definition Render Pipeline (HDRP). The environmental diversity spans sparse layouts to densely packed zones with realistic OSH-conforming obstacles (shipping containers, conveyor belts, dynamic storage racks) and operational features such as lighting, beams, and various handling machinery. Dynamics are encoded via:
- Pre-scripted trajectories for mobile obstacles (forklifts, cargo robots) and human workers, utilizing Unity’s Splines system
- Carefully tuned collider geometries for all interactable assets, enabling physically plausible collision detection
- Sensors on the agent: an egocentric 1024×1024 first-person RGB camera, global odometry ((x, y) position and heading θ, updated per timestep), and an auxiliary bird's-eye view used for ablations but not available to the agent policy (Li et al., 21 Nov 2025); a sketch of the agent-facing sensor bundle follows this list
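On the agent side, this sensor suite amounts to a small per-timestep bundle. A minimal sketch follows; the field names are illustrative rather than the benchmark's actual interface.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: np.ndarray                # egocentric first-person frame, 1024x1024x3
    position: tuple[float, float]  # global odometry (x, y), updated per timestep
    heading: float                 # heading theta in degrees
    # The bird's-eye view is recorded for ablation studies only and is
    # deliberately NOT exposed to the agent policy.
```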
3. Task Formulation and Agent Interface
The primary IndustryNav task is active PointGoal navigation:
- Each scenario defines four start-goal pairs of varying difficulty; each navigation episode allows up to 70 discrete timesteps
- Success is achieved if the agent reaches within δ = 20 px (0.6 m) of the target
- Agent inputs per timestep: egocentric camera image, global odometry (x, y, θ), target specification, scalar distance to goal, and a rolling action–state history (last 10 steps: (position, heading, action, distance))
- The discrete action set: turn_left (θ ← (θ − 90) mod 360), turn_right (θ ← (θ + 90) mod 360), forward (move Δ = 34 px along θ), and stop (issued once within δ of the goal); a minimal sketch of these dynamics follows this list
- A cardinal-direction mapping aligns θ to {West, North, East, South} at canonical angles (Li et al., 21 Nov 2025)
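The per-step state update implied by this action set is compact enough to sketch directly. The following is a minimal Python sketch, assuming a 2D pixel-coordinate frame with heading 0° pointing along +y; the constants (Δ = 34 px, δ = 20 px) come from the benchmark specification, while names such as `AgentState` and `step` are illustrative rather than the benchmark's own code.

```python
import math
from dataclasses import dataclass, replace

DELTA_FWD = 34  # forward step length in px (from the benchmark spec)
GOAL_EPS = 20   # success radius delta in px (~0.6 m)

@dataclass(frozen=True)
class AgentState:
    x: float
    y: float
    heading: int  # degrees; multiples of 90 map to cardinal directions

def step(state: AgentState, action: str) -> AgentState:
    """Apply one discrete action; 'stop' leaves the state unchanged."""
    if action == "turn_left":
        return replace(state, heading=(state.heading - 90) % 360)
    if action == "turn_right":
        return replace(state, heading=(state.heading + 90) % 360)
    if action == "forward":
        # Assumed convention: heading 0 deg points along +y ("North").
        rad = math.radians(state.heading)
        return replace(state,
                       x=state.x + DELTA_FWD * math.sin(rad),
                       y=state.y + DELTA_FWD * math.cos(rad))
    return state  # "stop"

def at_goal(state: AgentState, goal: tuple[float, float]) -> bool:
    """Success check: final position within GOAL_EPS of the target."""
    return math.hypot(state.x - goal[0], state.y - goal[1]) <= GOAL_EPS
```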
4. Evaluation Pipeline and Metrics
IndustryNav utilizes a multi-stage evaluation procedure:
- Perception: Agents process egocentric RGB frames for immediate obstacle recognition.
- Localization: Odometry and local history support self-positioning and loop-avoidance.
- Global Planning: Action sequences are computed to reduce the global distance-to-goal.
- Obstacle Avoidance: Agents are expected to leverage local perception to forgo collision-inducing moves.
- Control Output: Actions are emitted in JSON, accompanied by generated natural-language reasoning (an illustrative parsing sketch follows this list).
- Loop Avoidance: Prompts explicitly discourage repeating failed, cyclical moves.
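The paper's exact JSON schema is not reproduced here, so the message format below is a hypothetical illustration of how a controller might parse and validate a model's structured action together with its free-text reasoning.

```python
import json

VALID_ACTIONS = {"turn_left", "turn_right", "forward", "stop"}

def parse_action(raw: str) -> tuple[str, str]:
    """Parse a model reply of the (assumed) form
    {"reasoning": "...", "action": "forward"} and validate the action."""
    msg = json.loads(raw)
    action = msg.get("action", "")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    return action, msg.get("reasoning", "")

# A well-formed reply under this hypothetical schema:
raw = '{"reasoning": "Forklift crossing ahead; turning to keep a buffer.", "action": "turn_left"}'
action, why = parse_action(raw)
assert action == "turn_left"
```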
Key performance metrics:
- Success Ratio (SR): fraction of episodes whose final position lies within $\delta$ of the goal, $\mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\big[\lVert p^{(i)}_{T_i} - g^{(i)} \rVert \le \delta\big]$
- Distance Ratio (DR): mean normalized progress, $\mathrm{DR} = \frac{1}{N}\sum_{i=1}^{N} \frac{d^{(i)}_0 - d^{(i)}_{T_i}}{d^{(i)}_0}$, where $d^{(i)}_0$ and $d^{(i)}_{T_i}$ are episode $i$'s initial and final distances to the goal
- Average Steps (AS): $\mathrm{AS} = \frac{1}{N}\sum_{i=1}^{N} T_i$, an efficiency proxy (lower is better)
- Collision Ratio (CR): $\mathrm{CR} = \frac{1}{N}\sum_{i=1}^{N} \frac{C_i}{F_i}$, where $C_i$ is the number of "forward" collisions in episode $i$ and $F_i$ the total forward actions
- Warning Ratio (WR): $\mathrm{WR} = \frac{1}{N}\sum_{i=1}^{N} \frac{W_i}{T_i}$, where $W_i$ is the number of "warning frames" with forward-ROI depth below 1 m and $T_i$ the steps in episode $i$
- Comparison to SPL: DR substitutes for Success weighted by Path Length (SPL), enabling more granular efficiency assessment (Li et al., 21 Nov 2025); a worked computation of these metrics is sketched below
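Under these definitions, evaluation reduces to per-episode bookkeeping. The sketch below assumes episode logs carrying the quantities named above; averaging per-episode ratios (rather than pooling sums across episodes) for CR and WR is an assumption, as is the exact log layout.

```python
from dataclasses import dataclass

@dataclass
class EpisodeLog:
    success: bool        # final position within delta of the goal
    d0: float            # initial distance to goal
    dT: float            # final distance to goal
    steps: int           # T_i: total timesteps used
    fwd_actions: int     # F_i: number of "forward" actions issued
    fwd_collisions: int  # C_i: collisions while moving forward
    warning_frames: int  # W_i: frames with forward-ROI depth < 1 m

def metrics(eps: list[EpisodeLog]) -> dict[str, float]:
    """Aggregate SR, DR, AS, CR, WR over a list of episode logs."""
    n = len(eps)
    return {
        "SR": sum(e.success for e in eps) / n,
        "DR": sum((e.d0 - e.dT) / e.d0 for e in eps) / n,
        "AS": sum(e.steps for e in eps) / n,
        # Guard against episodes with no forward actions.
        "CR": sum(e.fwd_collisions / max(e.fwd_actions, 1) for e in eps) / n,
        "WR": sum(e.warning_frames / e.steps for e in eps) / n,
    }
```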
5. Comparative Agent Study and Results
IndustryNav evaluates nine state-of-the-art VLLMs: five closed-source (GPT-4o, GPT-5-mini, Claude-Haiku-4.5, Claude-Sonnet-4.5, Gemini-2.5-flash) and four open-source (Nemotron-nano-12B, LLaMA-4-Scout, Qwen3-VL-30B, Qwen3-VL-8B). All models are tested in a zero-shot regime via the OpenRouter API without task-specific fine-tuning. Experimental conditions:
- Each agent: 70-step episodes, δ = 20 px success radius, 1 m warning threshold, history length 10
- Commodity Windows/macOS platforms with consumer-grade GPUs; a minimal query sketch follows this list
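OpenRouter exposes an OpenAI-compatible chat-completions endpoint, so a single zero-shot navigation query can be sketched as follows; the prompt text is a placeholder, and the model slug is just one of the benchmarked models.

```python
import base64
from openai import OpenAI  # OpenRouter serves an OpenAI-compatible API

client = OpenAI(base_url="https://openrouter.ai/api/v1",
                api_key="<OPENROUTER_API_KEY>")

# Encode the current egocentric frame for the vision input.
with open("egocentric_frame.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="openai/gpt-4o",  # e.g., one of the nine benchmarked models
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "You are a warehouse navigation agent. Given this frame, "
                     "your odometry, and the goal, reply with JSON: "
                     '{"reasoning": "...", "action": '
                     '"turn_left|turn_right|forward|stop"}.'},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```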
Performance averages:
| Model | SR (%) | DR (%) | AS | CR (%) | WR (%) |
|---|---|---|---|---|---|
| GPT-4o | 21.53 | 49.41 | 66.76 | 7.86 | 13.45 |
| GPT-5-mini | 54.17 | 81.90 | 49.91 | 16.89 | 24.13 |
| Claude-Haiku-4.5 | 61.81 | 82.87 | 46.80 | 32.18 | 31.57 |
| Claude-Sonnet-4.5 | 61.81 | 86.26 | 47.33 | 27.68 | 31.52 |
| Gemini-2.5-flash | 65.28 | 84.49 | 45.95 | 32.14 | 37.16 |
| Nemotron-nano-12B | 55.56 | 80.48 | 50.69 | 31.73 | 36.54 |
| LLaMA-4-Scout | 15.28 | 56.40 | 61.53 | 24.38 | 35.06 |
| Qwen3-VL-30B | 6.25 | 26.20 | 66.70 | 18.97 | 26.28 |
| Qwen3-VL-8B | 4.86 | 27.05 | 67.22 | 27.82 | 25.70 |
Consistent trends:
- Closed-source models outperform open-source on SR, DR, and AS
- Nemotron-nano-12B approaches leading closed-source models in SR/DR
- All models show high CR/WR, indicating broad deficiencies in safe behavior (Li et al., 21 Nov 2025)
6. Failure Modes and Qualitative Analysis
Observed navigation issues include:
- Global replanning failures: Agents ignore blocked direct paths and loop futilely around obstructions.
- Exploration deficiencies: Lack of proactive exploration, frequent repetition of failed actions, and insufficient memory utilization render agents reactive rather than anticipatory.
- Collision avoidance challenges: Agents misestimate distances and execute unsafe advances, leading to elevated collision and warning ratios.
- Safety-awareness gaps: Delayed or absent evasive actions when interacting with dynamic obstacles, particularly moving humans and vehicles.
Path planning is dominated by greedy heuristics that fail to robustly incorporate dynamic forecasting or map-based cues; ablation studies that supplied a bird's-eye input showed only limited improvement. Agents were unable to leverage top-down representations to materially enhance performance, underscoring the limitations of VLLM spatial cognition in this context.
7. Implications and Directions for Embodied Research
IndustryNav exposes the limitations of current VLLMs in executing long-horizon navigation under dynamic, real-world constraints. Safety-oriented reasoning, such as proactive collision prediction and maintenance of buffer zones, remains inadequate. Proposed future research avenues:
- Proactive agents: Integration of reinforcement learning for lookahead planning (e.g., Monte Carlo rollouts) to foster non-greedy, anticipatory path search
- Safety-aware navigation: Explicit depth or point-cloud modality fusion and local avoidance policy learning
- Resource-constrained deployment: Architectural innovations for lightweight, on-device spatial reasoning
- Continual learning: Policy adaptation to evolving layouts and traffic dynamics at runtime
IndustryNav establishes a challenging, realism-focused testbed and prioritizes development of safe, robust, and proactive embodied spatial reasoning methods, directly targeting the operational requirements of next-generation industrial robotics and warehouse automation (Li et al., 21 Nov 2025).