Track 3: 2025 AI City Challenge

Updated 15 October 2025
  • Track 3 is a competition focused on fine-grained spatial reasoning using RGB–D data and multimodal fusion of visual, geometric, and linguistic inputs in synthetic warehouse scenarios.
  • The challenge employs a high-fidelity dataset with approximately half a million visual Q&A pairs, enabling precise evaluation of positional relationships, object counting, and distance measurement.
  • Advanced system architectures, including modular tool-augmented APIs and efficient transformers, were utilized to achieve state-of-the-art performance in dynamic spatial query answering.

Track 3 of the 2025 AI City Challenge was dedicated to fine-grained spatial reasoning in dynamic warehouse environments, requiring artificial intelligence systems to interpret RGB–D inputs and answer complex spatial questions that integrate perception, geometry, and natural language understanding. The challenge introduced the Physical AI Spatial Intelligence Warehouse Dataset, which consists of synthetic RGB–D scenes generated in NVIDIA Omniverse and accompanied by approximately half a million visual question–answer pairs. Systems were evaluated on their ability to fuse visual features, geometric context, and linguistic instructions to execute spatial queries such as positional relationships, counting, distance measurement, and multi-choice identification in highly dynamic, structured environments.

1. Dataset Characteristics and Task Definition

Track 3 employed the Physical AI Spatial Intelligence Warehouse Dataset, which provides high-fidelity RGB–D images and VQA pairs simulating dynamic warehouse scenarios. Each spatial question targeted nuanced aspects of the scene:

  • Positional reasoning (e.g., “left/right,” “above/below” relationships),
  • Object counting,
  • Euclidean distance estimation between named items,
  • Multi-choice recognition based on scene layout.

The system’s input comprised RGB images, depth channels, semantic object masks, and associated metadata, alongside a structured natural-language question. Participants were required to return human-readable answers consistent with the dataset’s ground-truth labels.
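
The dataset’s released schema is not reproduced in this summary; the following is a minimal sketch, using assumed (hypothetical) field names, of how a single RGB–D question–answer sample of this kind might be represented and loaded.

```python
# Hypothetical sketch of a single Track 3-style sample; the field names are
# illustrative assumptions, not the dataset's actual schema.
from dataclasses import dataclass
from typing import Dict

import numpy as np


@dataclass
class WarehouseVQASample:
    rgb: np.ndarray                 # (H, W, 3) uint8 colour image
    depth: np.ndarray               # (H, W) float32 depth map, in metres
    masks: Dict[str, np.ndarray]    # object name -> (H, W) boolean mask
    metadata: Dict[str, object]     # e.g. camera intrinsics, scene id
    question: str                   # e.g. "How far is the pallet from the forklift?"
    question_type: str              # "spatial_relation" | "count" | "distance" | "mcq"
    answer: str                     # human-readable ground-truth answer


def load_sample(record: dict) -> WarehouseVQASample:
    """Convert one parsed JSON-like record (assumed layout) into a sample."""
    return WarehouseVQASample(
        rgb=np.asarray(record["rgb"], dtype=np.uint8),
        depth=np.asarray(record["depth"], dtype=np.float32),
        masks={k: np.asarray(v, dtype=bool) for k, v in record["masks"].items()},
        metadata=record.get("metadata", {}),
        question=record["question"],
        question_type=record.get("question_type", "unknown"),
        answer=record["answer"],
    )
```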

2. System Architectures and Fusion Strategies

A major focus was the multimodal fusion of perception, geometry, and language. Competing teams engineered spatially aware pipelines with several distinctive characteristics:

  • Dedicated perception modules extracted visual features from the RGB component using convolutional or transformer-based encoders.
  • Geometric context extraction utilized depth maps and object masks, with geometric encodings or explicit 3D point cloud calculations to enable answers to queries such as “how far is object A from object B?”.
  • Language understanding was performed by fine-tuned LLMs or vision–language models (VLMs) that parsed question semantics and produced structured outputs.
  • Several systems leveraged modular agent frameworks incorporating tool-augmented spatial APIs for multi-turn reasoning. Early-fusion architectures blended RGB and depth channels before downstream reasoning, while late-fusion approaches processed each modality independently before aggregation (see the sketch after this list).
  • Example architectures include UWIPL_ETRI’s modular system and “SmolRGPT,” a parameter-efficient transformer designed for resource-constrained edge deployment.
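
The teams’ concrete fusion operators are not specified in this summary; the sketch below contrasts, with purely hypothetical encoder modules and dimensions, an early-fusion design (RGB and depth concatenated before encoding) with a late-fusion design (each modality encoded separately, then aggregated).

```python
# Illustrative contrast between early and late RGB-D fusion; the encoders and
# feature sizes are hypothetical placeholders, not any team's architecture.
import torch
import torch.nn as nn


class EarlyFusion(nn.Module):
    """Stack depth as a fourth channel and encode the combined tensor once."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb, depth.unsqueeze(1)], dim=1)   # (B, 4, H, W)
        return self.encoder(x)


class LateFusion(nn.Module):
    """Encode RGB and depth separately, then merge the two feature vectors."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()

        def make_encoder(in_channels: int) -> nn.Module:
            return nn.Sequential(
                nn.Conv2d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
            )

        self.rgb_encoder = make_encoder(3)
        self.depth_encoder = make_encoder(1)
        self.merge = nn.Linear(2 * feat_dim, feat_dim)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.rgb_encoder(rgb)                      # (B, feat_dim)
        f_depth = self.depth_encoder(depth.unsqueeze(1))   # (B, feat_dim)
        return self.merge(torch.cat([f_rgb, f_depth], dim=-1))
```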

3. Integration of Perception, Geometry, and Language

Systems relied on three complementary mechanisms for robust spatial reasoning:

  • RGB–D fusion: Systems combined features from RGB images and depth maps, with region-aware modules used to localize, count, or measure spatial relationships.
  • Geometric calculations: Some models included explicit geometric reasoning modules to process 3D point clouds, compute object-centric pose, and derive spatial attributes such as relative orientation and Euclidean distance (see the sketch after this list).
  • Language–visual grounding: Structured prompts guided the LLM towards relevant regions and object references, and output normalization ensured answers adhered to human-interpretable formats.
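
As a concrete illustration of the geometric step, the sketch below back-projects depth pixels under two object masks into 3D camera-frame points using assumed pinhole intrinsics and reports the Euclidean distance between the object centroids; actual systems may use more robust object-centric pose estimation.

```python
# Minimal sketch of depth-based distance estimation between two masked objects.
# The pinhole intrinsics (fx, fy, cx, cy) and binary masks are assumed inputs.
import numpy as np


def backproject(depth: np.ndarray, mask: np.ndarray,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift masked depth pixels to 3D camera-frame points of shape (N, 3)."""
    v, u = np.nonzero(mask & (depth > 0))   # pixel rows / columns inside the mask
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def object_distance(depth: np.ndarray, mask_a: np.ndarray, mask_b: np.ndarray,
                    fx: float, fy: float, cx: float, cy: float) -> float:
    """Euclidean distance between the 3D centroids of two masked objects."""
    pts_a = backproject(depth, mask_a, fx, fy, cx, cy)
    pts_b = backproject(depth, mask_b, fx, fy, cx, cy)
    return float(np.linalg.norm(pts_a.mean(axis=0) - pts_b.mean(axis=0)))
```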

An illustrative algorithmic mechanism was prompt-guided spatial understanding, in which visual region proposals indicated to the LLM which parts of the image were germane to the query.
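
The exact prompting formats are system-specific and not reproduced in this summary; the following hedged sketch shows one plausible way to serialize region proposals (object names, 2D boxes, and depth-derived 3D positions) into a structured prompt so the language model attends to the regions relevant to the query. The template and helper function are assumptions, not a published protocol.

```python
# Hypothetical prompt-guided spatial understanding: region proposals are
# serialized into the prompt so the LLM/VLM reasons over the relevant objects.
from typing import Dict, Tuple


def build_spatial_prompt(question: str,
                         boxes_px: Dict[str, Tuple[int, int, int, int]],
                         positions_m: Dict[str, Tuple[float, float, float]]) -> str:
    """Compose a structured prompt from detected regions and 3D positions."""
    lines = [
        "You are answering a spatial question about a warehouse scene.",
        "Detected objects (2D box in pixels, 3D position in metres):",
    ]
    for name, box in boxes_px.items():
        x, y, z = positions_m[name]
        lines.append(f"- {name}: box={box}, position=({x:.2f}, {y:.2f}, {z:.2f})")
    lines.append(f"Question: {question}")
    lines.append("Answer concisely in the requested format.")
    return "\n".join(lines)


# Example usage with made-up detections:
prompt = build_spatial_prompt(
    "How far is the pallet from the forklift?",
    boxes_px={"pallet": (120, 340, 260, 480), "forklift": (600, 200, 900, 620)},
    positions_m={"pallet": (0.4, 0.1, 3.2), "forklift": (2.1, 0.0, 5.7)},
)
print(prompt)
```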

4. Evaluation Metrics and Leaderboard Results

Evaluation drew on a held-out test split of the dataset’s roughly half-million spatial queries. The top-performing methods are summarized below:

| Team       | Score   | Modality Fusion          |
|------------|---------|--------------------------|
| UWIPL_ETRI | 95.8638 | Modular / tool-augmented |
| SmolRGPT   | High    | Efficient / region-aware |
| Others     | --      | Transformer / VLM-LLM    |

Leaderboard accuracy reflected the ability to combine perceptual, geometric, and linguistic capabilities. The winning team, UWIPL_ETRI, used modular reasoning and tool-augmented spatial APIs to achieve superior fine-grained spatial question answering (Tang et al., 19 Aug 2025).
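
The official scoring protocol is not reproduced in this summary; the sketch below computes a plausible normalized exact-match accuracy in which numeric answers (e.g., distances) match within an assumed relative tolerance and all other answers are compared case-insensitively. The 10% tolerance and matching rules are illustrative assumptions.

```python
# Hedged sketch of a leaderboard-style accuracy computation; the matching
# rules and the 10% numeric tolerance are illustrative assumptions.
from typing import List


def is_correct(pred: str, truth: str, rel_tol: float = 0.10) -> bool:
    """Numeric answers match within a relative tolerance; text answers exactly."""
    try:
        p, t = float(pred), float(truth)
        return abs(p - t) <= rel_tol * max(abs(t), 1e-6)
    except ValueError:
        return pred.strip().lower() == truth.strip().lower()


def accuracy(predictions: List[str], ground_truth: List[str]) -> float:
    """Fraction of questions answered correctly."""
    hits = sum(is_correct(p, t) for p, t in zip(predictions, ground_truth))
    return hits / max(len(ground_truth), 1)
```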

5. Methodological Innovations and Scalability

Key innovations included:

  • Modular architecture supporting easy integration of sensory modalities and API-driven spatial tools.
  • Efficient multimodal fusion strategies balancing performance and low computational overhead, suitable for warehouse deployment.
  • Use of region-aware and geometric encodings, plus explicit depth integration, yielding accurate localization and spatial answers.
  • Prompt engineering, visual-language alignment, and structured-output normalization, collectively enhancing system reliability and user interpretability (see the sketch after this list).
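
As a hedged illustration of structured-output normalization, the helper below maps free-form model text to canonical answer forms (a bare number for counts and distances, canonical tokens for relational answers); the synonym table and rules are illustrative assumptions rather than any team’s implementation.

```python
# Hypothetical structured-output normalization: free-form model text is mapped
# to canonical answer formats before comparison with ground truth.
import re

RELATION_SYNONYMS = {
    "to the left of": "left", "left of": "left",
    "to the right of": "right", "right of": "right",
    "on top of": "above", "higher than": "above",
    "underneath": "below", "lower than": "below",
}


def normalize_answer(raw: str) -> str:
    """Reduce a free-form answer to a canonical string."""
    text = raw.strip().lower()
    for phrase, canonical in RELATION_SYNONYMS.items():
        if phrase in text:
            return canonical
    match = re.search(r"-?\d+(\.\d+)?", text)   # first number, if any
    if match:
        return match.group(0)
    return text


# e.g. normalize_answer("The pallet is roughly 3.20 metres away.") -> "3.20"
```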

Scalability was achieved via efficient model architectures (e.g., sub-billion-parameter transformers) suited to resource-constrained contexts such as industrial robotics and embedded warehouse systems.

6. Implications and Research Directions

Track 3’s results establish new benchmarks in multimodal spatial reasoning for industry-relevant environments. The public release of high-fidelity synthetic warehouse datasets and evaluation protocols is expected to foster reproducibility and future innovation in spatial question answering. The technical advancements will likely influence robotics, automated warehouse management, and vision-language understanding in structured, dynamic spaces.

A plausible implication is the emergence of hybrid systems, integrating perception, geometry, and language into real-time multimodal agents capable of complex spatial reasoning—stimulating the next generation of AI systems for industrial and robotic use cases.

7. Summary

Track 3 of the 2025 AI City Challenge marked a significant advance in the integration of RGB–D visual data, geometric computation, and natural language reasoning for fine-grained spatial question answering in synthetic warehouse environments. The competition established robust methodologies leveraging tool-augmented APIs, region-aware and efficient transformer architectures, and modular agent frameworks, yielding new state-of-the-art benchmarks in spatial intelligence (Tang et al., 19 Aug 2025). These outcomes will drive further research and enable practical deployments of multimodal spatial reasoning systems in industrial automation.
