- The paper introduces SORT3D, a zero-shot 3D visual grounding system that combines 2D vision-language models, large language models, and a spatial reasoning toolbox to interpret natural language in complex 3D environments without 3D text-data training.
- Evaluated on the ReferIt3D and VLA-3D benchmarks, SORT3D achieved competitive performance, significantly outperforming other models on challenging view-dependent and complex spatial relations and demonstrating the benefit of integrating rich 2D attributes.
- SORT3D was successfully deployed on an autonomous robot for object-goal navigation in real-world, unseen environments, showcasing its potential for practical indoor robotic applications due to its ability to generalize and handle complex linguistic references.
SORT3D: Advancements in Zero-Shot 3D Visual Grounding with LLMs
The paper "SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using LLMs" presents a novel approach addressing the challenges of 3D visual grounding (3DVG), a key component for enabling natural language interactions in robots operating in human-centered spaces. The diversity of indoor scenes, numerous fine-grained object classes, and complex natural language references pose significant challenges in 3DVG. Moreover, the scarcity of extensive natural language training data in the 3D domain accentuates the need for methods capable of learning from minimal data and generalizing in zero-shot scenarios.
Core Methodology
SORT3D utilizes a hybrid system combining 2D vision-language models (VLMs) and LLMs with a heuristics-based spatial reasoning toolbox. The approach is notable for requiring no text-to-3D training data and for operating zero-shot. The framework integrates rich 2D semantic attributes to improve object grounding, leveraging open-vocabulary captions generated by VLMs for fine-grained object descriptions. These captions play a crucial role in distinguishing objects by detailed attributes such as color, material, and shape, which pure 3D perception models often overlook.
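The attribute-grounding idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `SceneObject` record and `matches_query` helper are hypothetical stand-ins for objects annotated with VLM-derived captions and for the LLM's attribute extraction.

```python
from dataclasses import dataclass

@dataclass
class SceneObject:
    """A 3D object augmented with attributes parsed from a 2D VLM caption."""
    name: str
    color: str
    material: str

def matches_query(obj: SceneObject, wanted: dict) -> bool:
    """True if the object satisfies every attribute the query mentions."""
    return all(getattr(obj, key) == value for key, value in wanted.items())

objects = [
    SceneObject("chair", "red", "wood"),
    SceneObject("chair", "black", "leather"),
]

# "the red wooden chair" -> attributes extracted from the query by the LLM parser
query = {"name": "chair", "color": "red", "material": "wood"}
candidates = [obj for obj in objects if matches_query(obj, query)]
```

With 2D-derived attributes attached to each object, distinguishing two chairs becomes a simple filter rather than a pure 3D perception problem.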
The system comprises four primary components:
- 2D-Enhanced Object Perception: Using VLMs to generate detailed captions for objects, augmenting 3D data with rich semantic information from 2D images.
- Relevant Object Filtering: Employing LLM-based parsing to extract pertinent objects and attributes from language queries, thereby reducing the complexity faced by downstream modules.
- Spatial Reasoning Toolbox: Harnessing structured logic to compensate for LLMs' limitations in spatial understanding. This modular toolbox enhances LLM-based reasoning, mapping complex spatial relations into sequential steps easily handled by LLMs.
- Action Parsing for Navigation: Facilitating downstream applications via code generation that translates language-based navigation commands into executable instructions for robots.
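A heuristics-based spatial toolbox of the kind described above might look like the following sketch. The function names, the centroid-based geometry, and the viewer-frame convention (x right, y forward) are illustrative assumptions, not the paper's actual toolbox.

```python
import numpy as np

def closest_to(anchor: np.ndarray, candidates: np.ndarray) -> int:
    """Index of the candidate centroid nearest the anchor centroid."""
    return int(np.argmin(np.linalg.norm(candidates - anchor, axis=1)))

def left_of(viewpoint: np.ndarray, anchor: np.ndarray, candidate: np.ndarray) -> bool:
    """View-dependent 'left of' via the 2D cross product in the ground plane,
    assuming x points right and y points forward from the viewer."""
    a = anchor[:2] - viewpoint[:2]
    c = candidate[:2] - viewpoint[:2]
    return a[0] * c[1] - a[1] * c[0] > 0

# Centroids of two candidate chairs and an anchor table (hypothetical coordinates)
chairs = np.array([[1.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
table = np.array([3.5, 0.0, 0.0])

# "the chair closest to the table" -> one deterministic tool call
target = closest_to(table, chairs)  # index 1
```

Exposing relations as deterministic tools lets the LLM decompose a statement like "the chair to the left of the table closest to the door" into a sequence of simple calls, instead of reasoning about geometry directly.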
Experimental Results
SORT3D was rigorously evaluated on ReferIt3D and VLA-3D benchmarks, focusing on both synthetic and natural language referential tasks. The method achieved competitive performance with state-of-the-art techniques, specifically excelling in view-dependent and complex spatial relations—areas where zero-shot approaches typically struggle. The reported accuracies demonstrate significant improvement over other models, particularly in handling hard statements in VLA-3D, signifying the efficacy of integrating 2D captions for rich attribute grounding.
Real-World Validation
Furthermore, SORT3D was deployed in a real-world setup on an autonomous robot, successfully performing object-goal navigation tasks. This deployment exemplifies the system's ability to generalize across dynamic, previously unseen environments, highlighting its potential for practical applications in indoor autonomous agents.
Implications and Future Directions
SORT3D contributes significantly to advancing zero-shot 3DVG by demonstrating the power of combining 2D semantic data with sophisticated language reasoning tools. The approach addresses several limitations of purely 3D-focused methods and opens pathways for deploying robots in linguistically and perceptually complex environments without reliance on extensive 3D training datasets.
Future research could expand the spatial reasoning toolbox's functionality, improve its algorithmic efficiency, and integrate smaller, locally deployable models to reduce dependence on internet-connected APIs in real-world scenarios. Additionally, larger-scale evaluation datasets with diverse natural language utterances covering a broader range of object attributes and spatial relations could deepen understanding of 3D referential challenges and facilitate the development of improved models.