
SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding (2508.20758v1)

Published 28 Aug 2025 in cs.CV and cs.AI

Abstract: 3D Visual Grounding (3DVG) aims to localize objects in 3D scenes using natural language descriptions. Although supervised methods achieve higher accuracy in constrained settings, zero-shot 3DVG holds greater promise for real-world applications since it eliminates scene-specific training requirements. However, existing zero-shot methods face challenges of spatially limited reasoning due to reliance on single-view localization, as well as contextual omissions and detail degradation. To address these issues, we propose SeqVLM, a novel zero-shot 3DVG framework that leverages multi-view real-world scene images with spatial information for target object reasoning. Specifically, SeqVLM first generates 3D instance proposals via a 3D semantic segmentation network and refines them through semantic filtering, retaining only semantically relevant candidates. A proposal-guided multi-view projection strategy then projects these candidate proposals onto real scene image sequences, preserving spatial relationships and contextual details in the conversion from 3D point clouds to images. Furthermore, to mitigate VLM computational overload, we implement a dynamic scheduling mechanism that iteratively processes sequence-query prompts, leveraging the VLM's cross-modal reasoning capabilities to identify textually specified objects. Experiments on the ScanRefer and Nr3D benchmarks demonstrate state-of-the-art performance, achieving Acc@0.25 scores of 55.6% and 53.2%, surpassing previous zero-shot methods by 4.0% and 5.2%, respectively, advancing 3DVG toward greater generalization and real-world applicability. The code is available at https://github.com/JiawLin/SeqVLM.



Summary

  • The paper presents a proposal-guided framework that blends 3D segmentation with multi-view image sequences to enhance zero-shot 3D visual grounding.
  • It introduces an iterative VLM reasoning module to dynamically select the best candidate proposals while addressing spatial ambiguities and input constraints.
  • Experimental results on ScanRefer and Nr3D benchmarks demonstrate significant accuracy gains, rivaling fully supervised approaches.

SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding

Introduction

SeqVLM introduces a zero-shot 3D visual grounding framework that leverages proposal-guided multi-view image sequences and vision-language models (VLMs) to localize objects in 3D scenes based on natural language queries. The method addresses the limitations of prior zero-shot approaches, which suffer from spatial reasoning constraints, contextual omissions, and detail degradation due to reliance on single-view renderings and domain gaps between synthetic and real images. SeqVLM integrates 3D semantic segmentation, multi-view real-world image projection, and iterative VLM-based reasoning to achieve robust cross-modal alignment and precise object localization without scene-specific training.

Methodology

Proposal Selection Module

SeqVLM employs a 3D semantic segmentation network (e.g., Mask3D) to generate instance proposals from the input point cloud. Semantic filtering is performed by embedding both the proposal categories and the LLM-parsed target category using a text encoder (CLIP-ViT-Base-Patch16). Cosine similarity is computed to retain only proposals semantically aligned with the query, reducing the candidate set and computational complexity for subsequent VLM reasoning. This module bridges 3D segmentation with language semantics, providing a critical performance boost.
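
The filtering step can be illustrated with a short sketch. The snippet below assumes the Hugging Face transformers implementation of CLIP-ViT-Base-Patch16; the function names, the similarity threshold, and the use of plain category names as text inputs are illustrative assumptions, not the authors' exact implementation.

import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")

def embed(texts):
    # Encode category names into L2-normalized CLIP text embeddings.
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        embeds = text_encoder(**inputs).text_embeds
    return torch.nn.functional.normalize(embeds, dim=-1)

def filter_proposals(proposal_categories, target_category, threshold=0.8):
    # Keep only proposals whose predicted category is semantically close to the
    # LLM-parsed target category (the threshold value is an assumption).
    proposal_embeds = embed(proposal_categories)
    target_embed = embed([target_category])
    similarity = (proposal_embeds @ target_embed.T).squeeze(-1)  # cosine similarity
    return [i for i, s in enumerate(similarity.tolist()) if s >= threshold]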

Proposal-Guided Multi-View Projection

To adapt 3D proposals for VLM processing, SeqVLM projects each candidate onto multiple real-world scene images, selecting the top n views with maximal projected area. For each view, the proposal's 3D coordinates are transformed to camera coordinates and mapped to 2D pixel locations using intrinsic and extrinsic matrices. Depth consistency checks ensure valid projections. The bounding box for each proposal is expanded and annotated, and the selected views are vertically concatenated to form an image sequence that preserves spatial relationships and contextual details. This multi-view fusion mitigates occlusion and viewpoint ambiguity, enhancing the VLM's cross-modal reasoning.
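
The per-view projection and area scoring can be sketched as follows, assuming each frame comes with intrinsic and extrinsic matrices and a sensor depth map; the matrix conventions, tolerance value, and function names are illustrative assumptions, not the authors' exact implementation.

import numpy as np

def project_proposal(points_world, K, world_to_cam, depth_map, depth_tol=0.1):
    # points_world: (N, 3) proposal points; K: (3, 3) intrinsics;
    # world_to_cam: (4, 4) extrinsics; depth_map: (H, W) sensor depth in meters.
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (world_to_cam @ pts_h.T).T[:, :3]       # camera coordinates
    in_front = pts_cam[:, 2] > 0                      # drop points behind the camera
    pix = (K @ pts_cam[in_front].T).T
    uv = pix[:, :2] / pix[:, 2:3]                     # perspective divide to pixels
    H, W = depth_map.shape
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    u, v, z = u[inside], v[inside], pts_cam[in_front][inside, 2]
    # Depth consistency check: keep pixels whose projected depth matches the sensor depth.
    visible = np.abs(depth_map[v, u] - z) < depth_tol
    if not visible.any():
        return None, 0.0
    u, v = u[visible], v[visible]
    bbox = (u.min(), v.min(), u.max(), v.max())       # 2D box to expand and annotate
    area = (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # score used to pick the top-n views
    return bbox, float(area)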

VLM Iterative Reasoning Module

SeqVLM introduces an iterative reasoning mechanism to address VLM input length constraints and computational overload. Candidate image sequences are batched (default L = 4), and each batch is paired with the textual query in a prompt template. The VLM agent selects the best-matching candidate per batch, and the process iterates until a single candidate remains. This dynamic scheduling optimizes both inference efficiency and localization accuracy, circumventing VLM limitations in long-sequence reasoning.

Pseudocode for Iterative Reasoning

def predict(image_sequences, query, batch_size):
    # Each candidate is one proposal's multi-view image sequence; the loop
    # keeps one winner per batch until a single candidate remains.
    candidates = image_sequences
    while len(candidates) > 1:
        batches = [candidates[i:i + batch_size]
                   for i in range(0, len(candidates), batch_size)]
        next_candidates = []
        for batch in batches:
            # Pair the batch of image sequences with the textual query.
            prompt = construct_prompt(query, batch)
            # The VLM returns the index of the best-matching candidate in the
            # batch, or None if it judges that no candidate matches the query.
            index = vlm_select(prompt)
            if index is not None and 0 <= index < len(batch):
                next_candidates.append(batch[index])
        candidates = next_candidates
    # Report the surviving proposal's id, used to look up its 3D bounding box.
    return candidates[0].index if candidates else None
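
The helpers construct_prompt and vlm_select are left abstract above. A hypothetical sketch is given below; the prompt wording, the index-parsing convention, the query_vlm placeholder, and the .image attribute on candidates are assumptions rather than the paper's actual prompt template or API.

import re

def query_vlm(text, images):
    # Placeholder for the actual VLM client call (e.g., GPT-4, Qwen-vl-max,
    # Doubao-1.5-vision-pro); it should return the model's text response.
    raise NotImplementedError("plug in the VLM client of choice here")

def construct_prompt(query, batch):
    # Number the candidates so the VLM can answer with a single index; each
    # candidate is assumed to carry its concatenated image sequence in .image.
    text = (
        f"Query: {query}\n"
        f"You are given {len(batch)} candidate image sequences, numbered from 0.\n"
        "Reply with the number of the candidate that best matches the query, "
        "or 'none' if no candidate matches."
    )
    return {"text": text, "images": [c.image for c in batch]}

def vlm_select(prompt):
    # Parse the first integer in the VLM's reply; None means "no match".
    response = query_vlm(prompt["text"], prompt["images"])
    match = re.search(r"\d+", response)
    return int(match.group()) if match else None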

End-to-End Pipeline

The final selected proposal's bounding box is retrieved from the Object Profile Table, completing the zero-shot 3D visual grounding pipeline. The framework is agnostic to the choice of VLM, with demonstrated transferability across GPT-4, Qwen-vl-max, and Doubao-1.5-vision-pro.
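
The final lookup can be pictured as a simple table indexed by proposal id; the sketch below is hypothetical, and the entries are illustrative placeholders rather than values from the paper.

# Assumed structure: the Object Profile Table maps each retained proposal id
# to its category and 3D bounding box (center and extents are placeholders).
object_profile_table = {
    0: {"category": "chair", "bbox_3d": [0.0, 0.0, 0.0, 0.6, 0.6, 0.9]},
    1: {"category": "chair", "bbox_3d": [1.5, 0.2, 0.0, 0.6, 0.6, 0.9]},
}

def retrieve_bbox(winning_id, table):
    # The id returned by the iterative reasoning loop selects the table entry
    # whose 3D box is reported as the grounding result.
    return table[winning_id]["bbox_3d"] if winning_id is not None else None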

Experimental Results

SeqVLM achieves state-of-the-art performance on the ScanRefer and Nr3D benchmarks, surpassing previous zero-shot methods by absolute margins of 4.0% and 5.2% in Acc@0.25, respectively. On ScanRefer, SeqVLM attains 55.6% Acc@0.25 and 49.6% Acc@0.5, rivaling fully supervised approaches. On Nr3D, SeqVLM achieves 53.2% overall accuracy, with robust gains on both the easy and hard splits as well as in view-dependent and view-independent scenarios. Ablation studies confirm the indispensability of each module, with the Proposal Selection Module contributing the largest performance leap.

VLM Selection and Cost Analysis

Doubao-1.5-vision-pro yields the highest accuracy (49.6% Acc@0.5) at increased computational cost, while Qwen-vl-max offers a favorable trade-off between accuracy and cost. Cross-method comparisons under controlled VLM settings demonstrate that SeqVLM's architectural innovations, rather than VLM capacity alone, drive the observed performance gains.

Hyper-parameter Sensitivity

Optimal performance is achieved with a VLM batch size threshold of L = 4 and a multi-view frame number of n = 5. Smaller batch sizes limit candidate contrast, while larger sizes overload the VLM. Excessive views introduce noise, while too few restrict spatial disambiguation.

Implications and Future Directions

SeqVLM advances zero-shot 3D visual grounding by integrating geometric reasoning, multi-view contextual fusion, and scalable VLM-based inference. The framework's ability to match supervised performance without task-specific training has significant implications for real-world deployment in robotics, autonomous driving, and AR/VR systems, where annotation costs and open-vocabulary requirements are prohibitive. The modular design enables adaptation to evolving VLM architectures and sensor modalities.

Future research may explore:

  • End-to-end joint optimization of segmentation, projection, and reasoning modules
  • Incorporation of temporal information for dynamic scene understanding
  • Extension to outdoor and large-scale environments with heterogeneous sensor inputs
  • Efficient model distillation and compression for resource-constrained deployment

Conclusion

SeqVLM presents a robust, transferable framework for zero-shot 3D visual grounding, leveraging proposal-guided multi-view image sequences and iterative VLM reasoning. The method achieves state-of-the-art accuracy on standard benchmarks, demonstrating strong generalization and practical applicability. Its modular architecture and empirical validation establish SeqVLM as a foundational approach for cross-modal 3D scene understanding in open-world settings.
