- The paper presents a proposal-guided framework that blends 3D segmentation with multi-view image sequences to enhance zero-shot 3D visual grounding.
- It introduces an iterative VLM reasoning module to dynamically select the best candidate proposals while addressing spatial ambiguities and input constraints.
- Experimental results on ScanRefer and Nr3D benchmarks demonstrate significant accuracy gains, rivaling fully supervised approaches.
SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
Introduction
SeqVLM introduces a zero-shot 3D visual grounding framework that leverages proposal-guided multi-view image sequences and vision-language models (VLMs) to localize objects in 3D scenes based on natural language queries. The method addresses the limitations of prior zero-shot approaches, which suffer from spatial reasoning constraints, contextual omissions, and detail degradation due to reliance on single-view renderings and domain gaps between synthetic and real images. SeqVLM integrates 3D semantic segmentation, multi-view real-world image projection, and iterative VLM-based reasoning to achieve robust cross-modal alignment and precise object localization without scene-specific training.
Methodology
Proposal Selection Module
SeqVLM employs a 3D semantic segmentation network (e.g., Mask3D) to generate instance proposals from the input point cloud. Semantic filtering is performed by embedding both the proposal categories and the LLM-parsed target category using a text encoder (CLIP-ViT-Base-Patch16). Cosine similarity is computed to retain only proposals semantically aligned with the query, reducing the candidate set and computational complexity for subsequent VLM reasoning. This module bridges 3D segmentation with language semantics, providing a critical performance boost.
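To make the filtering step concrete, the sketch below shows one way it might be implemented with the Hugging Face transformers CLIP text encoder (openai/clip-vit-base-patch16). The data layout and the similarity threshold are illustrative assumptions, not values taken from the paper.

# Minimal sketch of semantic filtering with the CLIP text encoder.
# The 0.8 threshold and the input structures are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def filter_proposals(proposal_categories, target_category, threshold=0.8):
    """Keep proposals whose category embedding is close to the LLM-parsed target category."""
    texts = proposal_categories + [target_category]
    inputs = processor(text=texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        embeds = model.get_text_features(**inputs)
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)   # unit-normalize embeddings
    sims = embeds[:-1] @ embeds[-1]                        # cosine similarity to the target
    return [i for i, s in enumerate(sims.tolist()) if s >= threshold]

# Example: retain only chair-like proposals when the parsed target category is "chair".
kept = filter_proposals(["chair", "table", "armchair", "lamp"], "chair")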
Proposal-Guided Multi-View Projection
To adapt 3D proposals for VLM processing, SeqVLM projects each candidate onto multiple real-world scene images, selecting the top n views with maximal projected area. For each view, the proposal's 3D coordinates are transformed to camera coordinates and mapped to 2D pixel locations using intrinsic and extrinsic matrices. Depth consistency checks ensure valid projections. The bounding box for each proposal is expanded and annotated, and the selected views are vertically concatenated to form an image sequence that preserves spatial relationships and contextual details. This multi-view fusion mitigates occlusion and viewpoint ambiguity, enhancing VLM's cross-modal reasoning.
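A condensed NumPy sketch of the per-view projection step follows, written under common conventions (a 4x4 world-to-camera extrinsic and a 3x3 pinhole intrinsic). The variable names and the 5 cm depth tolerance are assumptions rather than the paper's exact implementation; the count of valid projected points serves as a projected-area proxy for ranking views and picking the top n.

import numpy as np

def project_proposal(points_world, extrinsic_w2c, intrinsic, depth_map, depth_tol=0.05):
    """Project a proposal's 3D points into one camera view and return valid pixel locations.

    points_world: (N, 3) proposal points in the world frame
    extrinsic_w2c: (4, 4) world-to-camera transform
    intrinsic: (3, 3) pinhole camera matrix
    depth_map: (H, W) sensor depth for this view, used for the consistency check
    """
    # World -> camera coordinates via a homogeneous transform.
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (extrinsic_w2c @ pts_h.T).T[:, :3]
    z = pts_cam[:, 2]
    in_front = z > 0

    # Camera -> pixel coordinates.
    uv = (intrinsic @ pts_cam.T).T
    uv = uv[:, :2] / np.clip(z[:, None], 1e-6, None)

    h, w = depth_map.shape
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
    in_bounds = (u >= 0) & (u < w) & (v >= 0) & (v < h) & in_front

    # Depth-consistency check: keep points whose projected depth matches the sensor depth.
    valid = in_bounds.copy()
    valid[in_bounds] &= np.abs(depth_map[v[in_bounds], u[in_bounds]] - z[in_bounds]) < depth_tol
    return uv[valid], int(valid.sum())   # pixel locations and a projected-area proxy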
VLM Iterative Reasoning Module
SeqVLM introduces an iterative reasoning mechanism to address VLM input length constraints and computational overload. Candidate image sequences are batched (default L=4), and each batch is paired with the textual query in a prompt template. The VLM agent selects the best-matching candidate per batch, and the process iterates until a single candidate remains. This dynamic scheduling optimizes both inference efficiency and localization accuracy, circumventing VLM limitations in long-sequence reasoning.
Pseudocode for Iterative Reasoning
def predict(image_sequences, query, batch_size):
    """Iteratively narrow the candidate image sequences with the VLM until one remains."""
    candidates = image_sequences
    while len(candidates) > 1:
        # Split the remaining candidates into batches of at most `batch_size` (L=4 by default).
        batches = [candidates[i:i + batch_size] for i in range(0, len(candidates), batch_size)]
        next_candidates = []
        for batch in batches:
            # Pair the query with this batch's image sequences and ask the VLM for the best match.
            prompt = construct_prompt(query, batch)
            index = vlm_select(prompt)
            if index is not None:
                next_candidates.append(batch[index])
        candidates = next_candidates
    # Return the surviving proposal's index, used to look up its 3D bounding box.
    return candidates[0].index if candidates else None
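The helpers referenced above are not specified in the source. A hypothetical sketch of what they might look like is given below; the candidate structure, prompt wording, and stand-in backend are assumptions, and call_vlm would be replaced by a wrapper around an actual VLM API.

import re
from dataclasses import dataclass

@dataclass
class Candidate:
    index: int              # row of this proposal in the Object Profile Table
    image_sequence: object  # vertically concatenated multi-view image for this proposal

def call_vlm(text, images):
    """Stand-in VLM backend that always answers '0'.
    Replace with a wrapper around a real VLM (GPT-4, Qwen-vl-max, Doubao-1.5-vision-pro, ...)."""
    return "0"

def construct_prompt(query, batch):
    """Pair the textual query with the current batch of candidate image sequences."""
    text = (
        f"Query: {query}\n"
        f"There are {len(batch)} candidate image sequences, numbered 0 to {len(batch) - 1}. "
        "Each sequence shows one 3D proposal highlighted across several real views of the scene. "
        "Answer with only the number of the candidate that best matches the query."
    )
    return {"text": text, "images": [c.image_sequence for c in batch]}

def vlm_select(prompt):
    """Send the prompt to the VLM backend and parse the chosen candidate index from its reply."""
    reply = call_vlm(prompt["text"], prompt["images"])
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None

# Usage example (images omitted): the stand-in backend always keeps the first candidate.
# best = predict([Candidate(0, img_a), Candidate(1, img_b)], "the chair next to the window", batch_size=4)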
End-to-End Pipeline
The final selected proposal's bounding box is retrieved from the Object Profile Table, completing the zero-shot 3D visual grounding pipeline. The framework is agnostic to the choice of VLM, with demonstrated transferability across GPT-4, Qwen-vl-max, and Doubao-1.5-vision-pro.
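To make the hand-off explicit, a minimal sketch of this final lookup is shown below, reusing the predict function from the pseudocode above. The dictionary-style Object Profile Table and its field names are illustrative assumptions, not the paper's exact schema.

# Hypothetical Object Profile Table: proposal index -> category and 3D box (center + size).
object_profile_table = {
    0: {"category": "chair", "bbox": (1.2, 0.4, 0.5, 0.6, 0.6, 0.9)},
    1: {"category": "chair", "bbox": (2.8, 1.1, 0.5, 0.6, 0.6, 0.9)},
}

def ground(query, candidates, batch_size=4):
    """Tail of the pipeline: iterative VLM reasoning, then bounding-box lookup."""
    best_index = predict(candidates, query, batch_size)  # predict() from the pseudocode above
    return None if best_index is None else object_profile_table[best_index]["bbox"]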
Experimental Results
SeqVLM achieves state-of-the-art performance on the ScanRefer and Nr3D benchmarks, surpassing previous zero-shot methods by absolute margins of 4.0% ([email protected] on ScanRefer) and 5.2% (overall accuracy on Nr3D). On ScanRefer, SeqVLM attains 55.6% [email protected] and 49.6% [email protected], rivaling fully supervised approaches. On Nr3D, SeqVLM achieves 53.2% overall accuracy, with robust gains across both easy and hard splits as well as view-dependent and view-independent scenarios. Ablation studies confirm that each module is indispensable, with the Proposal Selection Module contributing the largest performance gain.
VLM Selection and Cost Analysis
Doubao-1.5-vision-pro yields the highest accuracy (49.6% [email protected]) at increased computational cost, while Qwen-vl-max offers a favorable trade-off between accuracy and cost. Cross-method comparisons under controlled VLM settings demonstrate that SeqVLM's architectural innovations, rather than VLM capacity alone, drive the observed performance gains.
Hyper-parameter Sensitivity
Optimal performance is achieved with a VLM batch-size threshold of L=4 and a multi-view frame count of n=5. Smaller batches limit candidate contrast, while larger ones overload the VLM; too many views introduce noise, while too few restrict spatial disambiguation.
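A small configuration sketch summarizing these defaults is shown below; the class and field names are assumed for illustration, with only the default values taken from the reported optima.

from dataclasses import dataclass

@dataclass
class SeqVLMConfig:
    # Hypothetical configuration holder; field names are not from the paper.
    batch_size_L: int = 4   # candidates per VLM reasoning round; larger batches overload the VLM
    num_views_n: int = 5    # views per proposal; more views add noise, fewer lose spatial context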
Implications and Future Directions
SeqVLM advances zero-shot 3D visual grounding by integrating geometric reasoning, multi-view contextual fusion, and scalable VLM-based inference. The framework's ability to match supervised performance without task-specific training has significant implications for real-world deployment in robotics, autonomous driving, and AR/VR systems, where annotation costs and open-vocabulary requirements are prohibitive. The modular design enables adaptation to evolving VLM architectures and sensor modalities.
Future research may explore:
- End-to-end joint optimization of segmentation, projection, and reasoning modules
- Incorporation of temporal information for dynamic scene understanding
- Extension to outdoor and large-scale environments with heterogeneous sensor inputs
- Efficient model distillation and compression for resource-constrained deployment
Conclusion
SeqVLM presents a robust, transferable framework for zero-shot 3D visual grounding, leveraging proposal-guided multi-view image sequences and iterative VLM reasoning. The method achieves state-of-the-art accuracy on standard benchmarks, demonstrating strong generalization and practical applicability. Its modular architecture and empirical validation establish SeqVLM as a foundational approach for cross-modal 3D scene understanding in open-world settings.