SeqVLM: Zero-shot 3D Visual Grounding

Updated 4 September 2025
  • The paper introduces SeqVLM, a framework that leverages multi-view spatial reasoning and proposal-guided sequence processing for robust zero-shot 3D visual grounding.
  • It employs a multi-stage methodology including 3D proposal generation, semantic filtering, adaptive projection, and iterative dynamic reasoning to enhance localization accuracy.
  • Empirical results on ScanRefer and Nr3D benchmarks demonstrate improved accuracy over previous methods, highlighting its potential for real-world applications.

SeqVLM is a framework for zero-shot 3D visual grounding (3DVG) that leverages multi-view spatial reasoning and proposal-guided sequence processing with vision-language models (VLMs). Designed to address critical limitations of previous zero-shot 3DVG solutions, especially single-view localization and the loss of contextual detail caused by occlusions or projection misalignments, SeqVLM integrates dense 3D proposals, semantic filtering, adaptive multi-view projection, and dynamic VLM reasoning to localize objects in 3D scenes from natural language queries, all without scene-specific training. Experimental results on leading benchmarks demonstrate state-of-the-art grounding accuracy and generalization, positioning SeqVLM as a substantial advance in unsupervised 3D scene understanding (Lin et al., 28 Aug 2025).

1. Motivation and Conceptual Foundation

SeqVLM is motivated by the need for robust object localization in complex, real-world 3D environments where exhaustive annotation or task-specific retraining is impractical. Existing zero-shot approaches often falter due to spatially limited reasoning inherent in single-view renderings and the omission of crucial contextual details caused by occlusions or projection misalignments. SeqVLM systematically addresses these limitations by incorporating multi-view real-world imagery and 3D geometry at the proposal level, allowing comprehensive reasoning over both spatial and semantic cues.

Zero-shot 3DVG—the process of localizing objects in 3D scenes given unconstrained textual input without domain-specific supervision—is increasingly relevant for applications such as autonomous driving, mixed-reality interfaces, and mobile robotics. SeqVLM’s structure enables transferable grounding skills across scenes, maximizing applicability in general-purpose 3D reasoning tasks.

2. Proposal-Guided Multi-Stage Methodology

2.1 3D Instance Proposal Generation

The initial stage of SeqVLM applies a 3D instance segmentation network (e.g., Mask3D) to the input point cloud. The network identifies and outputs a set of candidate segmentation masks, each representing a potential object instance with an associated confidence score:

$$\mathcal{P}(\mathbf{P}) = \{ M_i \mid o(M_i) \geq \Theta \}$$

where $M_i$ denotes a candidate mask and $o(\cdot)$ is the confidence function with filtering threshold $\Theta$ (Equation 1).
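
A minimal sketch of this filtering step (Equation 1) follows; the (mask, score) pair format and the default threshold are assumptions for illustration, not the paper's exact interface:

```python
# Minimal sketch of Eq. 1: retain only instance masks whose confidence
# o(M_i) clears the threshold Θ. Mask3D-style output is assumed to be
# an iterable of (mask, confidence) pairs; names here are illustrative.
def filter_proposals(proposals, theta=0.5):
    """proposals: iterable of (mask, confidence) pairs.
    Returns P(P) = {M_i | o(M_i) >= Θ}."""
    return [mask for mask, score in proposals if score >= theta]
```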

2.2 Semantic Filtering

Semantic filtering removes category-irrelevant proposals by leveraging a text-driven alignment pipeline:

  • An LLM interprets the user query and extracts the target class.
  • The candidate proposal categories and the target category are embedded using a text encoder (CLIP-ViT-Base-Patch16).
  • Cosine similarity $S_i = E_t \cdot E(i)$ is computed, where $E_t$ and $E(i)$ are the target and proposal embeddings, respectively (Equations 3–4).
  • Only proposals with maximal semantic alignment ($C^* = \arg\max_i S_i$) are retained for downstream processing (Equation 5), as sketched in the code after this list.
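
A minimal sketch of this alignment step, using the Hugging Face transformers CLIP text encoder named above; the example category labels and the LLM-extracted target class are hypothetical:

```python
# Semantic filtering sketch: embed the target class and proposal categories
# with CLIP-ViT-Base-Patch16, then rank proposals by cosine similarity.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch16")

def embed(labels):
    inputs = tokenizer(labels, padding=True, return_tensors="pt")
    with torch.no_grad():
        emb = text_encoder(**inputs).text_embeds
    # Unit-normalize so the dot product equals cosine similarity (Eqs. 3-4).
    return emb / emb.norm(dim=-1, keepdim=True)

target_class = "office chair"                 # hypothetical LLM-extracted target
proposal_labels = ["chair", "table", "sofa"]  # hypothetical proposal categories

E_t = embed([target_class])        # (1, d) target embedding
E_p = embed(proposal_labels)       # (N, d) proposal embeddings
S = (E_p @ E_t.T).squeeze(-1)      # similarities S_i = E_t · E(i)
best = int(S.argmax())             # C* = argmax_i S_i (Eq. 5)
print(proposal_labels[best], float(S[best]))
```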

2.3 Proposal-Guided Multi-View Projection

SeqVLM performs geometry-aware projection:

  • Each retained mask’s 3D points $P_w = [x_w, y_w, z_w]$ are transformed into the camera coordinate frame via:

$$P_c = T_{wc}\,[x_w, y_w, z_w, 1]^\top$$

where $T_{wc}$ denotes the world-to-camera transformation (Equation 7).

  • Projected 2D image coordinates $(u, v)$ are calculated with the intrinsic matrix $K$ (Equations 8–9).
  • From the multi-view images, the $n$ views with the largest projected regions are selected. Each bounding region is extracted and mildly expanded:

$$\mathcal{R} = [U_{min}, V_{min}, U_{max}, V_{max}]$$

(Equation 10).

  • Annotated crop regions are vertically stacked, producing an ordered multi-view sequence that preserves both geometric and contextual fidelity; the sketch after this list illustrates the projection step.
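
A hedged sketch of the geometry-aware projection (Equations 7–10), assuming a 4×4 world-to-camera extrinsic $T_{wc}$, a 3×3 intrinsic matrix $K$, and a fractional expansion margin; the function name and margin value are illustrative:

```python
import numpy as np

def project_proposal(P_w, T_wc, K, img_hw, margin=0.1):
    """Project a mask's 3D points (N, 3) to pixels and return an expanded
    2D bounding region R = [U_min, V_min, U_max, V_max], or None."""
    H, W = img_hw
    P_h = np.hstack([P_w, np.ones((len(P_w), 1))])       # homogeneous coords
    P_c = (T_wc @ P_h.T).T                               # camera frame (Eq. 7)
    P_c = P_c[P_c[:, 2] > 0]                             # keep points in front of camera
    if len(P_c) == 0:
        return None
    uvw = (K @ P_c[:, :3].T).T
    u, v = uvw[:, 0] / uvw[:, 2], uvw[:, 1] / uvw[:, 2]  # perspective divide (Eqs. 8-9)
    in_img = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    if not in_img.any():
        return None
    u, v = u[in_img], v[in_img]
    du, dv = margin * (u.max() - u.min()), margin * (v.max() - v.min())  # mild expansion
    return [max(0.0, u.min() - du), max(0.0, v.min() - dv),
            min(W - 1.0, u.max() + du), min(H - 1.0, v.max() + dv)]      # Eq. 10
```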

2.4 Iterative Dynamic Reasoning (Dynamic Scheduling)

Given significant VLM input constraints, SeqVLM applies a dynamic scheduling algorithm:

  • Multi-view sequences are split into batches (length $L$).
  • In each reasoning round, the VLM (e.g., Doubao-1.5-vision-pro) receives a batch paired with the text query. If a batch yields a plausible candidate, it is selected for the next round.
  • The process is iterative (Algorithm 1), refining the candidate pool via successive VLM evaluations until a unique target is isolated.

This mechanism balances computational resource usage with responsiveness, while exploiting cross-modal reasoning capabilities to maximize grounding accuracy.
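
An illustrative sketch of this loop (Algorithm 1); `query_vlm` is a hypothetical stand-in for a call to a vision-language model such as Doubao-1.5-vision-pro, and its batch-selection interface is an assumption rather than the paper's exact API:

```python
# Iterative dynamic reasoning: shrink the candidate pool round by round
# until a unique target remains.
def iterative_reasoning(view_sequences, text_query, query_vlm, batch_size):
    candidates = list(view_sequences)
    while len(candidates) > 1:
        survivors = []
        # Split the multi-view sequences into batches of length L = batch_size.
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i:i + batch_size]
            pick = query_vlm(batch, text_query)  # index of a plausible candidate, or None
            if pick is not None:
                survivors.append(batch[pick])
        if not survivors or len(survivors) == len(candidates):
            break  # no further reduction is possible this round
        candidates = survivors
    return candidates[0] if candidates else None
```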

3. Experimental Protocol and Empirical Performance

3.1 Benchmarks and Setup

SeqVLM is evaluated using ScanRefer and Nr3D, standard datasets in 3DVG:

  • Each experimental run inputs a registered 3D scene point cloud, multi-view real images, and a natural language description.
  • Proposals are generated (Mask3D), filtered semantically, and prepared for VLM-based reasoning as described.

3.2 Metrics

Performance is reported as accuracy at IoU thresholds (Acc@0.25 and Acc@0.5):

  • Acc@$k$ measures the fraction of samples for which the intersection-over-union between the predicted and ground-truth boxes exceeds $k$, as computed in the sketch below.
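
A minimal sketch of the metric, assuming per-sample 3D IoU values have already been computed; the sample values are hypothetical:

```python
import numpy as np

def acc_at_k(ious, k):
    """Acc@k: fraction of samples whose predicted box exceeds IoU threshold k."""
    return float((np.asarray(ious) > k).mean())

ious = [0.62, 0.18, 0.47, 0.91]                    # hypothetical per-sample IoUs
print(acc_at_k(ious, 0.25), acc_at_k(ious, 0.5))   # Acc@0.25 = 0.75, Acc@0.5 = 0.5
```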

3.3 Results Relative to Prior Art

SeqVLM attains state-of-the-art zero-shot accuracy on both benchmarks, surpassing prior zero-shot systems by +4.0% (ScanRefer) and +5.2% (Nr3D), with some results approaching fully-supervised systems (Lin et al., 28 Aug 2025).

Ablation studies confirm that proposal filtering, multi-view projection, and iterative scheduling each uniquely contribute to accuracy improvements.

4. Technical Analysis: Algorithms and Mathematical Formulation

4.1 Key Formulas and Algorithms

| Component | Mathematical Expression | Purpose |
|---|---|---|
| Instance proposal filtering | $\mathcal{P}(\mathbf{P}) = \{ M_i \mid o(M_i) \geq \Theta \}$ | Selects confident segmentation instances |
| Semantic matching | $S_i = E_t \cdot E(i)$ | Computes proposal–query cosine similarity |
| Projection to camera/image plane | $P_c = T_{wc}\,[x_w, y_w, z_w, 1]^\top$ | 3D → 2D transformation |
| Bounding box extraction | $\mathcal{R} = [U_{min}, V_{min}, U_{max}, V_{max}]$ | Extracts the ROI for each proposal per view |

Iterative dynamic scheduling (Algorithm 1) employs batch-wise VLM reasoning, reducing the candidate set until only the target instance remains.

4.2 Implementation Considerations

  • Semantic filtering relies on the quality of both LLM query parsing and category embedding representations.
  • Intrinsics/extrinsics of multi-view images must be accurately registered for projection integrity.
  • VLM input-output constraints necessitate careful batching and batch size tuning to avoid loss of context or computational overrun.
  • Candidate proposals may require expansion of bounding boxes to retain full context for ambiguous or large objects.

5. Broader Implications, Applications, and Future Directions

SeqVLM enables precise, context-preserving object localization via language—without retraining—in scenes with novel geometry, occlusions, or viewpoint diversity. This establishes a new paradigm for zero-shot 3DVG, suggesting the potential for cross-domain deployment in:

  • Embodied agents and robotics navigating unstructured spaces
  • AR/VR scene annotation for user walkthroughs
  • 3D video or scene retrieval from natural language queries

A plausible implication is that as VLMs and multimodal foundation models become increasingly powerful and data-efficient, SeqVLM-style compositional pipelines can generalize to a wider variety of object classes, scene complexities, and multi-modal signals (including temporal and audio cues).

Ongoing research may further enhance dynamic scheduling, close the gap between rendered and real image projections, advance semantic filtering (e.g., via stronger or more contextually sensitive LLMs), and optimize the efficiency of the multi-stage pipeline for real-time and embedded deployments.

6. Summary

SeqVLM introduces a multi-component, proposal-centric methodology to zero-shot 3D visual grounding, integrating robust segmentation, multi-view geometry-aware projection, semantic alignment, and dynamic VLM reasoning. Its empirical superiority on conventional benchmarks and methodological flexibility substantiate its position as a significant contribution to unsupervised 3D scene understanding, with far-reaching implications for scalable, context-driven vision-language grounding in practical environments (Lin et al., 28 Aug 2025).
