LocateAnything3D: VLM-based 3D Detection

Updated 29 November 2025
  • LocateAnything3D is a vision–language model for open-vocabulary 3D object detection that predicts object location and geometry using a sequential chain-of-sight strategy.
  • It employs a near-to-far ordering and intra-object center-size-rotation factorization to reduce ambiguities and enhance prediction stability.
  • Evaluated on Omni3D benchmarks, the model sets new state-of-the-art metrics, demonstrating data efficiency, zero-shot generalization, and robust performance under occlusion.

LocateAnything3D is a vision–language model (VLM) interface for open-vocabulary 3D object detection that casts detection as disciplined next-token prediction. The model uses a "Chain-of-Sight" factorization of the detection process: it first locates objects in 2D, then infers their 3D position, size, and orientation in a sequence that mirrors human reasoning. Designed to operate natively within VLM frameworks, LocateAnything3D combines a near-to-far ordering across detected instances with a center-size-rotation factorization within each object to improve prediction stability and learnability. It achieves state-of-the-art 3D detection metrics on challenging multi-domain benchmarks, including Omni3D, and demonstrates data-efficient training, zero-shot generalization to novel object categories, and robust qualitative results under occlusion and domain shift (Man et al., 25 Nov 2025).

1. Architectural Overview and Chain-of-Sight Principle

LocateAnything3D casts multi-object 3D detection as sequential next-token generation, mirroring a visual chain of thought: objects are first found in 2D, and their 3D geometry and pose are then estimated. The detection protocol is governed by two hierarchical orderings:

  • Inter-object curriculum: near-to-far ordering. Objects are decoded in order of increasing distance from the camera, which reduces ambiguity in predictions for distant objects and matches egocentric utility (e.g., robotics, navigation).
  • Intra-object factorization: within each object's detection sequence, the model first estimates the 3D center (in camera coordinates), followed by physical dimensions, then rotation parameters. This ordering follows decreasing prediction stability: the camera-frame center is the most consistently observable quantity, while size and rotation become increasingly ambiguous as objects are less visually prominent or partially occluded.

This "Chain-of-Sight" (CoS, Editor's term) forms an explicit sequence:

  1. 2D box prediction $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$
  2. 3D center $(X, Y, Z)$
  3. Size $(W, H, L)$
  4. Rotation $R$ (yaw or full $\mathrm{SO}(3)$)

The CoS strategy anchors 3D inference on explicit 2D localizations, reducing hallucinations and accelerating training convergence. Compared to direct 3D box prediction or randomized token order, CoS delivers higher AP and qualitative reliability (Man et al., 25 Nov 2025).
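
To make the decoding order concrete, the following is a minimal sketch of how a Chain-of-Sight target sequence could be assembled. The `Instance` container, the `serialize_cos_sequence` helper, and the token templates are illustrative assumptions for exposition, not the paper's actual tokenizer; the sketch only shows the near-to-far instance ordering and the per-object 2D box → center → size → rotation layout.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    """One annotated object; all 3D quantities are in the camera frame."""
    category: str
    box2d: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max), pixels
    center3d: Tuple[float, float, float]       # (X, Y, Z), metres
    size3d: Tuple[float, float, float]         # (W, H, L), metres
    yaw: float                                 # rotation about the vertical axis, radians


def serialize_cos_sequence(instances: List[Instance]) -> List[str]:
    """Build a Chain-of-Sight target sequence: objects ordered near-to-far,
    each contributing tokens in the order 2D box -> 3D center -> size -> rotation."""
    # Inter-object curriculum: decode closer objects first.
    ordered = sorted(instances, key=lambda obj: math.hypot(*obj.center3d))

    tokens: List[str] = []
    for obj in ordered:
        tokens.append(f"<cat:{obj.category}>")
        # 1) Anchor the object with its explicit 2D localization.
        tokens += [f"<2d:{v:.0f}>" for v in obj.box2d]
        # 2) Intra-object factorization: center, then size, then rotation.
        tokens += [f"<ctr:{v:.2f}>" for v in obj.center3d]
        tokens += [f"<sz:{v:.2f}>" for v in obj.size3d]
        tokens.append(f"<rot:{obj.yaw:.2f}>")
    return tokens
```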

2. Omni3D Benchmark Evaluation Protocols

LocateAnything3D is evaluated on Omni3D, a large-scale, unified benchmark for monocular 3D detection across heterogeneous domains. Omni3D fuses six public datasets (KITTI, nuScenes, ARKitScenes, SUN-RGBD, Hypersim, Objectron) with a camera-centric annotation schema. Each sample specifies camera intrinsics, projected 2D bounding boxes, and corresponding oriented 3D cuboids $(\mathbf{t}, \mathbf{d}, \mathbf{R})$ in the camera frame.
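
As a concrete (and purely illustrative) picture of this camera-centric schema, the record below bundles intrinsics, the projected 2D box, and the oriented cuboid $(\mathbf{t}, \mathbf{d}, \mathbf{R})$, together with the pinhole projection that ties the 3D annotation back to the image. Field names and the `project_center` helper are assumptions for exposition, not Omni3D's actual data format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class CuboidAnnotation:
    """One Omni3D-style camera-frame annotation (field names are illustrative)."""
    K: np.ndarray        # (3, 3) camera intrinsics
    box2d: np.ndarray    # (4,) projected 2D box: x_min, y_min, x_max, y_max (pixels)
    t: np.ndarray        # (3,) cuboid center in the camera frame (metres)
    d: np.ndarray        # (3,) cuboid dimensions W, H, L (metres)
    R: np.ndarray        # (3, 3) object-to-camera rotation
    category: str        # class label used for target-aware prompting


def project_center(ann: CuboidAnnotation) -> np.ndarray:
    """Project the cuboid center into the image with the pinhole model u ~ K t."""
    uvw = ann.K @ ann.t          # homogeneous image coordinates
    return uvw[:2] / uvw[2]      # (u, v) pixel coordinates
```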

Performance is measured by mean 3D Average Precision (AP$_{3D}$), which averages per-category precision–recall over volumetric IoU$_{3D}$ thresholds $\tau \in \{0.05, 0.10, \dots, 0.50\}$. For two cuboids $b$ and $\hat b$, volumetric IoU is defined as $\mathrm{IoU}_{3D}(b,\hat b) = \frac{\mathrm{Vol}(b \cap \hat b)}{\mathrm{Vol}(b \cup \hat b)}$. Detection protocols operate in a "target-aware" setting: only the ground-truth categories present in each image are prompted, isolating localization and geometry quality from open-vocabulary recognition (Man et al., 25 Nov 2025).
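
A hedged sketch of the metric's structure (not the official evaluation code): the IoU routine below handles only axis-aligned cuboids, whereas Omni3D's oriented cuboids require a general convex-intersection computation; the aggregation helper simply mirrors the per-category, per-threshold averaging described above, with a hypothetical input layout.

```python
import numpy as np


def iou3d_axis_aligned(b, b_hat):
    """Volumetric IoU for two axis-aligned cuboids given as (min_corner, max_corner)
    pairs of shape-(3,) arrays. Oriented cuboids need a general intersection routine."""
    lo = np.maximum(b[0], b_hat[0])
    hi = np.minimum(b[1], b_hat[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))                   # overlap volume
    union = np.prod(b[1] - b[0]) + np.prod(b_hat[1] - b_hat[0]) - inter
    return float(inter / union) if union > 0 else 0.0


def mean_ap3d(ap_per_category):
    """Aggregate AP_3D: `ap_per_category` maps each category to its ten AP values,
    one per IoU_3D threshold in {0.05, 0.10, ..., 0.50}; average over thresholds,
    then over categories."""
    per_cat = [float(np.mean(aps)) for aps in ap_per_category.values()]
    return float(np.mean(per_cat))
```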

3. State-of-the-Art Performance and Comparative Results

LocateAnything3D establishes a new state of the art on both the full Omni3D and the outdoor-only Omni3D_OUT tracks, outperforming recent baselines including Cube R-CNN, DetAny3D, and OVMono3D, even when those baselines are provided with oracle ground-truth 2D boxes. Key results are summarized below (all values are mean AP$_{3D}$, averaged over IoU thresholds):

| Method | AP$_{3D}$ (full Omni3D) |
| --- | --- |
| Cube R-CNN | 23.26 |
| DetAny3D | 24.92 |
| DetAny3D + GT 2D | 34.38 |
| LocateAnything3D | 49.89 |

LocateAnything3D delivers a +15.51 absolute AP$_{3D}$ improvement over DetAny3D even when the latter is given ground-truth 2D proposals (Man et al., 25 Nov 2025). Per-domain gains are particularly marked indoors (e.g., ARKitScenes, SUN-RGBD, Hypersim), with strong improvements also observed in outdoor environments.

4. Curriculum Ablation and Impact of Sequence Structure

Ablation studies isolate the effect of various sequence orderings within LocateAnything3D:

  • Inter-object near-to-far ordering improves AP$_{3D}$ by significant margins over random or scanline ordering: random 31.3, scanline 45.9, near-to-far 52.1.
  • Intra-object CoS factorization (2D→3D) yields 52.1 AP$_{3D}$, outperforming direct 3D decoding (34.6) and the reversed 3D→2D order (41.5).
  • Token order within 3D decoding: the center→size→rotation sequence achieves the highest AP, confirming that ordering by predictability and observability is optimal (AP$_{3D}$: rot→size→center 48.3, center→rot→size 51.9, center→size→rot 52.1) (Man et al., 25 Nov 2025).

This disciplined sequence structure not only yields the best accuracy but also generalizes better with limited data: with only 10% of the training images, CoS decoding reaches 33.6 AP$_{3D}$ versus 24.2 for direct 3D decoding; with 40%, 42.9 vs. 30.2; with 100%, 52.1 vs. 34.6. Transfer learning from 2D tasks further accelerates 3D convergence.

5. Generalization to Novel Categories and Robustness Analysis

Under zero-shot protocols, LocateAnything3D generalizes to held-out object categories, outperforming external-2D-proposal baselines:

| Novel split | LocateAnything3D AP$_{3D}$ | Gain over DetAny3D + 2D |
| --- | --- | --- |
| KITTI novel | 29.98 | +4.25 |
| SUN-RGBD novel | 35.39 | +14.32 |
| ARKitScenes novel | 32.53 | +7.97 |

These results confirm that the CoS VLM-native factorization confers superior transfer to unseen classes, even in the absence of specialized 2D detectors (Man et al., 25 Nov 2025).

Failure modes are primarily induced by extreme focal-length variation, heavy clutter (particularly small, distant objects), and unmodeled photometric perturbations that confound monocular depth estimation. Qualitative analyses underscore robust scale consistency, coherence under partial occlusion, and accurate depth ordering: bird's-eye projections reveal reliable near-versus-far instance geometry.

6. Implications for VLM-Integrated Robotics, AR/VR, and Future Research

LocateAnything3D demonstrates that 3D detection can be integrated into VLM architectures without task-specific decoders, heads, or proposal pipelines. Its disciplined token ordering, near-to-far curriculum, and factorized chain-of-sight decoding yield robust, transferable 3D perception across domains. The pragmatic implication is that models trained in this manner can serve as practical building blocks for agent-centric tasks—robotics, AR/VR, scene understanding—by directly predicting object identity and location in egocentric 3D space from monocular sensory input.

A plausible implication is that the CoS sequence principle may extend to other structured vision–language tasks involving hierarchical spatial reasoning, and that curriculum design remains a key driver of generalization and sample efficiency.

LocateAnything3D sets a benchmark for future VLM-native 3D reasoning, establishing clear protocols for evaluation, robust performance with complex scene variation, and zero-shot extension to novel categories (Man et al., 25 Nov 2025).
