- The paper introduces a unified promptable framework that flexibly integrates text, point, and box cues to drive open-vocabulary monocular 3D detection.
- It employs a dual-encoder architecture with a residual depth fusion module and geometric augmentation, achieving significant AP improvements across diverse benchmarks.
- The work scales to over 13.5K categories using a large, human-verified 3D dataset, paving the way for robust embodied AI, AR, and robotics applications.
WildDet3D: A Unified, Promptable Framework for Open-Vocabulary Monocular 3D Detection
Introduction
WildDet3D addresses the pivotal challenge of enabling open-vocabulary, monocular 3D object detection from single images with flexible, prompt-driven interfaces and robust generalization across diverse, real-world settings. Prior methods are constrained by closed-set vocabularies, fixed prompt modalities, and the inability to ingest optional geometric cues such as partial depth information. The WildDet3D architecture and its associated large-scale dataset, WildDet3D-Data, systematically resolve these barriers, establishing a new technical paradigm for 3D detection that unifies text, point, and box prompts and supports geometric augmentation at inference.
Figure 1: WildDet3D performs open-vocabulary monocular 3D object detection from a single RGB image and optional depth, supporting multiple flexible prompt modalities and scaling to thousands of categories in diverse open-world environments.
Motivation and Modality Considerations
Monocular 3D object detection remains challenging due to scale ambiguity, occlusions, and the necessity for semantic generalization. Traditional approaches based on LiDAR or RGB modalities are either geometrically rich but semantically poor, or vice versa. WildDet3D leverages the complementary strengths of dense visual semantics from RGB and resolves metric scale ambiguity whenever optional depth (from LiDAR, stereo, or ToF sensors) is available, all within a unified modality-agnostic framework.
Figure 2: Input modality trade-offs reveal the optimality of combining RGB with optional depth, balancing rich semantics with scalable geometric grounding.
Model Architecture and Training
WildDet3D operationalizes a dual-encoder paradigm: a segmentation-pretrained ViT-H serves as the core image backbone, while a DINOv2 ViT-L/14-based RGBD encoder, optionally incorporating external depth, generates geometry-aware representations. These streams are fused via a residual depth fusion module, supporting graceful fallback to monocular mode when no external geometric signals are present. Prompt handling is generic and modular: text, 2D points, 2D boxes, and exemplars are encoded and injected as queries to the detection head, ensuring modality-agnostic conditioning and strong interactive capability.
The 3D detection head utilizes a transformer-based hierarchy with deep supervision, producing multi-source features for box regression, confidence prediction (with a geometry-aware scoring objective), and unambiguous rotation normalization to resolve symmetric ambiguities via canonicalization of the yaw-angle and dimension statistics. Joint 2D+3D detection is critical—L1 and multi-task geometry aware losses are applied to all decoder layers, and auxiliary detection/depth heads enforce additional regularization and convergence acceleration.
Figure 3: The dual-encoder backbone, depth fusion mechanism, and promptable head structure facilitate multi-modality, flexible prompt-based inference, and robust metric detection.
Data Pipeline: WildDet3D-Data
Scaling open-vocabulary 3D detection necessitates data with large categorical coverage and realistic diversity. WildDet3D-Data comprises over 1 million images and 3.7 million human-verified 3D annotations across 13.5K categories—achieved by 3D lifting from high-quality 2D annotation sources (COCO, LVIS, Objects365, V3Det) via a candidate generation process using five complementary 3D lifting models. Multi-stage rule-based and learned filtering (including geometric thresholds, VLM/LLM-based category plausibility, and human-in-the-loop verification) ensure annotation quality and bias mitigation.
WildDet3D robustly outperforms all prior monocular and promptable 3D detection approaches:
- On WildDet3D-Bench (700+ open-vocabulary categories):
- WildDet3D achieves 22.6 (text) / 24.8 (box) AP3D​, with ground-truth depth boosting AP to 41.6/47.2—a nearly twofold improvement from additional geometric input.
- The model generalizes to rare, common, and frequent categories, with the largest gains for previously unseen (rare) categories.
- On Omni3D (standard benchmark):
- 34.2 AP3D​ (text) and 36.4 AP3D​ (box) with only 12 training epochs (an order-of-magnitude less than competing methods).
- With real depth, reaches 45.8 AP3D​ (box), demonstrating effective integration of partial geometry at inference.
- Zero-shot generalization:
- 40.3 (Argoverse 2) and 48.9 (ScanNet) ODS, significantly surpassing previous strong baselines, particularly in geometric localization and orientation on novel categories and environments.
Importantly, several ablations highlight:
- Joint 2D+3D prediction is critical (removal collapses AP by >19).
- One-to-many matching and explicit geometric supervision drive substantial AP gains.
- Depth input at inference provides major improvements across scene types, with especially large effects on indoor, cluttered, or ambiguous-depth settings.
Deployment and Applications
WildDet3D is deployable across heterogeneous platforms, including mobile (iPhone via ARKit sensor fusion), AR (Meta Quest/Glasses), robotics (franka arm for grasping via open-vocabulary 3D detection), and as a spatial reasoning module for vision-LLMs. Its promptable interface enables seamless integration with language-conditioned workflows, supporting both human-in-the-loop and agent-driven interaction.
Implications and Future Directions
The strong empirical results, multi-modal prompt handling, and scalable data annotation pipeline position WildDet3D as a robust foundation for generic visual spatial reasoning in open-world scenarios. The architecture’s support for optional geometric input and high efficiency make it a compelling module for practical embodied AI, AR/VR, and interactive robotics deployments. However, performance on rare categories and ambiguity in single-image depth estimations remain challenging—motivating future work on self-supervised metric depth grounding, on-device model distillation, robust intrinsic estimation, and more sophisticated prompt fusion.
The proposed dataset and model provide a foundation for research into open-world, promptable, multi-modal spatial perception, and illustrate the potential of large-scale data curation combined with geometry-aware vision-language modeling for scaling generalist robotic and perception systems.
Conclusion
WildDet3D advances monocular 3D detection by consolidating a promptable, open-vocabulary architecture with scalable in-the-wild annotation, achieving state-of-the-art performance across diverse benchmarks while supporting flexible, geometry-augmented deployment. The framework opens new avenues for vision-LLMs grounded in 3D space, and establishes a rigorous blueprint for the next generation of unified spatial perception systems (2604.08626).