WildDet3D-Data: Open-World Monocular 3D Detection
- The paper introduces a scalable methodology that converts existing 2D annotations into 3D detections using geometric estimation, language-model filtering, and human verification.
- WildDet3D-Data is a comprehensive dataset with over 1M images, 3.7M valid 3D boxes, and 13.5K categories spanning indoor, urban, and natural scenes.
- Benchmark results highlight significant performance improvements in promptable monocular 3D detection when integrating depth cues and multi-stage candidate optimization.
WildDet3D-Data is a large-scale open-world monocular 3D detection dataset introduced alongside the unified WildDet3D architecture for promptable 3D detection from a single RGB image. It is constructed from existing 2D annotations in COCO, LVIS, Objects365, and V3Det, and pairs a candidate-generation-and-selection pipeline with human verification and VLM filtering to produce valid 3D boxes at unprecedented scale: 1,003,886 images, 229,934 human-verified 3D boxes, 3,483,292 VLM-filtered boxes, 3.7 M valid 3D annotations in total, and 13,499 unique categories in diverse real-world scenes (Huang et al., 9 Apr 2026). Within the WildDet3D framework, the dataset functions both as a training resource for open-vocabulary monocular 3D detection and as a benchmark substrate for evaluating text-prompted, box-prompted, and depth-assisted inference (Huang et al., 9 Apr 2026).
1. Definition and scope
WildDet3D-Data was introduced to address a specific bottleneck in monocular 3D object detection: existing 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer (Huang et al., 9 Apr 2026). The dataset therefore targets breadth of vocabulary, heterogeneity of scene type, and scalable 3D annotation over image collections that originally contained only 2D supervision.
The source corpus is drawn from COCO, LVIS, Objects365, and V3Det, covering over 1 million images and 13.5 K object categories (Huang et al., 9 Apr 2026). The resulting data distribution is explicitly long-tail. The dataset statistics report 13.5 K categories overall, while the validation and test sets cover 881 categories; 826/881 validation categories have at least one train example, and approximately 820 have at least three (Huang et al., 9 Apr 2026). Scene diversity is also quantified: Indoor 52% (offices, kitchens, living rooms), Urban 32% (streets, markets, parks), and Nature 15% (wildlife, forests, water) (Huang et al., 9 Apr 2026).
A central design feature is that WildDet3D-Data is not a purely manual annotation effort. Instead, it combines monocular geometry estimation, multi-model 3D proposal generation, rule-based and LLM-based filtering, human verification for a balanced subset, and VLM selection for scale (Huang et al., 9 Apr 2026). This suggests that the dataset is intended as a compromise between annotation fidelity and category-scale coverage rather than as a conventional closed-set benchmark with uniformly hand-labeled 3D boxes.
2. Construction pipeline
The construction pipeline has three main stages: candidate generation, rule-based and LLM-based filtering, and candidate selection (human + VLM) (Huang et al., 9 Apr 2026). For each 2D box or mask, multiple 3D-box candidates are generated, geometrically implausible candidates are filtered, and a single human-verified 3D box is retained or the instance is marked ignored (Huang et al., 9 Apr 2026).
The monocular geometry stage begins with 4× diffusion super-resolution of the input image, followed by metric monocular depth estimation by MoGe-2 at 1024px long edge (Huang et al., 9 Apr 2026). Camera intrinsics $K$ and extrinsic roll/pitch are predicted by PerspectiveFields and WildCamera (Huang et al., 9 Apr 2026). The depth map $D$ is then back-projected into a 3D point cloud; under the standard pinhole model, a pixel $(u, v)$ maps to the camera-frame point
$$P(u, v) = D(u, v)\, K^{-1} \, [u, v, 1]^{\top}.$$
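As a minimal sketch of the back-projection step, assuming a standard pinhole camera with focal lengths `fx`, `fy` and principal point `cx`, `cy` (these names and the nested-list depth map are illustrative, not taken from the paper), each pixel with valid depth can be lifted to a camera-frame 3D point:

```python
from typing import List, Tuple

Point = Tuple[float, float, float]

def backproject_depth(depth: List[List[float]], fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Lift a metric depth map to camera-frame 3D points (pinhole model)."""
    points: List[Point] = []
    for v, row in enumerate(depth):      # v: pixel row (image y)
        for u, z in enumerate(row):      # u: pixel column (image x)
            if z <= 0.0:                 # skip invalid or missing depth
                continue
            x = (u - cx) * z / fx        # inverse of u = fx * x / z + cx
            y = (v - cy) * z / fy        # inverse of v = fy * y / z + cy
            points.append((x, y, z))
    return points

# Toy 2x2 depth map in meters; the zero entry marks a missing measurement.
depth = [[2.0, 2.0], [0.0, 4.0]]
cloud = backproject_depth(depth, fx=100.0, fy=100.0, cx=0.5, cy=0.5)
```

A production pipeline would normally vectorize this with an array library; the explicit loop is kept here only to make the per-pixel geometry visible.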
Candidate generation uses five complementary methods for each 2D annotation: 3D-MOOD, DetAny3D, SAM-3D, RANSAC-PCA, and LabelAny3D (Huang et al., 9 Apr 2026). Their roles are differentiated. 3D-MOOD is an open-vocabulary 3D detector matched to GT 2D boxes by 2D IoU; DetAny3D performs direct regression from 2D crop features; SAM-3D reconstructs a triangular mesh from the object mask and depth cloud and fits an oriented box to the mesh vertices; RANSAC-PCA clusters mask points, removes outliers, fits a minimum-area rectangle in the principal-axes frame, and aligns to gravity; LabelAny3D performs single-image 3D mesh synthesis from 2D crops and then scene alignment (Huang et al., 9 Apr 2026).
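To make the RANSAC-PCA idea more concrete, the following is a simplified stand-in rather than the paper's procedure: it skips RANSAC outlier removal and uses PCA-aligned extents instead of a true minimum-area rectangle. It assumes the points are already expressed in a gravity-aligned frame whose second coordinate is vertical; the function name `pca_aligned_box` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]

def pca_aligned_box(points: List[Point]):
    """Fit a gravity-aligned box using PCA on the ground-plane footprint."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    mz = sum(p[2] for p in points) / n
    # 2x2 covariance of the footprint (x-z plane; y is assumed vertical).
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    szz = sum((p[2] - mz) ** 2 for p in points) / n
    sxz = sum((p[0] - mx) * (p[2] - mz) for p in points) / n
    # Angle of the major principal axis, measured from x toward z.
    yaw = 0.5 * math.atan2(2.0 * sxz, sxx - szz)
    c, s = math.cos(yaw), math.sin(yaw)
    # Coordinates along the principal axis (a) and its perpendicular (b).
    a = [(p[0] - mx) * c + (p[2] - mz) * s for p in points]
    b = [-(p[0] - mx) * s + (p[2] - mz) * c for p in points]
    ys = [p[1] for p in points]
    dims = (max(a) - min(a), max(ys) - min(ys), max(b) - min(b))
    # Box center: midpoint of the extents, mapped back to the input frame.
    ca, cb = (max(a) + min(a)) / 2.0, (max(b) + min(b)) / 2.0
    center = (mx + ca * c - cb * s,
              (max(ys) + min(ys)) / 2.0,
              mz + ca * s + cb * c)
    return center, dims, yaw

# Corners of an axis-aligned 4 x 1 x 2 block as a toy point set.
pts = [(x, y, z) for x in (0.0, 4.0) for y in (0.0, 1.0) for z in (0.0, 2.0)]
center, dims, yaw = pca_aligned_box(pts)
```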
The proposal optimization stage aligns candidate boxes to the scene point cloud. Translation is refined by minimizing a two-term objective: one term pushes sampled points inside the box, and the other pulls box faces toward the points (Huang et al., 9 Apr 2026). Optimization proceeds first by a grid search and then by L-BFGS-B. Rotation is refined via PCA gravity alignment and 2D-projection consistency. Each output is then merged into a unified 10D format, which matches the annotation schema of a 3D center, three box dimensions, and a unit quaternion (3 + 3 + 4 values).
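One reading of a 10D box, consistent with the annotation schema's center, dimensions, and unit quaternion (3 + 3 + 4 values), can be sketched as follows. The scalar-first quaternion order `(w, x, y, z)`, the assignment of the three dimensions to the box's local x, y, z axes, and the helper names are assumptions made for illustration, not conventions confirmed by the paper.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # assumed order: (w, x, y, z)

def rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate v by the unit quaternion q."""
    w, x, y, z = q
    # t = 2 * cross(q_vec, v)
    tx = 2.0 * (y * v[2] - z * v[1])
    ty = 2.0 * (z * v[0] - x * v[2])
    tz = 2.0 * (x * v[1] - y * v[0])
    # v' = v + w * t + cross(q_vec, t)
    return (v[0] + w * tx + (y * tz - z * ty),
            v[1] + w * ty + (z * tx - x * tz),
            v[2] + w * tz + (x * ty - y * tx))

def box_corners(center: Vec3, dims: Vec3, quat: Quat) -> List[Vec3]:
    """Return the 8 corners of an oriented box given its 10 parameters."""
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                local = (sx * dims[0], sy * dims[1], sz * dims[2])
                r = rotate(quat, local)
                corners.append((center[0] + r[0], center[1] + r[1], center[2] + r[2]))
    return corners

# Identity orientation, then a 90-degree turn about the vertical (y) axis.
upright = box_corners((0.0, 0.0, 10.0), (4.0, 2.0, 6.0), (1.0, 0.0, 0.0, 0.0))
half = math.pi / 4.0
turned = box_corners((0.0, 0.0, 10.0), (4.0, 2.0, 6.0),
                     (math.cos(half), 0.0, math.sin(half), 0.0))
```

After the 90-degree turn, the box's extents along the camera x and z axes swap, which is a quick sanity check on the quaternion convention.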
Filtering combines explicit geometric heuristics and language-model-based plausibility constraints. Geometric filters remove candidates with an edge-contact ratio below 3%, a RANSAC occlusion ratio above 15%, or a 3D-to-2D projected size ratio outside an admissible range (Huang et al., 9 Apr 2026). A depicted-object filter uses a VLM to remove posters, screens, and reflections; a composite-image filter discards collages; and a GPT-4.1-mini size/shape filter enforces per-category axis-length ranges and depth/width ratios (Huang et al., 9 Apr 2026). A small-object upgrade allows small annotations, with 2D area below 0.5% of the image area, to re-enter if they are of high quality (Huang et al., 9 Apr 2026).
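The geometric part of this filter can be sketched as a simple predicate. The 3% and 15% thresholds are the ones quoted above; the admissible size-ratio range is not given in this summary, so it is a required argument here, and the field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CandidateStats:
    edge_contact_ratio: float  # fraction in [0, 1]
    occlusion_ratio: float     # RANSAC occlusion ratio in [0, 1]
    size_ratio: float          # 3D-to-2D projected size ratio

def passes_geometric_filters(stats: CandidateStats,
                             size_ratio_range: Tuple[float, float],
                             min_edge_contact: float = 0.03,
                             max_occlusion: float = 0.15) -> bool:
    """Return True if the candidate survives all three geometric checks."""
    lo, hi = size_ratio_range  # caller-supplied; not specified in the text
    if stats.edge_contact_ratio < min_edge_contact:
        return False
    if stats.occlusion_ratio > max_occlusion:
        return False
    return lo <= stats.size_ratio <= hi

# The (0.5, 2.0) range below is a placeholder chosen only for this example.
ok = passes_geometric_filters(CandidateStats(0.10, 0.05, 1.0), (0.5, 2.0))
occluded = passes_geometric_filters(CandidateStats(0.10, 0.30, 1.0), (0.5, 2.0))
```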
3. Human verification and synthetic scaling
Candidate selection is split between a human annotation path and a VLM-based path. For a balanced subset of 102 k images, Prolific workers are shown up to five candidates in four views—an RGB overlay and three orthographic point-cloud views—then choose the best candidate and rate it as good_fit, acceptable, or unacceptable (Huang et al., 9 Apr 2026). Annotation quality control is explicit: batches contain 50 tasks and 5 gold-standard checks, and workers must catch at least 2/5; annotator pass rates are reported as 84–98% across splits (Huang et al., 9 Apr 2026).
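The gold-standard check described above amounts to a per-batch acceptance rule. A minimal sketch, assuming each gold task yields a boolean "caught" outcome (the function and constant names are illustrative):

```python
from typing import List

TASKS_PER_BATCH = 50      # from the text
GOLD_PER_BATCH = 5        # from the text
MIN_GOLD_CAUGHT = 2       # "at least 2/5"

def batch_passes_quality_control(gold_caught: List[bool]) -> bool:
    """Accept a worker's batch if enough gold-standard checks were caught."""
    if len(gold_caught) != GOLD_PER_BATCH:
        raise ValueError("expected one outcome per gold-standard check")
    return sum(gold_caught) >= MIN_GOLD_CAUGHT

accepted = batch_passes_quality_control([True, False, True, False, False])
rejected = batch_passes_quality_control([True, False, False, False, False])
```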
The synthetic path is controlled by VLM selection. Molmo 2 scores each candidate on six criteria (category, scale, translation, shape, rotation, and tilt) for a maximum of 11 points, and a candidate is kept if its score exceeds 10 (Huang et al., 9 Apr 2026). The paper's Table 2 reports a perfect monotonic (Spearman rank) correlation between VLM score and human rejection (Huang et al., 9 Apr 2026). This suggests that the VLM filter was used not merely for throughput, but as a calibrated proxy for human quality judgments in the larger synthetic portion of the data.
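The keep rule can be written down directly. This sketch reads "exceeds 10" as a strict inequality, under which only a full 11-point score is retained; how the 11 points are divided among the six criteria is not stated here, so the per-criterion values in the example are hypothetical.

```python
from typing import Dict

CRITERIA = ("category", "scale", "translation", "shape", "rotation", "tilt")
MAX_SCORE = 11
KEEP_THRESHOLD = 10

def total_score(scores: Dict[str, int]) -> int:
    """Sum the six per-criterion scores after basic validation."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"missing criteria: {missing}")
    total = sum(scores[c] for c in CRITERIA)
    if not 0 <= total <= MAX_SCORE:
        raise ValueError("total score outside the 0..11 range")
    return total

def keep_candidate(scores: Dict[str, int]) -> bool:
    """Keep the candidate only if its total strictly exceeds the threshold."""
    return total_score(scores) > KEEP_THRESHOLD

# Hypothetical point allocation, used only to exercise the rule.
full = {"category": 2, "scale": 2, "translation": 2,
        "shape": 2, "rotation": 2, "tilt": 1}
near = dict(full, tilt=0)
kept, dropped = keep_candidate(full), keep_candidate(near)
```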
The relative contributions of proposal generators are also quantified in the validation analysis. Table 2 reports that SAM-3D accounts for 40.4% of selected candidates with 17.3% rejected by humans, while RANSAC-PCA accounts for 28.2% with 12.5% rejected (Huang et al., 9 Apr 2026). Figures for the remaining three generators are not given here.
The dataset organization reflects this two-track annotation strategy:
| Split | Images | 3D annotations | Categories |
|---|---|---|---|
| Train–Human | 102,979 | 229,934 | 12,064 |
| Train–Synthetic | 896,004 | 3,483,292 | 11,896 |
| Val (human) | 2,470 | 9,256 | 785 |
| Test (human) | 2,433 | 5,596 | 633 |
These counts imply that WildDet3D-Data separates high-confidence human-verified supervision from larger-scale synthetic supervision while retaining human-only validation and test partitions (Huang et al., 9 Apr 2026).
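The split table can be cross-checked against the headline totals with a few lines of arithmetic; the dictionary below simply transcribes the image and 3D-annotation columns.

```python
# (images, 3D annotations) per split, copied from the table above.
SPLITS = {
    "Train-Human": (102_979, 229_934),
    "Train-Synthetic": (896_004, 3_483_292),
    "Val (human)": (2_470, 9_256),
    "Test (human)": (2_433, 5_596),
}

total_images = sum(images for images, _ in SPLITS.values())
total_boxes = sum(boxes for _, boxes in SPLITS.values())
train_boxes = SPLITS["Train-Human"][1] + SPLITS["Train-Synthetic"][1]
synthetic_share = SPLITS["Train-Synthetic"][1] / total_boxes

print(total_images)   # 1003886, matching the reported image count
print(train_boxes)    # 3713226, i.e. the "3.7 M" figure
print(total_boxes)    # 3728078 including val and test
```

The synthetic split therefore supplies the large majority of the 3D boxes, which is why the calibration of the VLM filter matters for overall label quality.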
4. Annotation schema and coordinate conventions
Each annotation record contains a 3D box in camera-centric metric coordinates together with categorical and quality metadata (Huang et al., 9 Apr 2026). The box parameterization includes the center in meters, the dimensions in meters, and the orientation as a unit quaternion (Huang et al., 9 Apr 2026). The category label is stored as an open-vocabulary string. Each instance also carries a quality flag in the set {good_fit, acceptable, unacceptable}, with unacceptable mapped to ignore3D=1 (Huang et al., 9 Apr 2026). A source indicator distinguishes human versus VLM provenance.
The coordinate convention is a right-handed camera frame in which $x$ is positive to the right, $y$ is positive down, and $z$ is positive into the scene (Huang et al., 9 Apr 2026). This choice matters for downstream interoperability because many 3D detection pipelines adopt dataset-specific axis conventions, and WildDet3D-Data makes its convention explicit.
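Under this convention (first axis right, second axis down, third axis into the scene), projecting a camera-frame point into pixel coordinates follows the usual pinhole relation. The sketch below assumes zero skew and no lens distortion; the intrinsics names `fx`, `fy`, `cx`, `cy` are illustrative.

```python
from typing import Tuple

def project_point(p: Tuple[float, float, float],
                  fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Project a camera-frame point (x right, y down, z forward) to pixels."""
    x, y, z = p
    if z <= 0.0:
        raise ValueError("point is not in front of the camera")
    u = fx * x / z + cx   # larger x moves the pixel to the right
    v = fy * y / z + cy   # larger y moves the pixel downward
    return (u, v)

u, v = project_point((1.0, 0.5, 2.0), fx=100.0, fy=100.0, cx=320.0, cy=240.0)
print(u, v)  # 370.0 265.0
```

Because the second axis points down, no sign flip is needed between the camera frame and image rows, which is one practical reason this convention is common in vision pipelines.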
Additional metadata include the ignore3D bit for unusable 3D annotations, per-annotation VLM score for synthetic labels, scene category, depth range (near, mid, far), and image-level intrinsics $K$ and depth-map availability (Huang et al., 9 Apr 2026). Because the construction pipeline depends on estimated depth and estimated camera parameters, the presence of image-level intrinsics and depth availability is operationally significant for both reproducibility and re-evaluation.
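The schema described in this section could be mirrored in code roughly as follows. This is a sketch of one possible in-memory record, not the dataset's actual file format: all field names, the `"human"`/`"vlm"` source strings, and the scalar-first quaternion order are assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

QUALITY_FLAGS = ("good_fit", "acceptable", "unacceptable")
SOURCES = ("human", "vlm")  # assumed encoding of the provenance indicator

@dataclass
class Box3DAnnotation:
    center_m: Tuple[float, float, float]            # camera frame, meters
    dimensions_m: Tuple[float, float, float]        # meters
    quaternion: Tuple[float, float, float, float]   # unit quaternion
    category: str                                   # open-vocabulary label
    quality: str                                    # one of QUALITY_FLAGS
    source: str                                     # one of SOURCES
    vlm_score: Optional[int] = None                 # synthetic labels only

    def __post_init__(self) -> None:
        if self.quality not in QUALITY_FLAGS:
            raise ValueError(f"unknown quality flag: {self.quality}")
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")

    @property
    def ignore3d(self) -> int:
        """1 when the box is unusable for 3D training or evaluation."""
        return 1 if self.quality == "unacceptable" else 0

ann = Box3DAnnotation((0.2, 0.1, 3.5), (0.4, 0.9, 0.4), (1.0, 0.0, 0.0, 0.0),
                      category="floor lamp", quality="unacceptable", source="human")
```

Deriving `ignore3d` from the quality flag, instead of storing it separately, keeps the two from drifting out of sync in this sketch.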
5. Benchmark protocols and reported performance
WildDet3D-Data supports benchmark evaluation through AP$_{3D}$ and Open Detection Score (ODS) protocols (Huang et al., 9 Apr 2026). On Omni3D, AP$_{3D}$ uses 3D-IoU matching and is averaged over a range of IoU thresholds (Huang et al., 9 Apr 2026). On WildDet3D-Bench and Stereo4D, AP$_{3D}$ instead uses center-distance matching, pairing predictions with ground-truth boxes by the distance between their centers rather than by 3D overlap.
Results are reported overall and stratified by category frequency into Rare, Common, and Frequent groups, defined by the number of images per category (Huang et al., 9 Apr 2026).
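Center-distance matching can be illustrated with a greedy assignment. The exact matching criterion used by the benchmark is not reproduced here, so this sketch assumes an absolute Euclidean distance threshold `max_dist` in meters and a highest-score-first greedy rule; both are assumptions.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

def match_by_center_distance(preds: List[Tuple[float, Vec3]],
                             gts: List[Vec3],
                             max_dist: float) -> List[bool]:
    """Return true-positive flags for predictions, in descending-score order."""
    order = sorted(range(len(preds)), key=lambda i: preds[i][0], reverse=True)
    used = set()
    flags = []
    for i in order:
        _, center = preds[i]
        best_j, best_d = None, max_dist
        for j, gt_center in enumerate(gts):
            if j in used:
                continue                      # each ground truth matches once
            d = math.dist(center, gt_center)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is None:
            flags.append(False)               # unmatched: false positive
        else:
            used.add(best_j)
            flags.append(True)
    return flags

gts = [(0.0, 0.0, 5.0), (2.0, 0.0, 8.0)]
preds = [(0.9, (0.1, 0.0, 5.0)),    # close to the first ground truth
         (0.8, (0.2, 0.0, 5.1)),    # duplicate of the same object
         (0.3, (2.0, 0.1, 8.2))]    # close to the second ground truth
flags = match_by_center_distance(preds, gts, max_dist=0.5)
```

The resulting flags, ordered by score, are what a precision-recall curve and hence AP would be computed from.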
Zero-shot transfer on Argoverse 2 and ScanNet is measured with ODS, a composite score whose definition involves mATE, mAOE, and mASE, the mean translation, orientation, and scale errors (Huang et al., 9 Apr 2026).
The benchmark framing is promptable. WildDet3D-Bench uses either a text query or an oracle 2D box prompt (Huang et al., 9 Apr 2026). In the open-world setting, WildDet3D achieves 22.6 AP$_{3D}$ with text prompts and 24.8 AP$_{3D}$ with box prompts on WildDet3D-Bench (Huang et al., 9 Apr 2026). On Omni3D, it reaches 34.2 and 36.4 AP$_{3D}$ with text and box prompts, respectively (Huang et al., 9 Apr 2026). In zero-shot evaluation, it achieves 40.3 and 48.9 ODS on Argoverse 2 and ScanNet (Huang et al., 9 Apr 2026). A further reported result is that incorporating depth cues at inference time yields +20.7 AP on average across settings (Huang et al., 9 Apr 2026).
The specific contribution of WildDet3D-Data appears in the WildDet3D-Bench baseline table. 3D-MOOD with text prompt obtains 2.3 AP; WildDet3D without the new data obtains 6.8 AP; WildDet3D with WildDet3D-Data reaches 22.6 AP; and the addition of ground-truth depth raises performance to 41.6 AP (Huang et al., 9 Apr 2026). This indicates that dataset scale, vocabulary coverage, and geometric side information all materially affect open-world monocular 3D detection performance.
6. Strengths, limitations, and relation to adjacent datasets
The stated strengths of WildDet3D-Data are scale, vocabulary breadth, scene diversity, long-tail support, and a quality-control regime that combines human verification with VLM pre-filtering (Huang et al., 9 Apr 2026). The paper summarizes this as 1 M images, 3.7 M boxes, and 13.5 K categories, corresponding to 138× category coverage versus Omni3D (Huang et al., 9 Apr 2026). Because scenes span indoor, urban, and natural settings, the dataset is designed to support open-world transfer rather than only domain-specific generalization.
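The 138× figure can be reproduced from the category counts. The sketch assumes Omni3D's vocabulary is 98 categories, a number that comes from the Omni3D benchmark itself and is not stated in this article.

```python
WILDDET3D_CATEGORIES = 13_499   # from the dataset statistics above
OMNI3D_CATEGORIES = 98          # assumption: Omni3D's published category count

ratio = WILDDET3D_CATEGORIES / OMNI3D_CATEGORIES
print(round(ratio))  # 138
```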
Several limitations are also made explicit. Annotation bias is one: the Prolific pool is reported as 86% US/UK/CA and 78% White, which may introduce cultural biases in quality judgments (Huang et al., 9 Apr 2026). Camera calibration is another: intrinsics are estimated rather than factory-calibrated, which may degrade metric fidelity (Huang et al., 9 Apr 2026). Monocular depth errors propagate directly into candidate boxes, and the paper notes dramatic improvements when real depth is provided (Huang et al., 9 Apr 2026). Rotational symmetry remains a challenge for orientation, the dual-backbone pipeline is heavyweight and not suited for fully on-device real-time use, and very rare categories still exhibit high variance in box accuracy (Huang et al., 9 Apr 2026).
A useful comparison can be drawn with WildDepth, a multimodal dataset for 3D wildlife perception and depth estimation that uses synchronized RGB and LiDAR and provides metric-scale dense depth, 3D bounding boxes, and behavior labels across 29 animal species (Aamir et al., 17 Mar 2026). WildDepth emphasizes metric-scale multimodal perception in animal-centered scenes, whereas WildDet3D-Data emphasizes open-world 3D detection scale through monocular construction over broad internet-style image corpora. This suggests complementary roles: WildDepth is suited to controlled evaluation of depth reliability and multimodal fusion in wildlife settings, while WildDet3D-Data is suited to large-vocabulary open-world monocular 3D detection and promptable category generalization.
Taken together, WildDet3D-Data occupies a distinct position in the 3D perception landscape. Its contribution is not only the number of images or categories, but also a specific annotation philosophy: generate diverse monocular 3D hypotheses, filter them with geometry and language priors, validate a subset with humans, and scale the remainder with VLM scoring (Huang et al., 9 Apr 2026). For research on open-vocabulary 3D detection, long-tail category transfer, and prompt-conditioned monocular spatial understanding, that design is the defining characteristic of the dataset.