WildDet3D-Bench: Open-Vocabulary 3D Detection

Updated 4 July 2026

The paper introduces WildDet3D-Bench, a benchmark that evaluates open-vocabulary 3D detection using multiple prompt modalities and optional depth cues.
WildDet3D-Bench is derived from a large-scale in-the-wild dataset covering 700+ categories with human-verified 3D annotations, highlighting long-tail behavior and scene diversity.
It establishes a joint 2D+3D prediction protocol and specialized evaluation metrics, emphasizing the importance of geometric supervision and flexible prompting in 3D detection.

WildDet3D-Bench is an in-the-wild evaluation benchmark for promptable open-vocabulary monocular 3D detection introduced with "WildDet3D: Scaling Promptable 3D Detection in the Wild" (Huang et al., 9 Apr 2026). It is a validation-set-derived benchmark built from the authors’ large-scale WildDet3D-Data and is intended to evaluate whether a detector can recover 3D extent, location, and orientation from a single RGB image under flexible prompting conditions. Its role is to measure open-world capability under text prompts, point clicks, 2D boxes, and optional depth cues across long-tailed categories and diverse scenes, addressing the limitations of prior 3D detection benchmarks that emphasize limited categories, controlled domains, or fixed interaction modes (Huang et al., 9 Apr 2026).

1. Scope, motivation, and benchmark definition

WildDet3D-Bench was introduced to address two bottlenecks identified in monocular 3D object detection. The first is a model-side gap: prior 3D detectors usually support only one prompt type, most often text-based open-vocabulary querying or oracle 2D boxes, and do not naturally incorporate extra geometry such as depth at inference time. The second is a data-side gap: existing 3D datasets are either small in category coverage or restricted to a few domains such as driving or indoor scenes, making them poor proxies for open-world deployment (Huang et al., 9 Apr 2026).

Within that framing, WildDet3D-Bench evaluates open-vocabulary 3D object detection in the wild under flexible prompts. The benchmark is explicitly designed to test whether a model can detect hundreds of open-vocabulary categories, handle long-tailed category frequency, operate in messy, unconstrained, in-the-wild imagery, and benefit from optional geometric cues when they are available. This places the benchmark at the intersection of open-vocabulary recognition, monocular geometry, promptable interaction, and partially supervised evaluation.

A plausible implication is that WildDet3D-Bench functions not merely as a dataset split but as an evaluation protocol for open-world 3D perception. In that sense, its contribution is methodological as well as empirical: it specifies what kinds of prompting, matching, and ignore-aware handling are necessary to assess monocular 3D detectors beyond closed-set settings.

2. Dataset composition, category structure, and scene diversity

WildDet3D-Bench is derived from the validation split of WildDet3D-Data. It spans 700+ open-vocabulary categories drawn from COCO, LVIS, and Objects365, with human-verified 3D annotations (Huang et al., 9 Apr 2026). The benchmark categories are grouped by annotation frequency into three strata: rare, common, and frequent. The reported counts are 464 categories with fewer than 5 samples, 283 categories with 5–20 samples, and 63 categories with more than 20 samples.

Frequency group	Definition	Categories
rare	fewer than 5 samples	464
common	5–20 samples	283
frequent	more than 20 samples	63

This frequency stratification is central to the benchmark’s design because it exposes long-tail behavior directly in the evaluation protocol. The rare/common/frequent partition is not an auxiliary statistic; it is tied to the reported metrics through $\text{AP}_\text{rare}$, $\text{AP}_\text{common}$, $\text{AP}_\text{frequent}$, and overall $\text{AP}_\text{3D}$.
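The stratification rule can be sketched in a few lines of Python. The function name `frequency_group` and the per-category counts in `category_counts` are hypothetical and chosen only for illustration; the three thresholds and the reported stratum sizes come from the table above.

```python
from collections import Counter


def frequency_group(num_samples: int) -> str:
    """Map a category's annotation count to its benchmark stratum."""
    if num_samples < 5:
        return "rare"
    if num_samples <= 20:
        return "common"
    return "frequent"


# Hypothetical per-category annotation counts, for illustration only.
category_counts = {"teapot": 3, "backpack": 12, "chair": 57, "kayak": 5, "lamp": 20}

groups = {name: frequency_group(n) for name, n in category_counts.items()}
group_sizes = Counter(groups.values())

# Stratum sizes as reported for the benchmark.
reported = {"rare": 464, "common": 283, "frequent": 63}
total_categories = sum(reported.values())  # 810, consistent with "700+"
```

Summing the reported stratum sizes gives 810 categories, which agrees with the "700+" figure quoted for the benchmark.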

The underlying data is explicitly described as “in the wild,” with broad scene coverage. The larger WildDet3D-Data covers 22 scene categories, and the paper highlights macro groupings of Indoor: 52%, Urban: 32%, and Nature: 15% (Huang et al., 9 Apr 2026). The benchmark therefore mixes scene types rather than specializing to a single domain such as autonomous driving or indoor scanning. This suggests that WildDet3D-Bench is intended to stress transfer across object scales, layouts, viewpoints, and background clutter rather than reward domain-specific optimization.

3. Annotation pipeline and evaluation protocol

WildDet3D-Bench inherits the broader annotation pipeline used in WildDet3D-Data. Candidate 3D boxes are first generated from existing 2D annotations using multiple methods, then filtered with geometric and semantic rules. Final labels are selected either by human annotators or by a VLM-based selection model when human annotation is not used. For WildDet3D-Bench specifically, the validation split used for the benchmark contains human-verified 3D annotations (Huang et al., 9 Apr 2026).

The human verification process is defined in detail. Annotators view each candidate from four viewpoints: a perspective overlay on the input image and three orthographic point-cloud views. They choose the best candidate and rate it as good_fit, acceptable, or unacceptable. Quality-control batches include gold tasks, and workers failing gold thresholds are discarded and reassigned. These details matter because the benchmark is built from candidate generation followed by verification rather than direct 3D annotation from scratch.

The evaluation protocol is adapted to open-world 3D detection. It uses text prompt and box prompt settings and reports $\text{AP}_\text{3D}$ computed using center-distance matching rather than standard 3D IoU. The center-distance criterion is defined as

$\|\hat{\mathbf{c}} - \mathbf{c}^*\| < \tau \cdot r$

with

$r = \frac{\|\mathbf{d}^*\|_2}{2}$

and thresholds

$\tau \in \{0.50, 0.55, \dots, 1.00\}.$
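A minimal sketch of this matching test follows. It assumes that $\hat{\mathbf{c}}$ is the predicted 3D center, $\mathbf{c}^*$ the ground-truth center, and $\mathbf{d}^*$ the ground-truth box dimensions, so that $r$ is half the box diagonal; the helper names and the toy numbers are illustrative and not taken from the paper.

```python
import math

# Thresholds tau in {0.50, 0.55, ..., 1.00}: 11 values.
TAUS = [round(0.50 + 0.05 * i, 2) for i in range(11)]


def match_radius(gt_dims):
    """r = ||d*||_2 / 2, i.e. half the diagonal of the ground-truth box."""
    return math.sqrt(sum(d * d for d in gt_dims)) / 2.0


def is_match(pred_center, gt_center, gt_dims, tau):
    """True if the predicted center lies within tau * r of the ground-truth center."""
    return math.dist(pred_center, gt_center) < tau * match_radius(gt_dims)


# Toy example (made-up numbers): a 2 x 1 x 2 box with r = 1.5,
# and a prediction whose center is about 1.0 away from the ground truth.
gt_center, gt_dims = (0.0, 0.0, 10.0), (2.0, 1.0, 2.0)
pred_center = (0.6, 0.0, 10.8)
matches = {tau: is_match(pred_center, gt_center, gt_dims, tau) for tau in TAUS}
```

In this toy case the prediction fails the stricter thresholds (for example $\tau = 0.50$, where the allowed distance is 0.75) and passes from $\tau = 0.70$ upward, which illustrates how the threshold sweep grades localization quality without requiring 3D IoU.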

Because annotations are not exhaustive, WildDet3D-Bench uses federated/ignore-aware evaluation: predictions overlapping 2D-annotated objects without valid 3D boxes are treated as neutral, not false positives (Huang et al., 9 Apr 2026). This is a consequential design choice for open-world evaluation, since otherwise incompletely annotated regions would systematically penalize detectors for discovering plausible objects outside the validated 3D subset.
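The ignore-aware rule can be illustrated with a small labeling function. This is a sketch under stated assumptions: overlap is measured here as 2D IoU against a threshold `ignore_iou` of 0.5, and both the overlap measure and the threshold are hypothetical choices rather than values confirmed by the source.

```python
def iou_2d(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def label_prediction(pred_box_2d, matched_3d, ignore_boxes_2d, ignore_iou=0.5):
    """Classify one prediction as true positive, ignored, or false positive.

    ignore_boxes_2d holds 2D-annotated objects that lack a valid 3D box.
    The IoU criterion and its threshold are assumptions for illustration.
    """
    if matched_3d:
        return "tp"
    if any(iou_2d(pred_box_2d, g) >= ignore_iou for g in ignore_boxes_2d):
        return "ignored"  # neutral: counts neither for nor against the detector
    return "fp"


ignore_boxes = [(0.0, 0.0, 10.0, 10.0)]
label_on_ignore = label_prediction((1.0, 1.0, 10.0, 10.0), False, ignore_boxes)
label_elsewhere = label_prediction((50.0, 50.0, 60.0, 60.0), False, ignore_boxes)
```

Predictions labeled `"ignored"` would simply be dropped before precision and recall are computed, so they neither raise nor lower AP.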

4. Prompt modalities and the associated detection model

The benchmark is paired with WildDet3D, a geometry-aware promptable 3D detection system that supports multiple prompt types within a single architecture (Huang et al., 9 Apr 2026). The supported prompt modalities are Text prompt, Point prompt, Box prompt, and Exemplar prompt. The text prompt is a category name such as “car” and selects all instances of that category. The point prompt consists of one or more positive or negative 2D clicks and selects a single object at that location. The box prompt is a 2D bounding box that selects the object inside the box. The exemplar prompt is a 2D box treated as a visual exemplar and detects visually similar objects.
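One way to picture the four prompt modalities is as a small set of data types. The class and field names below are hypothetical and do not reflect the authors' code; only the semantics of each prompt follow the description above.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Box2D = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class TextPrompt:
    category: str  # e.g. "car"; selects all instances of the category


@dataclass
class PointPrompt:
    points: List[Tuple[float, float]]  # 2D click locations
    positive: List[bool]  # True for a positive click, False for a negative one


@dataclass
class BoxPrompt:
    box: Box2D  # selects the object inside the box


@dataclass
class ExemplarPrompt:
    box: Box2D  # visual exemplar; detects visually similar objects


Prompt = Union[TextPrompt, PointPrompt, BoxPrompt, ExemplarPrompt]


def selects_multiple_instances(prompt: Prompt) -> bool:
    """Text and exemplar prompts can return many objects; point and box prompts target one."""
    return isinstance(prompt, (TextPrompt, ExemplarPrompt))


prompts = [
    TextPrompt("car"),
    PointPrompt([(120.0, 80.0)], [True]),
    BoxPrompt((10.0, 20.0, 110.0, 220.0)),
    ExemplarPrompt((10.0, 20.0, 110.0, 220.0)),
]
multi_flags = [selects_multiple_instances(p) for p in prompts]
```

Note that a box prompt and an exemplar prompt carry the same geometric payload and differ only in how the detector is asked to interpret it.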

For WildDet3D-Bench, the paper emphasizes text prompt and box prompt evaluation. Depth cues are not treated as prompts in the same sense, but the model can also take optional partial or full depth maps at inference time. These depth cues may come from LiDAR, stereo, ToF, or ground-truth depth in controlled evaluation.

The WildDet3D architecture has three main components. The first is a dual-vision encoder with two parallel streams: an Image encoder using SAM3-style ViT-H + SimpleFPN to extract semantic image features, and an RGBD encoder using DINOv2 ViT-L/14 operating on 4-channel RGBD input to produce depth latents. The RGBD encoder can operate even without external depth by using a zero-filled depth channel.
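The zero-filled depth channel can be sketched as follows. The function `make_rgbd` is a hypothetical helper, and treating missing pixels of a partial depth map as zeros is an assumption made here for illustration.

```python
def make_rgbd(rgb, depth=None):
    """Build a 4-channel RGBD image from nested lists.

    rgb:   H x W list of (r, g, b) tuples.
    depth: H x W list of floats, with None for missing pixels, or None
           when no external depth is available (monocular mode).
    """
    height, width = len(rgb), len(rgb[0])
    rgbd = []
    for y in range(height):
        row = []
        for x in range(width):
            d = None if depth is None else depth[y][x]
            # Missing depth falls back to zero (assumed convention).
            row.append((*rgb[y][x], 0.0 if d is None else float(d)))
        rgbd.append(row)
    return rgbd


rgb = [[(1, 2, 3), (4, 5, 6)]]
monocular_input = make_rgbd(rgb)                # depth channel all zeros
partial_input = make_rgbd(rgb, [[2.5, None]])   # depth known for one pixel only
```

The same encoder input shape is used in both cases, which is what allows depth to remain optional at inference time.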

The second component is a Depth fusion module, which fuses depth latents into visual features through a ControlNet-style residual add:

$\mathbf{V}' = \mathbf{V} + \text{Conv}_{1\times1}\!\big(\text{LN}(\mathbf{Z}_d^{\uparrow})\big).$

Here, $\mathbf{Z}_d^{\uparrow}$ is the upsampled depth latent, LN normalizes depth latents, and the $1\times1$ convolution is zero-initialized. The stated purpose is to make depth optional and allow the model to degrade gracefully to monocular mode.
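A dependency-free sketch of this residual fusion is given below. It assumes the depth latent has already been upsampled to the resolution of the visual features, uses a layer norm without learnable affine parameters, and represents each feature map as a list of per-pixel channel vectors; all function names are illustrative.

```python
import math


def layer_norm(vec, eps=1e-5):
    """Normalize one channel vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def conv1x1(vec, weight, bias):
    """A 1x1 convolution acts at each pixel as a linear map over channels."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def fuse_depth(visual, depth_latent, weight, bias):
    """V' = V + Conv1x1(LN(Z_d)), applied independently at each spatial position."""
    fused = []
    for v, z in zip(visual, depth_latent):
        residual = conv1x1(layer_norm(z), weight, bias)
        fused.append([a + r for a, r in zip(v, residual)])
    return fused


# Two spatial positions, 3 visual channels, 2 depth-latent channels.
visual = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
depth_latent = [[3.0, 1.0], [0.2, 0.9]]

# Zero-initialized projection (3 output channels x 2 input channels).
zero_weight = [[0.0, 0.0] for _ in range(3)]
zero_bias = [0.0, 0.0, 0.0]

fused_at_init = fuse_depth(visual, depth_latent, zero_weight, zero_bias)
```

With the projection at its zero initialization the residual vanishes and `fused_at_init` equals `visual`, which shows why the fused model starts out identical to the depth-free one and can learn to use depth gradually.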

The third component is a Promptable detector + 3D head. The promptable detector encodes text tokens, point/box geometry, and exemplar prompts. The 3D head then lifts prompted 2D queries into 3D using camera intrinsics and ray features, depth latents, and a 3D regression head. Each output 3D box consists of a 3D center, dimensions, orientation, and a confidence score, with the regression target encoded as a 12D vector consisting of center offset, log depth, log dimensions, and a 6D rotation representation (Huang et al., 9 Apr 2026).
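The 12D encoding can be made concrete with a small decoder. This is a sketch under explicit assumptions: the layout is taken to be 2 values of center offset, 1 of log depth, 3 of log dimensions, and 6 of rotation (which sums to 12), and the 6D rotation is converted with the standard Gram-Schmidt construction for continuous rotation representations. The ordering and the function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a 3D vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def rotation_from_6d(r6):
    """Gram-Schmidt: two 3D vectors -> three orthonormal basis vectors."""
    a1, a2 = r6[:3], r6[3:]
    b1 = normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = normalize([y - dot * x for x, y in zip(b1, a2)])
    b3 = [  # cross product b1 x b2 completes a right-handed frame
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    ]
    return [b1, b2, b3]


def decode_box(target12):
    """Decode an assumed [offset(2), log depth(1), log dims(3), rot6d(6)] layout."""
    return {
        "center_offset": target12[0:2],
        "depth": math.exp(target12[2]),
        "dims": [math.exp(v) for v in target12[3:6]],
        "rotation": rotation_from_6d(target12[6:12]),
    }


# Toy target: depth 10, dimensions (2, 1, 4), identity orientation.
target = [0.0, 0.0, math.log(10.0), math.log(2.0), math.log(1.0), math.log(4.0),
          1, 0, 0, 0, 1, 0]
box = decode_box(target)
```

Predicting depth and dimensions in log space keeps the decoded values positive, and the 6D form avoids the discontinuities of Euler angles or quaternions; whether the basis vectors are stacked as rows or columns of the rotation matrix is a convention left open here.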

The reported design choices are also part of the benchmark context because they explain how promptability is operationalized. The paper states that joint 2D and 3D prediction is crucial, one-to-many matching improves supervision, deep supervision at each decoder layer helps convergence, and auxiliary depth and 2D detection losses significantly improve 3D performance.

5. Metrics, formulas, and reported benchmark results

The headline WildDet3D-Bench results reported for the full WildDet3D model trained on Omni3D, supplementary datasets (“Others”), and WildDet3D-Data are 22.6 $\text{AP}_\text{3D}$ for Text prompt and 24.8 $\text{AP}_\text{3D}$ for Box prompt (Huang et al., 9 Apr 2026). The benchmark compares WildDet3D against 3D-MOOD for text prompts and OVMono3D-LIFT and DetAny3D for box prompts.

Setting	Method	Result
Text prompt	3D-MOOD	2.3 AP
Text prompt	WildDet3D trained only on Omni3D	6.8 AP
Text prompt	WildDet3D + extra data	22.6 AP
Box prompt	OVMono3D-LIFT	7.7 AP
Box prompt	DetAny3D	7.8 AP
Box prompt	WildDet3D trained only on Omni3D	8.4 AP
Box prompt	WildDet3D + extra data	24.8 AP

For the full model with extra data, the frequency-split results are reported as follows. Under Text prompt, the benchmark reports rare: 28.3, common: 21.6, frequent: 18.7, and overall: 22.6. Under Box prompt, it reports rare: 30.0, common: 24.2, frequent: 20.3, and overall: 24.8. The paper notes that the rare-category numbers are especially notable because they show strong long-tail generalization.

WildDet3D-Bench also quantifies the effect of depth at inference time. Adding GT depth yields 41.6 AP for Text prompt and 47.2 AP for Box prompt, and the paper reports an average gain of +20.7 AP across settings (Huang et al., 9 Apr 2026). This indicates that the benchmark is not purely a monocular stress test in the narrow sense; it also measures whether a detector can exploit auxiliary geometry without changing the underlying architecture.
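The reported average can be checked against the per-prompt numbers quoted above. The short calculation below uses only those figures; the variable names are illustrative.

```python
# AP values quoted above for the full model, without and with GT depth.
monocular = {"text": 22.6, "box": 24.8}
with_gt_depth = {"text": 41.6, "box": 47.2}

# Per-prompt gain from adding ground-truth depth at inference time.
gains = {k: with_gt_depth[k] - monocular[k] for k in monocular}
average_gain = sum(gains.values()) / len(gains)

print({k: round(v, 1) for k, v in gains.items()}, round(average_gain, 1))
```

The gains are 19.0 AP for text prompts and 22.4 AP for box prompts, and their mean is 20.7 AP, which is consistent with the reported +20.7 figure being an average over the two prompt settings.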

The paper also defines scoring formulas for the WildDet3D system, namely a 3D confidence target and a final detection score; the exact expressions are given in Huang et al. (9 Apr 2026). In addition, it reports ODS for zero-shot transfer on Argoverse 2 and ScanNet.

Although ODS is not the WildDet3D-Bench metric itself, its inclusion situates the benchmark within a broader evaluation framework for open-world transfer.

6. Relation to prior benchmarks, limitations, and implications

WildDet3D-Bench is distinguished from prior 3D detection benchmarks along several axes. The paper states that Omni3D, although broad compared with older 3D benchmarks, still has only 98 categories, is anchored in a limited set of datasets and domains, and does not stress open-vocabulary scale to the same degree. By contrast, WildDet3D-Bench has 700+ categories, is explicitly in-the-wild, and includes broader scene diversity and stronger long-tail behavior (Huang et al., 9 Apr 2026).

Relative to domain-specific benchmarks such as KITTI, nuScenes, SUNRGBD, ScanNet, and ARKitScenes, WildDet3D-Bench measures transfer across indoor, urban, nature, cluttered consumer imagery, and mixed real-world scenes. The benchmark is therefore intended to evaluate open-vocabulary category recognition, long-tail detection, prompt flexibility, localization in clutter, scale ambiguity, orientation ambiguity, partial annotation and ignore-region handling, and the benefit from optional depth cues.

The paper is explicit about limitations. Camera intrinsics predicted by the model are less accurate than GT calibration; single-image depth ambiguity remains a fundamental problem; rotation estimation is still the weakest part of the box prediction; the model is computationally heavy because it uses dual backbones; performance on rare categories is still weaker than on common categories; and the system is not intended for safety-critical deployment (Huang et al., 9 Apr 2026). Common failure modes include incorrect depth on far or occluded objects, unstable rotation for symmetric or partially visible objects, reduced accuracy when calibration is missing, and long-tail categories with very few examples.

The ablation findings reinforce the benchmark’s methodological claims. The paper reports that box prompts outperform text prompts when no depth is given, suggesting that 2D localization remains a bottleneck for text-only open-vocabulary detection. On WildDet3D-Bench, this is reflected in 22.6 AP for the full model with text prompts versus 24.8 AP with box prompts. The ablations on Omni3D further show that removing the 2D head causes the largest drop, removing one-to-many matching or the geometry loss also hurts, and removing the 3D confidence head hurts performance modestly, while deep supervision and ignore-aware suppression provide smaller gains. The paper’s stated takeaway is that joint 2D+3D prediction and good geometric supervision are essential.

Taken together, WildDet3D-Bench reorients monocular 3D detection evaluation away from narrow, closed-set settings toward a regime characterized by many categories, diverse scenes, flexible prompts, partial annotations, and optional depth cues. A plausible implication is that future progress in 3D detection in the wild will require better open-vocabulary supervision, stronger handling of prompt diversity, more reliable depth and calibration estimation, and larger, more verified real-world datasets.

References (1)

WildDet3D: Scaling Promptable 3D Detection in the Wild (2026)
