3D Monocular Open-set Detector (3D-MOOD)

Updated 3 July 2026

The paper introduces an end-to-end monocular open-set 3D detector that couples open-vocabulary 2D detection with direct 3D cuboid estimation.
It employs geometry-aware queries and differentiable lifting pipelines to transform 2D detections into accurate metric 3D localizations.
Evaluation on diverse benchmarks shows state-of-the-art performance, with improved closed-set and open-set AP metrics across varied domains.

A 3D Monocular Open-set Object Detector (3D-MOOD) addresses the task of open-set instance recognition and spatial localization of objects in 3D from a single RGB image, without restriction to a pre-defined category set or fixed scenes. Such systems establish a new paradigm in computer vision by coupling open-vocabulary 2D detection with 3D cuboid estimation, enabling both closed- and open-set generalization across domains, classes, and environments. 3D-MOOD methods integrate text-conditioned object queries, geometric priors, robust lifting pipelines, and end-to-end joint 2D–3D optimization, achieving leading performance on highly diverse and cross-dataset benchmarks (Yang et al., 31 Jul 2025, Yao et al., 2024, Huang et al., 2024).

1. Problem Setting: Monocular Open-set 3D Detection

Monocular open-set 3D object detection is formalized as follows: Given an input image $I \in \mathbb{R}^{H \times W \times 3}$ and a text prompt $T = [c_1,\dots, c_M]$ defining a vocabulary $\mathcal{C}$ of possible object classes—including both base (seen) and novel (unseen) categories—a detector must produce a list of $N$ instances

$\left\{ (c_i, B_i, s_i) \right\}_{i=1}^N,\qquad c_i \in \mathcal{C},\quad B_i = (t_i, d_i, r_i)$

where $t_i \in \mathbb{R}^3$ (3D center), $d_i \in \mathbb{R}^3$ (dimensions), $r_i \in \mathbb{R}^6$ (SO(3) encoding), and $s_i \in [0,1]$ (confidence). The open-set constraint mandates that $\mathcal{C}$ may include categories never observed in 3D during training, and the detector is expected to localize such objects in metric 3D coordinates (Yao et al., 2024).
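As a minimal sketch of this output format, the tuple $(c_i, B_i, s_i)$ can be held in a small container. The class, field, and function names below (`Box3D`, `Detection3D`, `validate`) are illustrative assumptions and do not come from any of the cited codebases.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Box3D:
    """3D cuboid B = (t, d, r) in camera coordinates (hypothetical container)."""
    center: Tuple[float, float, float]  # t in R^3, metric units
    dims: Tuple[float, float, float]    # d in R^3
    rot6d: Tuple[float, ...]            # r in R^6, 6D encoding of an SO(3) rotation

    def __post_init__(self) -> None:
        if len(self.center) != 3 or len(self.dims) != 3 or len(self.rot6d) != 6:
            raise ValueError("expected t in R^3, d in R^3, r in R^6")
        if any(x <= 0 for x in self.dims):
            raise ValueError("box dimensions must be positive")


@dataclass(frozen=True)
class Detection3D:
    label: str    # c, one of the prompted class names
    box: Box3D    # B
    score: float  # s in [0, 1]


def validate(dets: List[Detection3D], vocabulary: List[str]) -> None:
    """Check that every detection uses a prompted class and a valid score."""
    vocab = set(vocabulary)
    for det in dets:
        if det.label not in vocab:
            raise ValueError(f"label {det.label!r} is not in the prompt vocabulary")
        if not 0.0 <= det.score <= 1.0:
            raise ValueError("score must lie in [0, 1]")
```

Note that the vocabulary check covers base and novel classes alike: the open-set setting only changes which classes had 3D supervision during training, not the shape of the output.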

This problem departs from prior closed-set 3D detectors, which are generally restricted to categories and domains seen in training, and require extensive 3D-labelled data. Open-set approaches enable zero-shot detection for novel categories, typically through the integration of pretrained open-vocabulary 2D detectors, class-agnostic lifting mechanisms, and open-vocabulary classification heads (Yao et al., 2024, Yang et al., 31 Jul 2025, Huang et al., 2024).

2. Architectural Principles of State-of-the-Art 3D-MOOD

3D-MOOD (Yang et al., 31 Jul 2025) encapsulates a fully end-to-end design that couples open-vocabulary 2D object detection with a geometric lifting module, yielding 3D box predictions directly in camera coordinates. The key architectural components include:

Input: An RGB image $I$ and a set of language prompts $T$ (“detect <object class>”).
2D Open-set Backbone: A transformer-based open-vocabulary detector (Grounding-DINO style), comprising a Swin Transformer image encoder, a BERT text encoder, and multi-layer fusion that produces 2D object queries.
3D Bounding-Box Head: Stacked MLPs operate per decoder layer, taking either the 2D queries or the geometry-aware 3D queries and outputting a 12D vector (2 projected-center offsets, 1 scaled log-depth, 3 log-dimensions, 6D rotation).
Differentiable Lifting: A lifting function combines the predicted 2D box, the camera intrinsics $K$, and the predicted 3D parameters to yield a 3D box $B$ in camera coordinates.
Geometry-aware 3D Queries: Each 2D query is conditioned via cross-attention on camera intrinsics and transformer depth features to form a geometry-aware 3D query.
Auxiliary Depth Head: A parallel FPN+transformer branch predicts a dense metric depth map for dense supervision.
End-to-End Training: 2D detection losses (classification and box regression), 3D box losses, and a scale-invariant log depth loss are jointly optimized.

This design achieves tight coupling of 2D instance recognition and 3D localization, providing superior robustness and generalizability compared to conventional two-stage or pseudo-labeling pipelines (Yang et al., 31 Jul 2025).

3. Lifting Procedures and Geometric Conditioning

Central to 3D-MOOD and related methods is the geometric lifting of 2D region proposals or queries into 3D cuboid space. The process involves:

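For a 2D detection with box corners $(x_1, y_1, x_2, y_2)$, predicted offsets $(\hat{u}, \hat{v})$, and scaled log-depth $\hat{z}$ (with scale $s$, taking $\hat{z} = s \log z$), the projected center $\hat{c}$ and depth $z$ are decoded, and the 3D center $t$ is obtained by unprojecting through the camera intrinsics $K$:

$\hat{c} = \left( \frac{x_1 + x_2}{2} + \hat{u},\ \frac{y_1 + y_2}{2} + \hat{v} \right), \qquad z = \exp\left( \hat{z} / s \right)$

$t = z\, K^{-1} \left[ \hat{c}_x,\ \hat{c}_y,\ 1 \right]^\top$

Object dimensions $w$, $l$, $h$ are recovered by exponentiating the normalized MLP outputs.
The 6D rotation output $r \in \mathbb{R}^6$ is mapped onto SO(3) via the standard 6D-to-rotation conversion (Yang et al., 31 Jul 2025).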
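The lifting steps above can be sketched in plain Python. This is a hedged illustration rather than the authors' implementation: the function names, the ordering of the 12D vector, the convention that depth is decoded as `exp(z_hat / depth_scale)`, the default `depth_scale=2.0`, and the column-wise Gram-Schmidt form of the 6D-to-rotation map are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _normalize(v: Sequence[float]) -> Vec3:
    n = math.sqrt(sum(x * x for x in v))
    return (v[0] / n, v[1] / n, v[2] / n)


def rot6d_to_matrix(r: Sequence[float]) -> List[Vec3]:
    """Gram-Schmidt map from a 6D vector to a 3x3 rotation matrix (list of rows)."""
    a1, a2 = r[:3], r[3:6]
    b1 = _normalize(a1)
    dot = sum(x * y for x, y in zip(b1, a2))
    b2 = _normalize([a2[i] - dot * b1[i] for i in range(3)])
    # b3 = b1 x b2 completes a right-handed orthonormal frame.
    b3 = (
        b1[1] * b2[2] - b1[2] * b2[1],
        b1[2] * b2[0] - b1[0] * b2[2],
        b1[0] * b2[1] - b1[1] * b2[0],
    )
    # Row i of the matrix; b1, b2, b3 are its columns.
    return [(b1[i], b2[i], b3[i]) for i in range(3)]


def lift_box(
    box2d: Sequence[float],
    head_out: Sequence[float],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    depth_scale: float = 2.0,  # assumed value, not taken from the paper
) -> Tuple[Vec3, Vec3, List[Vec3]]:
    """Lift a 2D box (x1, y1, x2, y2) and a 12D head output to a 3D box."""
    x1, y1, x2, y2 = box2d
    du, dv, z_hat = head_out[0], head_out[1], head_out[2]
    log_dims, rot6d = head_out[3:6], head_out[6:12]
    # Projected 3D center = 2D box center + predicted offsets.
    u = 0.5 * (x1 + x2) + du
    v = 0.5 * (y1 + y2) + dv
    # Decode metric depth from the scaled log-depth.
    z = math.exp(z_hat / depth_scale)
    # Pinhole unprojection: t = z * K^{-1} [u, v, 1]^T.
    center = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    dims = (math.exp(log_dims[0]), math.exp(log_dims[1]), math.exp(log_dims[2]))
    return center, dims, rot6d_to_matrix(rot6d)
```

Every step is a smooth function of the head outputs (away from degenerate 6D inputs), which is what lets a loss on the final 3D box propagate back to the offsets, depth, dimensions, and rotation.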

Geometry-aware conditioning modifies the 2D queries with camera intrinsics and learned global depth features via cross-attention, producing geometry-aware 3D queries that are more robust to shifts in intrinsics and scene type and that aid generalization in open-set domains.

Auxiliary supervision is provided by a dense metric depth prediction head, furnishing pixel-wise depth targets and stabilizing 3D box learning, particularly in the low-data or distribution-shift regime.

4. Canonical Image Space and Cross-Dataset Generalization

A key challenge in multi-dataset monocular 3D detection is resolving the ambiguity in depth and geometric scale arising from images of diverse resolutions and camera intrinsics. 3D-MOOD addresses this by enforcing a canonical image space during both training and inference:

Each image is resized to a fixed canonical resolution, and the camera intrinsics are transformed accordingly to preserve projection properties.
Cropping and center-padding give every input the same canonical dimensions, eliminating geometric ambiguity and improving cross-dataset and open-domain transfer.
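The intrinsics update that accompanies such a resize-and-pad step can be sketched as follows. This is an assumption-laden illustration: the aspect-preserving resize followed by center-padding, the function names, and the canonical size used in the usage example are hypothetical and are not taken from the paper.

```python
from typing import Tuple

Intrinsics = Tuple[float, float, float, float]  # (fx, fy, cx, cy)


def to_canonical(
    width: int, height: int, intr: Intrinsics, canon_w: int, canon_h: int
) -> Tuple[float, Tuple[int, int], Intrinsics]:
    """Resize to fit inside the canonical size, center-pad, and update intrinsics."""
    fx, fy, cx, cy = intr
    # Aspect-preserving scale so the resized image fits in the canonical frame.
    scale = min(canon_w / width, canon_h / height)
    new_w, new_h = round(width * scale), round(height * scale)
    # Center-padding shifts the principal point by the left/top padding.
    pad_left = (canon_w - new_w) // 2
    pad_top = (canon_h - new_h) // 2
    canon_intr = (fx * scale, fy * scale, cx * scale + pad_left, cy * scale + pad_top)
    return scale, (pad_left, pad_top), canon_intr


def project(point: Tuple[float, float, float], intr: Intrinsics) -> Tuple[float, float]:
    """Pinhole projection of a camera-frame 3D point to pixel coordinates."""
    x, y, z = point
    fx, fy, cx, cy = intr
    return (fx * x / z + cx, fy * y / z + cy)
```

For example, `to_canonical(1920, 1080, (1000.0, 1000.0, 960.0, 540.0), 960, 720)` halves the focal lengths and moves the principal point by the 90-pixel top padding, so a 3D point projects to the same location in the canonical image as its original pixel does after resizing and padding.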

Ablation studies confirm that this canonical treatment improves both closed-set AP on Omni3D and ODS in the open-set settings (Yang et al., 31 Jul 2025). This setup enables efficient and accurate joint training over diverse datasets spanning multiple environments, from indoor scenes (ScanNet, ARKitScenes) to outdoor domains (KITTI, nuScenes).

5. Loss Functions and Training Strategies

3D-MOOD jointly optimizes detection and depth objectives across multiple decoder layers. The total loss aggregated over decoder layers $l = 1, \dots, L$ is

$\mathcal{L} = \sum_{l=1}^{L} \left( \lambda_{2D}\, \mathcal{L}_{2D}^{(l)} + \lambda_{3D}\, \mathcal{L}_{3D}^{(l)} \right) + \lambda_{\text{depth}}\, \mathcal{L}_{\text{depth}}$

where:

$\mathcal{L}_{2D}$ combines L1 box regression, GIoU, and a contrastive classification loss.
$\mathcal{L}_{3D}$ is a regression loss over all geometric parameters (center offsets, depth, dimensions, rotation).
$\mathcal{L}_{\text{depth}}$ is a scale-invariant log depth loss for the dense pixel-wise branch.
$\lambda_{\text{depth}}$ weights the depth loss separately; all other weights share a common value (Yang et al., 31 Jul 2025).
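To make the depth term and the layer-wise aggregation concrete, here is a minimal sketch. It uses the scale-invariant log loss in the form popularized by Eigen et al. (mean squared log difference minus a weighted squared mean); whether 3D-MOOD uses exactly this variant is an assumption, and the function names, the `lam` default, and the unit loss weights are hypothetical placeholders.

```python
import math
from typing import Sequence


def silog_loss(
    pred: Sequence[float], target: Sequence[float], lam: float = 0.5, eps: float = 1e-6
) -> float:
    """Scale-invariant log depth loss over pixels with valid (positive) targets."""
    diffs = [
        math.log(max(p, eps)) - math.log(t)
        for p, t in zip(pred, target)
        if t > 0  # skip pixels without ground-truth depth
    ]
    if not diffs:
        return 0.0
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    # lam = 1 makes the loss fully invariant to a global scale on pred.
    return mean_sq - lam * mean * mean


def total_loss(
    loss_2d_per_layer: Sequence[float],
    loss_3d_per_layer: Sequence[float],
    loss_depth: float,
    w_2d: float = 1.0,     # placeholder weight
    w_3d: float = 1.0,     # placeholder weight
    w_depth: float = 1.0,  # placeholder weight
) -> float:
    """Sum 2D and 3D losses over decoder layers and add the weighted depth loss."""
    return (
        w_2d * sum(loss_2d_per_layer)
        + w_3d * sum(loss_3d_per_layer)
        + w_depth * loss_depth
    )
```

With `lam=1.0`, multiplying every predicted depth by a constant leaves the loss at (numerically) zero when the prediction is otherwise perfect, which is the sense in which the loss is scale-invariant; smaller `lam` values also penalize the global scale error.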

The model is trained via AdamW, with large-batch schedules and extensive image-level augmentations (random scaling, flips). Backbones include Swin-T and Swin-B variants, and frameworks such as Vis4D on PyTorch are used in large-scale experiments.

6. Representative Methods and Baselines

A number of methods form the landscape for open-set monocular 3D object detection, all leveraging a fusion of pretrained open-vocabulary 2D detectors and 3D parameter estimation:

Method	2D Detector	3D Lifting	Open-Vocab	Pseudo-labels
Cube R-CNN	Standard 2D backbone	Regressor head	Closed-set	No
OVM3D-Det (Huang et al., 2024)	Grounded-SAM	Cube R-CNN head	Yes	Yes (pseudo-LiDAR)
OVMono3D-LIFT (Yao et al., 2024)	Grounding DINO	Cube/cube head	Yes	No
3D-MOOD (Yang et al., 31 Jul 2025)	Grounding DINO style	End-to-end MLP	Yes	No

OVM3D-Det (Huang et al., 2024) relies on a full pipeline of open-vocabulary 2D detection and segmentation, monocular depth, pseudo-LiDAR generation with adaptive erosion, PCA box fitting with LLM priors, and final Cube R-CNN open-vocab joint training.

OVMono3D-LIFT (Yao et al., 2024) decouples recognition (image-text open-vocabulary) and localization (class-agnostic 3D lifting), using a learned cube head to directly regress 3D box parameters from ROI features.

3D-MOOD (Yang et al., 31 Jul 2025) realizes a tightly-coupled, end-to-end transformer-based architecture, fusing geometric priors, text conditioning, and depth supervision for robust cross-domain open-set 3D detection.

7. Evaluation Protocols and Quantitative Benchmarks

Comprehensive evaluation employs mean average precision on 3D boxes ($\text{AP}_{3D}$) and specialized open-set metrics such as ODS (open-domain score) and normalized-distance AP ($\text{AP}_{3D}^{\text{dist}}$). Protocols are tailored to the open-vocabulary regime: ground truth is limited to prompted/labeled classes per image to mitigate false penalization due to missing annotation (Yao et al., 2024).
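One way to read the per-image class restriction is that each image is prompted only with the classes that actually carry annotations in it, so that correct detections of unannotated classes are never counted as false positives. The sketch below illustrates that reading only; the function name and data layout are hypothetical and are not the evaluation code of the cited work.

```python
from typing import Dict, List, Sequence


def target_aware_prompts(
    gt_labels: Dict[str, Sequence[str]], vocabulary: Sequence[str]
) -> Dict[str, List[str]]:
    """Per image, keep only the vocabulary classes that have ground-truth annotations."""
    prompts: Dict[str, List[str]] = {}
    for image_id, labels in gt_labels.items():
        present = set(labels)
        # Preserve vocabulary order so prompts are deterministic.
        prompts[image_id] = [c for c in vocabulary if c in present]
    return prompts
```

For instance, an image annotated only with chairs and tables would be prompted with just those two classes, even if the full vocabulary also contains cars and sofas.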

Closed-set benchmarks (e.g., Omni3D), reported in $\text{AP}_{3D}$, compare:

Cube R-CNN
Uni-MODE
3D-MOOD (Swin-T)
3D-MOOD (Swin-B) (state of the art) (Yang et al., 31 Jul 2025)

Open-set, cross-domain results on Argoverse 2 and ScanNet, each reported as normalized-distance AP and ODS, compare:

Cube R-CNN
OVM3D-Det
3D-MOOD (Swin-B), with a further breakdown into Base and Novel classes (Yang et al., 31 Jul 2025)

Ablation studies validate that canonical image space, geometry-aware queries, and auxiliary depth supervision each make measurable contributions to final performance.

8. Impact, Challenges, and Significance

The advent of 3D Monocular Open-set Object Detectors, exemplified by 3D-MOOD, marks a significant progression in scalable visual understanding. These systems:

Remove the requirement of exhaustive 3D annotation for every class/domain.
Provide a unified framework for joint 2D open-vocabulary recognition and 3D metric localization.
Exhibit strong generalization abilities—zero-shot or cross-domain transfer—via geometry-aware representations and depth priors.
Achieve state-of-the-art open-set AP metrics across a variety of in- and out-of-distribution datasets.

However, challenges persist, notably in reliable 3D estimation for distant, small, or heavily occluded novel objects; precision of monocular depth predictors; potential annotation gaps; and robustness to diverse real-world camera intrinsics and scene geometries (Yang et al., 31 Jul 2025, Huang et al., 2024). The canonical image space, auxiliary dense depth branches, and open-vocabulary text alignment heads represent current leading approaches for mitigating these issues.

Ongoing research focuses on improved cross-dataset adaptation, novel-class 3D recognition, and leveraging synthetic or large-scale weakly-labelled data to further advance the reliability and utility of monocular open-set 3D detectors.

References

Yang et al. (2025). 3D-MOOD: Lifting 2D to 3D for Monocular Open-Set Object Detection.

Yao et al. (2024). Open Vocabulary Monocular 3D Object Detection.

Huang et al. (2024). Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data.
