
MonoRUn: Monocular 3D Object Detection by Reconstruction and Uncertainty Propagation (2103.12605v2)

Published 23 Mar 2021 in cs.CV

Abstract: Object localization in 3D space is a challenging aspect of monocular 3D object detection. Recent advances in 6DoF pose estimation have shown that predicting dense 2D-3D correspondence maps between the image and an object 3D model, and then estimating object pose via the Perspective-n-Point (PnP) algorithm, can achieve remarkable localization accuracy. Yet these methods rely on training with ground truth of object geometry, which is difficult to acquire in real outdoor scenes. To address this issue, we propose MonoRUn, a novel detection framework that learns dense correspondences and geometry in a self-supervised manner, with simple 3D bounding box annotations. To regress the pixel-related 3D object coordinates, we employ a regional reconstruction network with uncertainty awareness. For self-supervised training, the predicted 3D coordinates are projected back to the image plane. A Robust KL loss is proposed to minimize the uncertainty-weighted reprojection error. During the testing phase, we exploit the network uncertainty by propagating it through all downstream modules. More specifically, the uncertainty-driven PnP algorithm is leveraged to estimate object pose and its covariance. Extensive experiments demonstrate that our proposed approach outperforms current state-of-the-art methods on the KITTI benchmark.

Citations (118)

Summary

  • The paper introduces a novel approach that reconstructs dense 3D object coordinates from 2D detections while propagating uncertainty to optimize self-supervised learning.
  • The method employs a Robust KL Loss to focus on low-uncertainty foreground pixels, mitigating errors from background interference during reconstruction.
  • Evaluated on the KITTI benchmark, the approach achieves state-of-the-art performance in Car detection with efficient runtime and reliable pose covariance estimates.

MonoRUn addresses the challenging problem of monocular 3D object detection, in particular the difficulty of accurate object localization from a single image and the reliance of many existing methods on detailed supervision such as ground-truth 3D models or keypoints. The paper proposes a novel framework that leverages dense 2D-3D correspondence mapping learned through self-supervised reconstruction, coupled with uncertainty estimation and propagation.

The core idea is to extend an off-the-shelf 2D object detector by adding a 3D branch that operates on the Region of Interest (RoI) features within predicted 2D bounding boxes. This 3D branch predicts dense 3D object coordinates for pixels within the RoI, effectively reconstructing the object's geometry and establishing 2D-3D correspondences.

A key challenge in self-supervised reconstruction is handling the background pixels within the RoI, which do not belong to the object and would introduce large errors if simply used for supervision. MonoRUn addresses this by incorporating uncertainty awareness. The network estimates the aleatoric uncertainty (data-dependent noise) of its predictions for the reprojected 2D coordinates. This allows the self-supervision signal to focus on low-uncertainty pixels, which are likely foreground.

For self-supervised training, the predicted 3D object coordinates are projected back to the image plane using ground truth object pose and camera intrinsic parameters. The loss is designed to minimize the error between these reprojected 2D coordinates and their original image positions. To handle the uncertainty and improve training robustness, the paper introduces the Robust KL Loss ($L_\text{RKL}$). This loss is based on the KL divergence between a predicted Gaussian distribution (with learned mean and variance) and a target Dirac distribution. It combines aspects of Gaussian and Laplacian KL losses to be more robust to outliers (like Huber loss) and includes a weight normalization mechanism to prevent issues with decaying uncertainty weights during training.
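The PyTorch sketch below illustrates one way such a loss could look. The branch structure (a quadratic Gaussian-KL term for small normalized errors, a linear Laplacian-KL term beyond a threshold) follows the description above, but the threshold value and the omission of the paper's weight-normalization mechanism are simplifying assumptions, not the paper's exact formulation. The learned $\log\sigma$ term is what lets the network inflate the uncertainty of background pixels and thereby down-weight them in the self-supervision signal.

```python
import torch

def robust_kl_loss(err, log_sigma, delta=1.0):
    """Huber-style robust KL loss sketch in the spirit of MonoRUn's L_RKL.

    err       -- reprojection error (reprojected minus observed 2D coords)
    log_sigma -- predicted log std of the reprojected coordinates, same shape
    delta     -- switch point between the two branches (illustrative assumption)
    """
    sigma = log_sigma.exp()
    z = err.abs() / sigma                # normalized residual
    quad = 0.5 * z ** 2                  # Gaussian-KL branch (small errors)
    lin = delta * z - 0.5 * delta ** 2   # Laplacian-KL branch (outliers),
                                         # chosen to be continuous at z = delta
    return (torch.where(z <= delta, quad, lin) + log_sigma).mean()
```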

The network architecture includes two main branches on top of the 2D detector's features:

  1. Global Extractor: Takes 7x7 RoI features, flattens them, and uses fully connected layers to predict the object's 3D dimensions and a global latent vector.
  2. NOC Decoder: Uses convolutional layers and integrates the global latent vector (similar to Squeeze-and-Excitation networks) to predict dense Normalized Object Coordinates ($\mathbf{x}^\text{NOC}$) and the aleatoric uncertainty (standard deviations of the reprojected 2D coordinates). The final 3D object coordinates ($\mathbf{x}^\text{OC}$) are computed by element-wise multiplying $\mathbf{x}^\text{NOC}$ by the predicted dimensions $\mathbf{d}$ (see the sketch after this list).
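A compact PyTorch sketch of the two branches follows. Channel widths, the RoI resolution, and the five-channel output layout (three NOC channels plus two log-std channels) are illustrative assumptions; the paper's decoder may differ in depth and resolution.

```python
import torch
import torch.nn as nn

class GlobalExtractor(nn.Module):
    """Flattens RoI features; predicts 3D dimensions plus a global latent vector."""
    def __init__(self, in_ch=256, latent=128):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_ch * 7 * 7, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.dim_head = nn.Linear(1024, 3)        # object dimensions d
        self.latent_head = nn.Linear(1024, latent)

    def forward(self, roi_feat):                  # roi_feat: (N, in_ch, 7, 7)
        h = self.fc(roi_feat.flatten(1))
        return self.dim_head(h), self.latent_head(h)

class NOCDecoder(nn.Module):
    """Conv decoder gated by the global latent vector (squeeze-excitation style);
    outputs dense NOCs and per-pixel log-stds of the reprojected 2D coords."""
    def __init__(self, in_ch=256, latent=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.gate = nn.Sequential(nn.Linear(latent, 256), nn.Sigmoid())
        self.out = nn.Conv2d(256, 5, 1)           # 3 NOC + 2 log-std channels

    def forward(self, roi_feat, latent):
        h = self.conv(roi_feat)
        h = h * self.gate(latent)[:, :, None, None]   # channel-wise gating
        out = self.out(h)
        return out[:, :3], out[:, 3:]             # x_NOC, log_sigma

# Final object coordinates per RoI: x_OC = x_NOC * d (broadcast over pixels).
```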

During the testing phase, MonoRUn exploits the estimated uncertainty. It uses an uncertainty-driven PnP algorithm to estimate the object's 6DoF pose ($\mathbf{p}$) from the dense 2D-3D correspondences. This involves solving a Maximum Likelihood Estimation problem where reprojection errors are weighted by the inverse of their predicted variance. Furthermore, the network uncertainty is propagated through the PnP process to estimate the pose covariance matrix ($\mathbf{\Sigma}_{\mathbf{p}^*}$), providing a measure of localization uncertainty. Online covariance calibration is used to make the predicted uncertainty more reliable. Epistemic uncertainty (model uncertainty) is estimated using Monte Carlo dropout, specifically applied to the global extractor or the full reconstruction network, and combined with the aleatoric uncertainty.
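The SciPy sketch below shows the core of such an uncertainty-weighted PnP step: residuals are whitened by the predicted standard deviations (so minimizing them is the Gaussian MLE), and the pose covariance is read off the Gauss-Newton approximation at the optimum. The axis-angle pose parameterization and the need for an external initial pose are assumptions; MonoRUn's actual solver and its online calibration differ in detail.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(pose, pts_3d, K):
    """Project object-frame points; pose = (rx, ry, rz, tx, ty, tz), axis-angle."""
    R = Rotation.from_rotvec(pose[:3]).as_matrix()
    cam = pts_3d @ R.T + pose[3:]
    uv = cam @ K.T
    return uv[:, :2] / uv[:, 2:3]

def uncertainty_pnp(pts_2d, pts_3d, sigma, K, pose0):
    """Refine a pose by minimizing uncertainty-whitened reprojection errors.

    pts_2d, sigma -- observed 2D points and their predicted std devs, (M, 2)
    pts_3d        -- predicted 3D object coordinates, (M, 3)
    pose0         -- initial pose guess (e.g., from a closed-form PnP solver)
    """
    def residuals(pose):
        return ((project(pose, pts_3d, K) - pts_2d) / sigma).ravel()

    sol = least_squares(residuals, pose0)   # Gauss-Newton/LM refinement
    J = sol.jac                              # whitened Jacobian at the optimum
    cov = np.linalg.inv(J.T @ J)             # approximate pose covariance
    return sol.x, cov
```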

A separate Scoring Head, implemented as an MLP, predicts a 3D localization confidence score $c_\text{3DLoc}$. This head takes both the estimated pose uncertainty and the global features as input, learning to predict a score correlated with the 3D IoU between the predicted and ground truth boxes. This score is combined with the 2D detection score.
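A minimal sketch of such a head is shown below; the input dimensions (a flattened pose covariance plus the global feature vector) and the multiplicative fusion with the 2D score are assumptions about how the described inputs could be wired together.

```python
import torch
import torch.nn as nn

class ScoringHead(nn.Module):
    """MLP mapping pose uncertainty plus global features to a 3D confidence."""
    def __init__(self, unc_dim=21, feat_dim=128):
        # unc_dim: e.g., the flattened upper triangle of the 6x6 pose covariance
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(unc_dim + feat_dim, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, pose_unc, global_feat):
        # Trained to track the 3D IoU with the ground-truth box.
        return self.mlp(torch.cat([pose_unc, global_feat], dim=1))

# score_final = score_2d * c_3dloc  (one plausible fusion; an assumption)
```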

The network can be trained in different setups: fully self-supervised (using only the reprojection loss), with additional LiDAR supervision (using sparse ground-truth NOCs derived from LiDAR points), or with end-to-end PnP training (using a differentiable PnP layer and applying the loss directly to the pose error). Ablation studies show that while self-supervision alone is effective, combining it with LiDAR supervision yields the best performance. The proposed Robust KL Loss is shown to be superior to standard L1/L2 or Laplacian KL losses for this task. End-to-end training, despite being theoretically sound, did not outperform the self-supervised approach with the Robust KL Loss.

Evaluated on the KITTI-Object benchmark, MonoRUn achieves state-of-the-art results for Car detection, particularly for the more challenging moderate and hard difficulties. It also shows competitive performance on Pedestrian detection. A significant advantage is its runtime efficiency compared to methods relying on pre-computed depth maps. The framework demonstrates that accurate monocular 3D detection is achievable in real driving scenes using dense correspondence learned via self-supervision, and highlights the value of incorporating uncertainty estimation throughout the pipeline, from reconstruction to pose estimation and scoring. The explicit estimation of pose covariance opens possibilities for downstream tasks that require probabilistic localization.