
BoxDreamer: Dreaming Box Corners for Generalizable Object Pose Estimation

Published 10 Apr 2025 in cs.CV | (2504.07955v1)

Abstract: This paper presents a generalizable RGB-based approach for object pose estimation, specifically designed to address challenges in sparse-view settings. While existing methods can estimate the poses of unseen objects, their generalization ability remains limited in scenarios involving occlusions and sparse reference views, restricting their real-world applicability. To overcome these limitations, we introduce corner points of the object bounding box as an intermediate representation of the object pose. The 3D object corners can be reliably recovered from sparse input views, while the 2D corner points in the target view are estimated through a novel reference-based point synthesizer, which works well even in scenarios involving occlusions. As object semantic points, object corners naturally establish 2D-3D correspondences for object pose estimation with a PnP algorithm. Extensive experiments on the YCB-Video and Occluded-LINEMOD datasets show that our approach outperforms state-of-the-art methods, highlighting the effectiveness of the proposed representation and significantly enhancing the generalization capabilities of object pose estimation, which is crucial for real-world applications.

Summary

  • The paper introduces BoxDreamer, which uses 3D bounding box corners as a novel intermediate representation for robust 6D object pose estimation.
  • It integrates a Transformer-based architecture with heatmap predictions to effectively handle sparse views and severe occlusions.
  • Experimental results demonstrate significant performance gains over baselines on challenging datasets like LINEMOD and Occluded-LINEMOD.

This paper introduces BoxDreamer, a novel framework for generalizable 6D object pose estimation from RGB images, specifically designed to perform well even with sparse reference views and significant occlusions (2504.07955). Existing methods, categorized as retrieval-based or matching-based, struggle in these challenging conditions. Retrieval-based methods fail when dense reference views are unavailable or when occlusion prevents finding a good match. Matching-based methods require complete point clouds, which are difficult to reconstruct accurately from sparse views, and matching fails under occlusion.

BoxDreamer proposes using the eight corners of an object's 3D bounding box as a robust intermediate representation for pose estimation. The core idea is that these 3D corners can be reliably estimated even from sparse reference views, and their 2D projections in the query view can be predicted effectively, even with occlusions.

Methodology:

  1. 3D Bounding Box Recovery: Given a sparse set of reference images $\{I_0, \dots, I_i\}$ with object detections $\{\mathcal{M}_0, \dots, \mathcal{M}_i\}$, an off-the-shelf sparse-view reconstruction method (such as DUSt3R (Wang et al., 2024)) is used to obtain a sparse point cloud $\mathbf{P}$ of the object and the corresponding reference camera poses $\{\mathbf{\xi}_0, \dots, \mathbf{\xi}_i\}$. Points outside the object masks in the reference views are filtered out. The axis-aligned 3D bounding box $\mathbf{B}$ enclosing the filtered points $\tilde{\mathbf{P}}$ is then computed in an object-centric coordinate system.
  2. 2D Heatmap Representation: Instead of using the 3D corner coordinates directly, the corners $\mathbf{B}$ are projected onto each reference image plane using the known poses $\mathbf{\xi}_i$ and camera intrinsics $K$. To create a denser, smoother representation suitable for learning, an 8-channel heatmap $\mathbf{H}_i$ is generated for each reference view. Each channel corresponds to one corner, with heatmap values determined by a Gaussian-like function centered at the projected 2D corner location $\mathbf{b}_i = (x_i, y_i)$: $\mathbf{H}(x, y, i) = \exp\Bigl(-\frac{\sqrt{(x - x_i)^{2} + (y - y_i)^{2}}}{2\sigma^{2}}\Bigr)$, where $\sigma$ is adapted based on the object size. This provides smoother supervision than a one-hot encoding or the standard CornerNet (Law et al., 2018) scheme.
  3. Query Corner Prediction: A Transformer-based architecture predicts the 2D corner heatmap $\mathbf{H}_q$ for the query image $I_q$.
    • Features $\mathbf{F}_i, \mathbf{F}_q$ are extracted from reference and query images using a pre-trained DINOv2 (Oquab et al., 2023) backbone.
    • Reference heatmaps $\mathbf{H}_i$ are patched and linearly projected to match the feature dimension $d$.
    • Projected heatmap tokens are added element-wise to the corresponding reference image feature tokens: $\mathbf{F}_i' = \mathbf{F}_i + \text{Linear}(\mathbf{H}^p_i)$.
    • For the query image, learned query tokens $\mathbf{Q}$ are used instead of heatmap tokens.
    • Flattened reference tokens ($\mathbf{F}_i'$) and query tokens ($\mathbf{F}_q + \mathbf{Q}$) are concatenated and fed into a Transformer decoder (12 layers, 768 dimensions, 8 heads).
    • The output query features $\mathbf{F}_q'$ are linearly projected back to the heatmap dimension and unpatchified to produce the final predicted query heatmap $\mathbf{H}_q = \text{Sigmoid}(\text{Linear}(\mathbf{F}_q'))$.
  4. Pose Estimation: The 2D corner locations $\hat{\mathbf{b}}_q$ are extracted from the peaks of the predicted heatmap $\mathbf{H}_q$. These predicted 2D corners, together with the recovered 3D bounding box corners $\mathbf{B}$, form 2D-3D correspondences, and the final 6DoF object pose $\mathbf{\xi}_q$ is recovered using a standard Perspective-n-Point (PnP) algorithm.
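Step 1 above amounts to taking per-axis extremes of the filtered point cloud. A minimal numpy sketch (the corner ordering and the use of the points' own axes as the object-centric frame are simplifying assumptions, not the paper's exact procedure):

```python
import numpy as np

def bbox_corners(points: np.ndarray) -> np.ndarray:
    """Return the 8 axis-aligned bounding box corners, shape (8, 3),
    of an (N, 3) point cloud."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    # Enumerate all 8 combinations of (min, max) per axis.
    return np.array([[x, y, z]
                     for x in (lo[0], hi[0])
                     for y in (lo[1], hi[1])
                     for z in (lo[2], hi[2])])
```

In the pipeline, `points` would be the DUSt3R output already filtered by the reference-view object masks.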
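The heatmap rendering of step 2 follows directly from the formula above; a sketch with illustrative image size and sigma (the paper adapts sigma to the object size):

```python
import numpy as np

def render_corner_heatmaps(corners_2d, h, w, sigma):
    """Render one heatmap channel per projected corner:
    H(x, y, i) = exp(-d_i(x, y) / (2 * sigma**2)),
    where d_i is the Euclidean distance to corner i."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.empty((len(corners_2d), h, w))
    for i, (cx, cy) in enumerate(corners_2d):
        d = np.sqrt((xs - cx) ** 2 + (ys - cy) ** 2)
        heat[i] = np.exp(-d / (2 * sigma ** 2))
    return heat
```

Each channel peaks at exactly 1.0 on its corner pixel and decays smoothly with distance, which is what gives the denser supervision signal compared to a one-hot target.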
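For step 4, corner extraction reduces to a per-channel argmax; the resulting (x, y) peaks pair with the 8 recovered 3D corners as input to a PnP solver such as OpenCV's `solvePnP`. A numpy sketch of the peak read-out (any sub-pixel refinement the method might use is omitted here):

```python
import numpy as np

def extract_corners(heatmaps: np.ndarray) -> np.ndarray:
    """Per-channel argmax of an (8, H, W) heatmap -> (8, 2) corners as (x, y)."""
    n, h, w = heatmaps.shape
    flat = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat, (h, w))
    # Stack as (x, y) pixel coordinates, one row per corner.
    return np.stack([xs, ys], axis=1).astype(np.float64)
```

The returned array and the 3D box corners then form the 2D-3D correspondences consumed by PnP (e.g., `cv2.solvePnP(box_3d, corners_2d, K, None)`).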

Training and Implementation:

  • Trained on the synthetic Objaverse dataset (Deitke et al., 2022, Deitke et al., 2023) and the real OnePose dataset (Sun et al., 2022).
  • Supervised using Smooth L1 loss on both the predicted heatmaps ($L_{\text{coarse}}$) and the extracted corner coordinates ($L_{\text{fine}}$), combined as $L = L_{\text{coarse}} + \lambda L_{\text{fine}}$ with $\lambda = 2.0$.
  • Extensive data augmentation includes random 3D bounding box rotation, RGB noise/blur, random backgrounds (SUN2012 (Xiao et al., 2010)), and simulated occlusions.
  • Uses the AdamW optimizer with cosine learning-rate decay (initial rate $10^{-4}$), trained for 100 epochs on 8 A100 GPUs; the number of reference images is sampled dynamically (1-15) during training.
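The combined objective above can be sketched in a few lines. This minimal numpy version uses the common Smooth L1 threshold beta = 1.0, which is an assumption here, not a value stated in the paper:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Elementwise Smooth L1 (Huber-style) loss, averaged over all elements."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def boxdreamer_loss(h_pred, h_gt, c_pred, c_gt, lam=2.0):
    """L = L_coarse (heatmaps) + lambda * L_fine (corner coordinates)."""
    return smooth_l1(h_pred, h_gt) + lam * smooth_l1(c_pred, c_gt)
```

In training, `h_pred`/`h_gt` would be the predicted and rendered 8-channel heatmaps, and `c_pred`/`c_gt` the extracted versus ground-truth 2D corner coordinates.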

Experiments and Results:

  • Evaluated on the LINEMOD (Hinterstoisser et al., 2012), Occluded-LINEMOD (Brachmann et al., 2014), YCB-Video (Xiang et al., 2017), and OnePose-LowTexture datasets.
  • Compared against Gen6D (Liu et al., 2022), OnePose++ (He et al., 2022), Cas6D (Pan et al., 2023), GS-Pose (Cai et al., 2024), and instance-level methods.
  • Significantly outperforms generalizable baselines in sparse-view settings (e.g., 5 reference views) on LINEMOD, Occluded-LINEMOD, and YCB-Video.
  • Demonstrates strong robustness to occlusion, often exceeding baselines significantly on Occluded-LINEMOD and YCB-Video.
  • Shows robustness to noise in the initial 3D bounding box reconstruction.
  • Achieves real-time inference speed (~23 ms per query image on an RTX 4090). The initial reconstruction (e.g., with DUSt3R) is a one-time offline cost per object (~11s for 10 views).

Contributions:

  1. A novel generalizable object pose estimation framework excelling in sparse-view and occlusion scenarios.
  2. The first proposal to use object bounding box corners as the core intermediate representation for this task.
  3. An end-to-end Transformer decoder architecture that effectively predicts 2D corner projections by leveraging reference view information.

Limitations: Struggles with symmetric objects and incurs higher memory usage if dense reference views are used. Future work includes handling symmetry, optimizing dense view usage, and integrating object detection.
