
BoxerNet: 2D-to-3D Object Localization Transformer

Updated 3 July 2026
  • BoxerNet is a transformer-based neural architecture that lifts 2D detections to gravity-aligned 7-DoF 3D bounding boxes by fusing RGB features with sparse depth and ray tokens.
  • It employs a multi-stage pipeline with input tokenization, self-attention encoding, cross-attention decoding, and dual output heads for box regression and uncertainty estimation.
  • BoxerNet reports higher mAP than baselines on several datasets (for example, 0.412 vs. 0.250 per-frame on CA-1M with depth), supporting open-world 3D localization for AR, robotics, and scene understanding.

BoxerNet is a transformer-based neural architecture designed for robust lifting of 2D bounding box object detections to metric 3D bounding boxes in open-world scenes. As the central module in the Boxer system, BoxerNet focuses exclusively on the 2D-to-3D problem, operating on input from arbitrary open-set 2D detectors, posed RGB images, and, optionally, geometric cues in the form of sparse point clouds or dense depth. It produces gravity-aligned 7-degree-of-freedom (DoF) 3D bounding boxes with associated aleatoric uncertainty estimates, making it suitable for scalable, annotation-efficient object localization across a broad object vocabulary.

1. Network Architecture

BoxerNet processes input in multiple stages: input tokenization, transformer-based encoding, box-conditioned decoding, and multi-head output regression.

Input Tokenization. BoxerNet employs a DINOv3 “Base” backbone to extract H′×W′ visual tokens from the input image, with typical configuration H′=30, W′=30, and feature dimension D=768 for 960×960 input resolution. Geometric cues are incorporated as follows:

  • Depth tokens: Each spatial patch receives a single scalar representing median depth, computed per patch from projected sparse SLAM points or dense depth maps. If no depth falls within a patch, the depth value is set to −1.
  • Ray tokens: Each patch center is unprojected to a camera-frame, unit-length ray encoding all intrinsic and orientation information.
  • Patch-wise tokens concatenate RGB features (768 values), depth (1 value), and ray information (3 values) into x_{i,j} ∈ ℝ^{772}.
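
The tokenization above can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes a pinhole camera (the source does not state the camera model), a 32×32-pixel cell per token (960/30), and the concatenation order RGB, depth, ray. The names `patch_center_ray` and `build_patch_token` and the intrinsics in the example are hypothetical.

```python
import math

def patch_center_ray(i, j, patch_size, fx, fy, cx, cy):
    """Unit-length camera-frame ray through the center of patch (row i, col j).

    Assumes a pinhole camera; the source does not state the camera model.
    """
    u = (j + 0.5) * patch_size  # pixel x of the patch center
    v = (i + 0.5) * patch_size  # pixel y of the patch center
    x, y, z = (u - cx) / fx, (v - cy) / fy, 1.0
    norm = math.sqrt(x * x + y * y + z * z)
    return (x / norm, y / norm, z / norm)

def build_patch_token(rgb_feature, depth, ray):
    """Concatenate the RGB feature, one depth scalar, and the 3-vector ray."""
    return list(rgb_feature) + [depth] + list(ray)

# Hypothetical intrinsics; depth = -1.0 marks a patch with no depth samples.
ray = patch_center_ray(i=0, j=0, patch_size=32, fx=500.0, fy=500.0, cx=480.0, cy=480.0)
token = build_patch_token([0.0] * 768, -1.0, ray)
```

With a 768-dimensional feature, the resulting token has 768 + 1 + 3 = 772 entries, matching the dimension stated above.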

Self-Attention Encoder. Four transformer encoder layers (12 heads, hidden dimension 768) perform self-attention over the patch tokens; each attention block is followed by a 2-layer MLP (ReLU, 128 hidden units). No extra positional encoding is added; the ray tokens serve as the spatial anchor.

2D-to-3D Cross-Attention Decoder. Each input 2D bounding box is linearly projected to a 768-dimensional token. Six cross-attention transformer layers enable each box token to attend to the encoder outputs independently, without self-attention among boxes, ensuring permutation-invariant decoding. Each layer is paired with a 2-layer MLP.
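
A toy sketch of why the absence of box-to-box self-attention makes each box's output independent of the other boxes. It uses single-head scaled dot-product attention with no learned projections, which is a deliberate simplification of the six-layer decoder; `cross_attend` and `decode_boxes` are hypothetical names.

```python
import math

def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over encoder tokens."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(logits)  # subtract the max for a numerically stable softmax
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def decode_boxes(box_tokens, encoder_tokens):
    """Each box token attends to the encoder output on its own."""
    return [cross_attend(q, encoder_tokens, encoder_tokens) for q in box_tokens]

encoder_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
box_tokens = [[2.0, 0.0], [0.0, 2.0]]
out = decode_boxes(box_tokens, encoder_tokens)
out_swapped = decode_boxes(box_tokens[::-1], encoder_tokens)
```

Reordering the boxes only reorders the outputs, and decoding a single box alone gives the same result as decoding it in a batch.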

Output Heads. Two independent output branches, each a 2-layer MLP, produce (a) regression predictions for the 7-DoF 3D box parameters and (b) a scalar log variance (aleatoric uncertainty). The final detection confidence s_k is the mean of the off-the-shelf 2D detection score and a 3D regressor-derived score given by sigmoid of the negative predicted log variance.
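
The confidence rule described above is small enough to write out directly. A minimal sketch, with `detection_confidence` as a hypothetical name:

```python
import math

def detection_confidence(score_2d, log_variance):
    """Mean of the 2D detector score and sigmoid(-log_variance)."""
    score_3d = 1.0 / (1.0 + math.exp(log_variance))  # equals sigmoid(-log_variance)
    return 0.5 * (score_2d + score_3d)

confident = detection_confidence(0.8, -2.0)  # low predicted variance
uncertain = detection_confidence(0.8, 2.0)   # high predicted variance
```

A lower predicted log variance raises the 3D term, so for the same 2D score the more certain regression receives the higher final confidence.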

2. 3D Bounding Box Parameterization and Loss

BoxerNet directly regresses the parameters of a 7-DoF, gravity-aligned 3D bounding box:

(x, y, z, w, h, d, \theta),

where (x, y, z) is the box center in meters, (w, h, d) the extents in meters, and θ the yaw about gravity. No constraints are imposed except as induced by the loss function.
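
To make the parameterization concrete, the sketch below expands a 7-DoF box into its eight corners. The axis convention is an assumption, since the source does not state it: gravity is taken as the z axis, w and d as the horizontal extents, and h as the vertical extent. `box_corners` is a hypothetical helper.

```python
import math

def box_corners(x, y, z, w, h, d, theta):
    """Eight corners of a gravity-aligned box.

    Assumed convention (not stated in the source): gravity is the z axis,
    w and d are the horizontal extents, h is the vertical extent.
    """
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for sx in (-0.5, 0.5):
        for sy in (-0.5, 0.5):
            for sz in (-0.5, 0.5):
                lx, ly, lz = sx * w, sy * d, sz * h  # offsets in the box frame
                # Rotate the horizontal offsets by the yaw, then translate.
                corners.append((x + c * lx - s * ly, y + s * lx + c * ly, z + lz))
    return corners

corners = box_corners(1.0, 2.0, 3.0, 2.0, 1.0, 4.0, 0.0)
```

Under this convention, a quarter-turn of yaw combined with swapping w and d describes the same box, which is the 90° ambiguity any yaw-averaging step has to handle.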

Aleatoric Uncertainty Regression Loss. The regression loss is a variant of the symmetric Chamfer distance between eight predicted and ground-truth box corners, modulated by a learned uncertainty term (following the CuTR formulation and Kendall & Gal 2017):

\mathcal{L} = \mathcal{L}_{\text{chamfer}} \cdot \exp(-\hat{\sigma}) + \hat{\sigma}.

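The \exp(-\hat{\sigma}) factor down-weights the Chamfer term when the predicted log variance \hat{\sigma} is high, so the network can raise \hat{\sigma} on ambiguous inputs where corner errors are likely to be large. The additive \hat{\sigma} term penalizes that choice, which discourages reporting high uncertainty everywhere.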
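
A minimal sketch of this loss, assuming a plain symmetric Chamfer distance (mean nearest-neighbor Euclidean distance in both directions). The source only says "a variant", so the exact reduction is an assumption, and `chamfer` and `uncertainty_loss` are hypothetical names.

```python
import math

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets (mean nearest-neighbor distance)."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(a, b) + one_way(b, a)

def uncertainty_loss(pred_corners, gt_corners, log_variance):
    """Chamfer term scaled by exp(-log_variance), plus the log-variance penalty."""
    return chamfer(pred_corners, gt_corners) * math.exp(-log_variance) + log_variance

# Single-point sets keep the arithmetic easy to follow; real inputs have 8 corners each.
pred, gt = [(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]
loss_at_zero = uncertainty_loss(pred, gt, 0.0)
```

For a fixed Chamfer value c > 0, setting the derivative −c·exp(−σ̂) + 1 to zero gives the minimizer σ̂ = ln c, so under this form the optimal predicted log variance tracks the log of the corner error.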

3. Flexible Geometric Encoding

To operate with heterogeneous geometric cues, BoxerNet uses median depth patch encoding, calculated as:

F^{\mathrm{depth}}_{i,j} = \begin{cases} \mathrm{median}\{z_p \mid p \text{ projects into patch } (i, j)\}, & \#\{p\} \geq 1 \\ -1, & \text{no points in the patch}. \end{cases}

This minimalist encoding, concatenated to image features and ray tokens, allows the transformer to exploit both dense and sparse geometric information, optimizing performance under varying sensor and environmental conditions.
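
The encoding can be sketched as a bucketing step followed by a per-patch median. This is an illustration under assumptions: points are already projected to pixel coordinates (u, v) with depth z in meters, and points outside the image or with non-positive depth are dropped. `patch_median_depth` is a hypothetical name.

```python
from statistics import median

def patch_median_depth(points_uvz, image_size=960, grid=30):
    """Median depth per patch; -1.0 where no point projects into the patch.

    points_uvz: iterable of (u, v, z) with pixel coordinates and depth in meters.
    """
    patch = image_size / grid
    buckets = {}
    for u, v, z in points_uvz:
        # Assumption: ignore points outside the image or behind the camera.
        if not (0 <= u < image_size and 0 <= v < image_size) or z <= 0:
            continue
        i, j = int(v // patch), int(u // patch)  # row from v, column from u
        buckets.setdefault((i, j), []).append(z)
    return [[median(buckets[(i, j)]) if (i, j) in buckets else -1.0
             for j in range(grid)] for i in range(grid)]

points = [(10.0, 10.0, 2.0), (12.0, 5.0, 4.0), (20.0, 20.0, 3.0), (100.0, 40.0, 1.5)]
depth = patch_median_depth(points)
```

The same function covers dense depth maps by passing every valid pixel as a point, which is why one encoding can serve both sparse and dense inputs.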

4. Multi-View Fusion and Postprocessing

After per-frame inference, BoxerNet predictions are globally fused across views and time using sequential filtering and clustering steps:

  • 3D IoU Filtering: Pairs of 3D boxes with IoU_{3D} ≥ 0.3 are linked.
  • Semantic Filtering: Links are kept only if the class prompt similarity (computed with Sentence-BERT) is above a threshold, which further prunes associations.
  • Clustering: Connected components in this variable-edge graph define object-level clusters.
  • Rotation-Aware Averaging: Box centers and extents are merged using confidence-weighted means; yaw averaging handles 90° symmetry by allowing (w, d) swaps if variance is reduced.
  • 3D Non-Maximum Suppression: Boxes with 3D IoU ≥ 0.6 are suppressed within clusters, keeping the highest-confidence hypothesis.

This pipeline yields globally consistent, de-duplicated 3D bounding boxes for downstream use.
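
The linking, clustering, and merging steps can be sketched as follows. Several things here are assumptions rather than the authors' method: the pairwise 3D IoU and prompt-similarity matrices are taken as precomputed inputs, `sim_thresh` is a placeholder because the source gives no semantic threshold, and the yaw handling aligns each box to the top-scoring box instead of the variance-based swap test. The NMS step is omitted. `cluster_boxes` and `merge_cluster` are hypothetical names.

```python
import math

def cluster_boxes(num_boxes, iou, sim, iou_thresh=0.3, sim_thresh=0.5):
    """Connected components over links that pass both the IoU and semantic checks."""
    parent = list(range(num_boxes))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a in range(num_boxes):
        for b in range(a + 1, num_boxes):
            if iou[a][b] >= iou_thresh and sim[a][b] >= sim_thresh:
                parent[find(a)] = find(b)

    clusters = {}
    for a in range(num_boxes):
        clusters.setdefault(find(a), []).append(a)
    return sorted(clusters.values())

def merge_cluster(boxes, scores):
    """Confidence-weighted merge of (x, y, z, w, h, d, theta) boxes.

    Each box is re-expressed with the 90-degree-equivalent yaw closest to the
    top-scoring box, swapping w and d for odd quarter turns (a simplification).
    """
    ref = boxes[max(range(len(boxes)), key=lambda k: scores[k])][6]
    total = sum(scores)
    acc = [0.0] * 6
    sin_sum = cos_sum = 0.0
    for (x, y, z, w, h, d, theta), s in zip(boxes, scores):
        k = round((ref - theta) / (math.pi / 2))  # quarter turns toward ref
        if k % 2:
            w, d = d, w
        theta += k * math.pi / 2
        for idx, val in enumerate((x, y, z, w, h, d)):
            acc[idx] += s * val
        sin_sum += s * math.sin(theta)
        cos_sum += s * math.cos(theta)
    return [a / total for a in acc] + [math.atan2(sin_sum, cos_sum)]

iou = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.0], [0.0, 0.0, 1.0]]
sim = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
clusters = cluster_boxes(3, iou, sim)
merged = merge_cluster([(0.0, 0.0, 0.0, 2.0, 1.0, 1.0, 0.0),
                        (0.2, 0.0, 0.0, 1.0, 1.0, 2.0, math.pi / 2)], [1.0, 1.0])
```

In the example, the second box is the first one described with a quarter-turn of yaw and swapped horizontal extents, so after alignment the merged extents stay (2, 1, 1) instead of being averaged toward a square footprint.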

5. Training Protocol and Data Regime

BoxerNet is trained on a combination of internal and public datasets, encompassing ≈1.22 million unique 3D bounding boxes and 42.1 million annotated image views from sources such as Project Aria Gen1, Quest3, NymeriaPlus, CA-1M, ScanNet (via Scan2CAD), and SUN-RGBD. The object vocabulary spans more than 1,200 class prompts sourced from LVIS and additional indoor concepts.

Data Augmentation:

  • Photometric jitter (brightness, contrast, blur, gamma)
  • Camera perturbation (randomized focal length/principal point)
  • Depth dropout (at patch or point level)
  • 2D-box tightening (projecting ground-truth 3DBB to 2D, then aligning to SAM masks)

Optimization:

  • AdamW optimizer
  • Learning rate: cosine decay from an initial value to a final value (the specific values are not given here)
  • Training duration: ~2 weeks on 16×H100 GPUs
  • Input: 960×960, bfloat16, ~25M BoxerNet parameters
  • Inference: ≈20 ms/frame on RTX 4090
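
For reference, a cosine decay schedule has the following general shape. The function below is a generic sketch with a hypothetical name, and the rates in the example are placeholders chosen only to make the arithmetic visible; they are not BoxerNet's training values.

```python
import math

def cosine_lr(step, total_steps, lr_init, lr_final):
    """Cosine decay from lr_init at step 0 to lr_final at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))

# Placeholder rates, not values from the source.
schedule = [cosine_lr(s, 100, 1.0, 0.1) for s in (0, 50, 100)]
```

The schedule starts at the initial rate, passes through the midpoint of the two rates halfway through training, and ends at the final rate.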

6. Quantitative Results and Comparative Analysis

BoxerNet achieves superior performance over CuTR and state-of-the-art baselines across a range of scenarios and datasets. Key results include:

Scenario                                Baseline (mAP)   BoxerNet (mAP)
Per-frame, NymeriaPlus (GT2D)           0.010            0.296
Per-frame, CA-1M + depth (GT2D)         0.250            0.412
Per-scene fusion, CA-1M (GT2D)          0.305            0.434
Per-scene fusion, NymeriaPlus (OWLv2)   0.013            0.145

  • BoxerNet retains a substantial advantage with RGB-only input in egocentric settings (0.296 vs. 0.010 mAP on NymeriaPlus).
  • With dense depth on CA-1M, BoxerNet also leads the baseline per frame (0.412 vs. 0.250 mAP with ground-truth 2D boxes).
  • Fused predictions with multi-view postprocessing yield globally consistent object reconstructions and outperform baselines by wide margins, supporting applicability in AR, robotics, and open-world 3D scene understanding (DeTone et al., 6 Apr 2026).

7. Significance and Applicability

BoxerNet demonstrates that accurate, robust open-world 3D object localization can be achieved by (1) leveraging state-of-the-art open-vocabulary 2D detectors rather than learning 2D semantics from scratch; (2) using a generic and scalable transformer-based encoder capable of ingesting both visual and geometric tokens; and (3) integrating aleatoric uncertainty for robust regression. The modularity of the system and its ability to operate with either sparse or dense geometric cues enable practical deployment in unconstrained environments. Outputs from BoxerNet are directly compatible with downstream AR, robotics, and general 3D scene understanding pipelines, closing the gap between large-scale 2D open-vocabulary detection and real-world spatial perception.
