
BoxerNet: Transformer-based 3D Localization

Updated 9 April 2026
  • BoxerNet is a transformer-based neural architecture that lifts 2D detections into 3D bounding boxes using multi-modal inputs such as image features, depth, and ray embeddings.
  • It employs a hierarchical encoder-decoder design with attention-based fusion to integrate geometric and visual data for robust 3D localization.
  • By decoupling 2D detection from 3D localization and using uncertainty modeling and geometric post-processing, BoxerNet achieves state-of-the-art performance in open-world settings.

BoxerNet is a transformer-based neural architecture for lifting 2D open-vocabulary object detections into 3D bounding box estimates using posed images, optional depth (dense or sparse), and geometric information. It constitutes the core of the Boxer system, which decouples 2D detection from 3D localization and achieves state-of-the-art results in metric 3D bounding box (3DBB) estimation in open-world settings. BoxerNet is characterized by a modular pipeline, attention-based fusion, robust aleatoric uncertainty modeling, and multi-view geometric post-processing (DeTone et al., 6 Apr 2026).

1. Input Representations and Preprocessing

BoxerNet inputs are constructed from multiple signals:

  • 2D proposals: Off-the-shelf open-world object detectors (e.g., DETIC, OWLv2, SAM3) generate $N$ 2D bounding boxes $b_i^{2D} = (x_1, y_1, x_2, y_2)^\top$, each accompanied by a score $s_i^{2D} \in [0,1]$.
  • Image features: Each input image $I \in \mathbb{R}^{H \times W \times 3}$ is processed by a DINOv3 vision backbone to extract $F^{img} \in \mathbb{R}^{H'\times W'\times D}$, with $D=768$.
  • Depth/point clouds (optional): Per-frame LiDAR, SLAM map points, or dense depth values are projected into the image and aggregated in $H'\times W'$ non-overlapping patches. Within each patch, the median camera-frame depth is computed to form $F^{depth} \in \mathbb{R}^{H'\times W'\times 1}$; patches without samples are assigned $-1$. The same encoding handles both sparse and dense depth.
  • Ray embeddings: For each patch, the center pixel is mapped to a unit 3D ray vector using the camera intrinsics and the 6-DoF pose (excluding translation, working in the gravity-aligned camera frame $\mathcal{F}_g$). The rays are aggregated as $F^{ray} \in \mathbb{R}^{H'\times W'\times 3}$.
  • Tokenization: At every spatial location $(u, v)$, features are concatenated: $t_{u,v} = [F^{img}_{u,v};\, F^{depth}_{u,v};\, F^{ray}_{u,v}]$, yielding $H'W'$ tokens of dimension $D+4$.

This multi-modal encoding conditions BoxerNet on image appearance, geometry, depth/scale hints, and camera calibration.
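
The ray-embedding and tokenization steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: the function names, the patch size of 16, and the rotation argument `R_gravity_from_cam` (a 3×3 rotation into the gravity-aligned frame) are assumptions made for the example.

```python
import numpy as np

def patch_ray_embeddings(K, R_gravity_from_cam, H, W, patch=16):
    """Unit ray through each patch center, rotated into a gravity-aligned frame.

    `patch=16` and the argument names are illustrative assumptions.
    """
    Hp, Wp = H // patch, W // patch
    # Pixel coordinates of every patch center.
    u = (np.arange(Wp) + 0.5) * patch
    v = (np.arange(Hp) + 0.5) * patch
    uu, vv = np.meshgrid(u, v)                        # each (Hp, Wp)
    pix = np.stack([uu, vv, np.ones_like(uu)], axis=-1)
    rays = pix @ np.linalg.inv(K).T                   # unproject: K^-1 [u, v, 1]
    rays = rays @ R_gravity_from_cam.T                # rotation only, no translation
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    return rays                                       # (Hp, Wp, 3)

def build_tokens(f_img, f_depth, f_ray):
    """Concatenate per-patch features and flatten the grid into a token sequence."""
    tokens = np.concatenate([f_img, f_depth, f_ray], axis=-1)
    return tokens.reshape(-1, tokens.shape[-1])       # (Hp * Wp, D + 4)
```

With $D = 768$, a 4×4 patch grid gives 16 tokens of dimension 772 (768 image channels, 1 depth channel, 3 ray components).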

2. Transformer Backbone Architecture

BoxerNet's core is a hierarchical transformer with an encoder-decoder structure:

  • Encoder: A stack of four multi-head self-attention (MSA) plus MLP layers processes the patch token sequence, employing standard hidden dimensionality (768) and 12 attention heads.
  • Decoder (box queries): Each 2D proposal is mapped via a linear projection to a "box query" $q_i$. For each query, six layers of cross-attention (each box query attending to all encoder tokens, without inter-query self-attention) produce a latent vector $z_i$, maintaining permutation invariance across proposals.
  • Output heads: Two MLP heads (each 2 layers, 128 hidden units, ReLU activation) process $z_i$:
    • Head$_{box}$: Predicts the 3DBB parameters $\hat{b}_i^{3D} \in \mathbb{R}^7$ for 3D center, size, and gravity-yaw angle.
    • Head$_{\sigma}$: Predicts $\log \sigma_i^2$, an aleatoric log-variance used for robust regression.

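No self-attention is applied among box queries, so each proposal is decoded independently of the others. This keeps the decoder cost linear in the number of proposals and makes the output for a given proposal independent of proposal order.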
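
The independence of box queries can be illustrated with a stripped-down decoder. The sketch below is an assumption-laden simplification: it uses a single attention head, omits the MLP and normalization sublayers, and takes arbitrary weight matrices, so it shows only the cross-attention pattern rather than the actual BoxerNet layers. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_layer(queries, tokens, Wq, Wk, Wv):
    """Single-head cross-attention with a residual connection.

    Each query attends to all encoder tokens; queries never attend to each other.
    """
    q = queries @ Wq                           # (N, d)
    k = tokens @ Wk                            # (T, d)
    v = tokens @ Wv                            # (T, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)   # (N, T)
    return queries + attn @ v

def decode(boxes_2d, tokens, W_in, layers):
    """Project 2D boxes (N, 4) to queries, then apply the cross-attention stack."""
    queries = boxes_2d @ W_in                  # linear projection to box queries
    for Wq, Wk, Wv in layers:
        queries = cross_attention_layer(queries, tokens, Wq, Wk, Wv)
    return queries                             # one latent vector per proposal
```

Because no operation mixes rows of `queries`, permuting the input proposals permutes the outputs in the same way, and adding or removing a proposal leaves the other latents unchanged.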

3. Output Parameterization and Uncertainty

Each output 3DBB is represented as:

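$$b_i^{3D} = (c_x, c_y, c_z, s_x, s_y, s_z, \theta)^\top \in \mathbb{R}^7$$

where $(c_x, c_y, c_z)$ is the 3D center, $(s_x, s_y, s_z)$ the box size, and $\theta$ the yaw about the gravity axis. The network predicts a per-instance log-variance $\log \sigma_i^2$, modeling predictive aleatoric uncertainty. The 3D box confidence is: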

[equation for the 3D confidence $s_i^{3D}$ as a function of $\log \sigma_i^2$; not recoverable from the source text]

The final detection confidence used in downstream fusion is the mean of the 2D and 3D confidences:

$$s_i = \tfrac{1}{2}\left(s_i^{2D} + s_i^{3D}\right)$$

This design eliminates the need for end-to-end box classification or matching losses in BoxerNet.

4. Training Losses and Median Depth Encoding

The only learnable objective in BoxerNet is 3D box regression with uncertainty:

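  • Regression loss: Using the Chamfer distance $d_i^{CD}$ between predicted and ground-truth box corners, the loss per proposal takes the heteroscedastic form

$$\mathcal{L}_i = \exp(-\log \sigma_i^2)\, d_i^{CD} + \log \sigma_i^2$$

as in Kendall & Gal (2017), up to constant factors. The first term penalizes error, weighted more heavily when the predicted variance is small; the second term penalizes inflating the variance to hide error.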

  • Median depth patch encoding: The image is partitioned into $H' \times W'$ patches. For each patch $p$ with valid depth points $\{d_j\}_{j \in p}$, the median $\tilde{d}_p = \mathrm{median}_{j \in p}\, d_j$ is recorded; if no depth is present, $\tilde{d}_p = -1$. This provides scale/context cues while accommodating sparsity.
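
The median depth patch encoding can be sketched as below. This is a minimal version under stated assumptions: the function name, the patch size of 16, and the input convention (arrays of projected pixel coordinates `u`, `v` with camera-frame depths `z`) are illustrative choices, not taken from the source.

```python
import numpy as np

def median_depth_patches(u, v, z, H, W, patch=16, missing=-1.0):
    """Per-patch median depth on an (H // patch, W // patch) grid.

    u, v: pixel coordinates of projected depth samples; z: their depths.
    Patches that receive no valid sample keep the value `missing` (-1).
    """
    Hp, Wp = H // patch, W // patch
    out = np.full((Hp, Wp, 1), missing, dtype=np.float32)
    # Keep samples with positive depth that fall inside the patch grid.
    ok = (z > 0) & (u >= 0) & (u < Wp * patch) & (v >= 0) & (v < Hp * patch)
    pu = (u[ok] // patch).astype(int)
    pv = (v[ok] // patch).astype(int)
    flat = pv * Wp + pu                      # flat patch index per sample
    zz = z[ok]
    for p in np.unique(flat):
        out[p // Wp, p % Wp, 0] = np.median(zz[flat == p])
    return out
```

The same function serves sparse inputs (a handful of SLAM points) and dense inputs (every pixel of a depth map flattened into `u`, `v`, `z`); only the number of empty patches changes.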

Training is performed end-to-end with >1.2 million unique 3DBBs; 2D proposal detectors and their losses are fixed.
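
To make the uncertainty-weighted regression objective concrete, the following sketch computes a symmetric Chamfer distance between the eight corners of two boxes and combines it with a predicted log-variance. Several details are assumptions for illustration: the corner ordering, yaw taken about the z axis, the particular symmetric Chamfer form, and the constant factors of the loss.

```python
import numpy as np

def box_corners(box):
    """Eight corners of a box (cx, cy, cz, sx, sy, sz, yaw); yaw about z (assumed)."""
    cx, cy, cz, sx, sy, sz, yaw = box
    signs = np.array([[i, j, k] for i in (-1, 1) for j in (-1, 1) for k in (-1, 1)],
                     dtype=float)
    local = signs * np.array([sx, sy, sz]) / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return local @ Rz.T + np.array([cx, cy, cz])

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets of shape (n, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def uncertainty_loss(pred_box, gt_box, log_var):
    """Heteroscedastic loss: exp(-log_var) * distance + log_var."""
    d = chamfer(box_corners(pred_box), box_corners(gt_box))
    return np.exp(-log_var) * d + log_var
```

For a fixed distance $d$, this expression is minimized at a log-variance of $\log d$, so the network is encouraged to report higher uncertainty exactly where its corner error is larger.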

5. Multi-View Fusion and Geometric Filtering

BoxerNet outputs per-frame 3DBBs, but global consistency and duplicate suppression necessitate post-hoc fusion:

  • Stage 1: For all pairs of 3DBBs across frames, compute the 3D IoU. Retain edges only where the IoU exceeds a threshold $\tau_{IoU}$.
  • Stage 2: Evaluate semantic consistency using a text embedding (SBERT) of the detector prompt $t_i$; edges are kept if the cosine similarity exceeds a threshold $\tau_{sem}$.
  • Stage 3: Build an undirected graph with boxes as nodes and an edge wherever both the IoU and semantic tests pass. Connected components $\mathcal{C}_k$ represent spatiotemporally consistent clusters.
  • Stage 4: For each cluster, fuse predictions using detection confidence weights. Yaw ($\theta$) values are aligned modulo 90°, then averaged; positions and sizes are fused by weighted Euclidean mean.
  • Stage 5: Non-maximum suppression (NMS) with threshold $\tau_{NMS}$ is applied to yield the final, de-duplicated 3DBBs for the entire scene.

This fusion architecture is model-agnostic regarding the number of input views and robust to erratic single-frame detection.
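
Stages 1 through 4 of the fusion can be sketched as follows. This is an illustrative simplification, not the published procedure: the IoU treats boxes as axis-aligned (yaw ignored), the thresholds `tau_iou` and `tau_sem` are placeholder values, the text embeddings are assumed to be precomputed, and the final NMS stage is omitted.

```python
import numpy as np

def aabb_iou_3d(a, b):
    """IoU of two boxes treated as axis-aligned (a simplification: yaw is ignored)."""
    a_min, a_max = a[:3] - a[3:6] / 2, a[:3] + a[3:6] / 2
    b_min, b_max = b[:3] - b[3:6] / 2, b[:3] + b[3:6] / 2
    overlap = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter = np.prod(overlap)
    return inter / (np.prod(a[3:6]) + np.prod(b[3:6]) - inter)

def cluster(boxes, text_emb, tau_iou=0.25, tau_sem=0.7):
    """Connected components over edges passing both IoU and semantic tests."""
    n = len(boxes)
    parent = list(range(n))

    def find(i):                               # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    unit = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    for i in range(n):
        for j in range(i + 1, n):
            if aabb_iou_3d(boxes[i], boxes[j]) > tau_iou and unit[i] @ unit[j] > tau_sem:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

def fuse(boxes, conf):
    """Confidence-weighted fusion of one cluster; yaw aligned modulo 90 degrees."""
    w = conf / conf.sum()
    ref = boxes[np.argmax(conf), 6]            # yaw of the most confident box
    quarter = np.pi / 2
    # Shift each yaw by a multiple of 90 degrees to land nearest the reference.
    yaw = boxes[:, 6] - np.round((boxes[:, 6] - ref) / quarter) * quarter
    fused = np.empty(7)
    fused[:6] = w @ boxes[:, :6]               # weighted mean of center and size
    fused[6] = w @ yaw
    return fused
```

Aligning yaw before averaging matters because a box rotated by 90° describes nearly the same cuboid footprint; averaging raw angles of 0.1 rad and 0.1 + π/2 rad would otherwise produce a box rotated by roughly 45°.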

6. Empirical Performance and System Characteristics

BoxerNet demonstrates high performance on open-world 3D bounding box estimation:

  • State-of-the-art mAP: On egocentric input without dense depth, BoxerNet achieves 0.532 mAP versus 0.010 mAP for CuTR. On the CA-1M dataset with dense depth, 0.412 versus 0.250 mAP.
  • Model complexity: BoxerNet contains approximately 25M parameters (excluding the DINOv3 backbone).
  • Pipeline throughput: The transformer backbone consists of 4 encoder and 6 decoder (cross-attention) layers, each with 12 heads, optimized for computational tractability and scaling to large numbers of proposals and image tokens.
  • Design decoupling: By leveraging existing 2D proposal methods, BoxerNet does not require 3D bounding box annotations during detector training, reducing dataset cost and supporting open-vocabulary generalization.

7. Summary Table: Architecture Components

| Component | Description | Dimensionality/Specification |
| --- | --- | --- |
| Image Features | Extracted with DINOv3 backbone | $F^{img} \in \mathbb{R}^{H'\times W'\times D}$, $D=768$ |
| Depth Patch Encoding | Per-spatial patch median depth (optional) | $F^{depth} \in \mathbb{R}^{H'\times W'\times 1}$ |
| Ray Embeddings | Gravity-aligned, per-patch unit vectors via unprojection | $F^{ray} \in \mathbb{R}^{H'\times W'\times 3}$ |
| Patch Tokens | All above concatenated per patch | $H'W'$ tokens of $(D+4)$-dim |
| Transformer Encoder | 4 MSA+MLP layers | hidden dim 768; 12 heads |
| Box Query Decoder | 6 cross-attention layers; 2D boxes mapped to queries | $N$ queries, 768-dim each |
| Output Head (box) | MLP for 3DBB (center, size, yaw) | 7-dim per proposal |
| Output Head (uncertainty) | MLP for log-variance (aleatoric uncertainty) | 1-dim per proposal |
| 3D Box Fusion | Graph clustering on 3D IoU and semantic similarity; confidence-weighted averaging | Flexible, based on clusters |

BoxerNet operationalizes robust, globally consistent open-world 3D object localization by modularly lifting 2D box proposals into 3D space, modeling geometric context, leveraging both dense and sparse depth, and providing principled uncertainty quantification (DeTone et al., 6 Apr 2026).
