BoxerNet: Transformer-based 3D Localization
- BoxerNet is a transformer-based neural architecture that lifts 2D detections into 3D bounding boxes using multi-modal inputs such as image features, depth, and ray embeddings.
- It employs a hierarchical encoder-decoder design with attention-based fusion to integrate geometric and visual data for robust 3D localization.
- By decoupling 2D detection from 3D localization and using uncertainty modeling and geometric post-processing, BoxerNet achieves state-of-the-art performance in open-world settings.
BoxerNet is a transformer-based neural architecture for lifting 2D open-vocabulary object detections into 3D bounding box estimates using posed images, optional depth (dense or sparse), and geometric information. It constitutes the core of the Boxer system, which decouples 2D detection from 3D localization and achieves state-of-the-art results in metric 3D bounding box (3DBB) estimation in open-world settings. BoxerNet is characterized by a modular pipeline, attention-based fusion, robust aleatoric uncertainty modeling, and multi-view geometric post-processing (DeTone et al., 6 Apr 2026).
1. Input Representations and Preprocessing
BoxerNet inputs are constructed from multiple signals:
- 2D proposals: Off-the-shelf open-world object detectors (e.g., DETIC, OWLv2, SAM3) generate 2D bounding boxes, each accompanied by a confidence score.
- Image features: Each input image is processed by a DINOv3 vision backbone to extract a grid of per-patch feature vectors.
- Depth/point clouds (optional): Per-frame LiDAR, SLAM map points, or dense depth values are projected into the image and aggregated in non-overlapping patches. Within each patch, the median camera-frame depth is computed to form a per-patch depth map; patches without samples are assigned a placeholder value. This encoding handles both sparse and dense depth.
- Ray embeddings: For each patch, the center pixel is mapped to a unit 3D ray vector using the camera intrinsics and the rotational part of the 6-DoF pose (translation is excluded; rays are expressed in a gravity-aligned camera frame). The rays are aggregated into a per-patch ray map.
- Tokenization: At every patch location, the image feature, median depth, and ray vector are concatenated, yielding one token per patch.
This multi-modal encoding conditions BoxerNet on image appearance, geometry, depth/scale hints, and camera calibration.
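The depth and ray inputs described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the `missing=-1.0` sentinel for empty patches, the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), and the 3x3 `rotation` matrix (camera frame to gravity-aligned frame) are all assumptions made for the example.

```python
import math
from statistics import median


def patch_median_depth(points, width, height, patch, missing=-1.0):
    """Per-patch median depth from projected samples.

    points: iterable of (u, v, z) with pixel coordinates (u, v) and
    camera-frame depth z. `missing` is a hypothetical sentinel for
    patches that receive no samples.
    """
    cols, rows = width // patch, height // patch
    buckets = [[[] for _ in range(cols)] for _ in range(rows)]
    for u, v, z in points:
        c, r = int(u // patch), int(v // patch)
        # Skip samples outside the image or with non-positive depth.
        if 0 <= r < rows and 0 <= c < cols and z > 0:
            buckets[r][c].append(z)
    return [[median(b) if b else missing for b in row] for row in buckets]


def patch_center_rays(width, height, patch, fx, fy, cx, cy, rotation):
    """Unit ray through each patch center, rotated by a 3x3 matrix.

    Only the rotation is applied; translation is ignored, so the rays
    encode viewing direction rather than camera position.
    """
    cols, rows = width // patch, height // patch
    rays = []
    for r in range(rows):
        row = []
        for c in range(cols):
            u, v = (c + 0.5) * patch, (r + 0.5) * patch
            # Pinhole unprojection of the patch-center pixel.
            d = ((u - cx) / fx, (v - cy) / fy, 1.0)
            d = tuple(
                sum(rotation[i][j] * d[j] for j in range(3)) for i in range(3)
            )
            norm = math.sqrt(sum(x * x for x in d))
            row.append(tuple(x / norm for x in d))
        rays.append(row)
    return rays
```

The median makes each patch value insensitive to a few outlying depth samples, and the same code path serves sparse and dense depth because only the number of samples per patch changes.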
2. Transformer Backbone Architecture
BoxerNet's core is a hierarchical transformer with an encoder-decoder structure:
- Encoder: A stack of four multi-head self-attention (MSA) plus MLP layers processes the patch token sequence, employing standard hidden dimensionality (768) and 12 attention heads.
- Decoder (box queries): Each 2D proposal is mapped via a linear projection to a "box query". For each query, six layers of cross-attention (each box query attends to all encoder tokens, with no inter-query self-attention) produce a latent vector, which keeps the output permutation-invariant across proposals.
- Output heads: Two MLP heads (each 2 layers, 128 hidden units, ReLU activation) process the latent vector:
  - Box head: Predicts the 3DBB parameters (3D center, size, and gravity-yaw angle).
  - Uncertainty head: Predicts an aleatoric log-variance used for robust regression.
No box-specific self-attention is used; the design ensures computational efficiency and invariance.
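To make the invariance claim concrete, here is a deliberately stripped-down sketch of cross-attention. It is an illustration under assumptions and not the BoxerNet layer: it uses a single head and identity projections, and omits residual connections, normalization, and the MLP. The function name `cross_attend` is hypothetical.

```python
import math


def cross_attend(queries, tokens):
    """Single-head dot-product cross-attention with identity projections.

    Each query attends to every token; queries never attend to each
    other, so each output depends only on its own query and the tokens.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        logits = [scale * sum(a * b for a, b in zip(q, t)) for t in tokens]
        # Subtract the max logit for a numerically stable softmax.
        top = max(logits)
        weights = [math.exp(value - top) for value in logits]
        total = sum(weights)
        outputs.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) / total
             for i in range(dim)]
        )
    return outputs
```

Because there is no query-to-query interaction, reordering the proposals reorders the outputs in the same way, and adding or removing a proposal leaves the other outputs unchanged. The cost also grows linearly with the number of proposals.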
3. Output Parameterization and Uncertainty
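Each output 3DBB is represented by seven parameters:
$$\mathbf{b} = (c_x, c_y, c_z, s_x, s_y, s_z, \theta),$$
where $(c_x, c_y, c_z)$ is the 3D center, $(s_x, s_y, s_z)$ is the size, and $\theta$ is the yaw angle about the gravity axis.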
The network also predicts a per-instance log-variance, which models predictive aleatoric uncertainty. The 3D box confidence is derived from this log-variance.
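The final detection confidence used in downstream fusion is the mean of the 2D and 3D confidences:
$$c = \tfrac{1}{2}\left(c_{\mathrm{2D}} + c_{\mathrm{3D}}\right),$$
where $c_{\mathrm{2D}}$ is the 2D detector score and $c_{\mathrm{3D}}$ is the 3D box confidence.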
This design eliminates the need for end-to-end box classification or matching losses in BoxerNet.
4. Training Losses and Median Depth Encoding
The only learnable objective in BoxerNet is 3D box regression with uncertainty:
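- Regression loss: With $d_{\mathrm{CD}}$ denoting the Chamfer distance between predicted and ground-truth box corners and $s$ the predicted log-variance, the loss per proposal takes the uncertainty-weighted form (up to constant factors)
$$\mathcal{L} = \exp(-s)\, d_{\mathrm{CD}} + s,$$
as in Kendall & Gal (2017), penalizing both regression error and overconfident uncertainty estimates.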
- Median depth patch encoding: The image is partitioned into non-overlapping patches. For each patch containing valid depth points, the median depth is recorded; if no depth is present, the patch is assigned a placeholder value. This provides scale and context cues while accommodating sparsity.
Training is performed end-to-end with >1.2 million unique 3DBBs; 2D proposal detectors and their losses are fixed.
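The regression objective can be illustrated with a small sketch. Several details here are assumptions rather than facts from the source: the axis convention (z is the gravity axis, yaw rotates about z), the symmetric mean nearest-neighbour form of the Chamfer distance, the loss form `exp(-s) * d + s` without constant factors, and all function names.

```python
import math
from itertools import product


def box_corners(cx, cy, cz, sx, sy, sz, yaw):
    """Eight corners of a box rotated by `yaw` about the z axis.

    Assumes z is aligned with gravity (a convention chosen for this sketch).
    """
    c, s = math.cos(yaw), math.sin(yaw)
    corners = []
    for dx, dy, dz in product((-0.5, 0.5), repeat=3):
        x, y, z = dx * sx, dy * sy, dz * sz
        corners.append((cx + c * x - s * y, cy + s * x + c * y, cz + z))
    return corners


def chamfer(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    def one_way(p, q):
        return sum(min(math.dist(u, v) for v in q) for u in p) / len(p)
    return one_way(a, b) + one_way(b, a)


def uncertainty_loss(pred_box, gt_box, log_var):
    """Heteroscedastic loss exp(-s) * d + s in the style of Kendall & Gal.

    The first term shrinks the error where predicted uncertainty is high;
    the second term charges the model for claiming high uncertainty.
    """
    d = chamfer(box_corners(*pred_box), box_corners(*gt_box))
    return math.exp(-log_var) * d + log_var
```

For a fixed corner distance $d > 0$, this loss is minimized at $s = \ln d$, so the predicted log-variance is pushed to track the size of the actual error. Comparing corners rather than raw parameters also means a cube rotated by 90° incurs no penalty, since its corner set is unchanged.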
5. Multi-View Fusion and Geometric Filtering
BoxerNet outputs per-frame 3DBBs, but global consistency and duplicate suppression necessitate post-hoc fusion:
- Stage 1: For all pairs of 3DBBs across frames, compute the 3D IoU. Retain edges only where the IoU exceeds a threshold.
- Stage 2: Evaluate semantic consistency using a text embedding (SBERT) of each box's detector prompt; edges are kept only if the cosine similarity exceeds a threshold.
- Stage 3: Build an undirected graph whose nodes are boxes and whose edges are the pairs that pass both the IoU and semantic tests. Connected components represent spatiotemporally consistent clusters.
- Stage 4: For each cluster, fuse predictions using detection confidences as weights. Yaw values are aligned modulo 90°, then averaged; positions and sizes are fused by weighted Euclidean mean.
- Stage 5: Non-maximum suppression (NMS) with a fixed threshold is applied to yield the final, de-duplicated 3DBBs for the entire scene.
This fusion procedure is agnostic to the number of input views and robust to erratic single-frame detections.
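Stages 3 and 4 can be sketched as follows, taking the surviving edges from Stages 1 and 2 as given (3D IoU, text embeddings, and NMS are outside this example). The function names are hypothetical. One step is an assumption not stated in the text: when a yaw is shifted by an odd multiple of 90° during alignment, the two horizontal extents are swapped so the box keeps its physical footprint.

```python
import math


def connected_components(n, edges):
    """Group node indices 0..n-1 into clusters using union-find."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def fuse_cluster(boxes, confidences):
    """Confidence-weighted fusion of (cx, cy, cz, sx, sy, sz, yaw) boxes.

    Yaws are aligned modulo 90 degrees to the most confident box before
    averaging, so the mean is taken over nearby angles.
    """
    best = max(range(len(boxes)), key=lambda i: confidences[i])
    ref_yaw = boxes[best][6]
    total = sum(confidences)
    fused = [0.0] * 7
    for box, weight in zip(boxes, confidences):
        cx, cy, cz, sx, sy, sz, yaw = box
        # Number of quarter turns that brings this yaw closest to ref_yaw.
        k = round((ref_yaw - yaw) / (math.pi / 2))
        if k % 2:
            # Assumption: an odd quarter turn swaps the horizontal extents.
            sx, sy = sy, sx
        aligned = (cx, cy, cz, sx, sy, sz, yaw + k * math.pi / 2)
        for i, value in enumerate(aligned):
            fused[i] += weight * value / total
    return tuple(fused)
```

A typical use is to call `connected_components` on the filtered edge list and then `fuse_cluster` once per cluster. Without the alignment step, two observations of the same box at yaws of 0° and 90° would average to 45°, which matches neither observation.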
6. Empirical Performance and System Characteristics
BoxerNet demonstrates high performance on open-world 3D bounding box estimation:
- State-of-the-art mAP: On egocentric input without dense depth, BoxerNet achieves 0.532 mAP versus 0.010 mAP for CuTR. On the CA-1M dataset with dense depth, it achieves 0.412 versus 0.250 mAP.
- Model complexity: BoxerNet contains approximately 25M parameters (excluding the DINOv3 backbone).
- Pipeline throughput: The transformer backbone consists of 4 encoder and 6 decoder (cross-attention) layers, each with 12 heads, optimized for computational tractability and scaling to large numbers of proposals and image tokens.
- Design decoupling: By leveraging existing 2D proposal methods, BoxerNet does not require 3D bounding box annotations during detector training, reducing dataset cost and supporting open-vocabulary generalization.
7. Summary Table: Architecture Components
| Component | Description | Dimensionality/Specification |
|---|---|---|
| Image Features | Extracted with DINOv3 backbone | One feature vector per patch |
| Depth Patch Encoding | Per-patch median depth (optional) | One scalar per patch |
| Ray Embeddings | Gravity-aligned, per-patch unit vectors via unprojection | One unit 3-vector per patch |
| Patch Tokens | All above concatenated per patch | One token per patch |
| Transformer Encoder | 4 MSA+MLP layers | hidden dim 768; 12 heads |
| Box Query Decoder | 6 cross-attention layers; 2D boxes mapped to queries | One query per 2D proposal |
| Output Head (box) | MLP for 3DBB (center, size, yaw) | 7-dim per proposal |
| Output Head (uncertainty) | MLP for log-variance (aleatoric uncertainty) | 1-dim per proposal |
| 3D Box Fusion | Graph clustering on 3D IoU and semantic similarity; confidence-weighted averaging | Flexible, based on clusters |
BoxerNet operationalizes robust, globally consistent open-world 3D object localization by modularly lifting 2D box proposals into 3D space, modeling geometric context, leveraging both dense and sparse depth, and providing principled uncertainty quantification (DeTone et al., 6 Apr 2026).