
BEV-Based NMS for 3D Object Detection

Updated 25 December 2025
  • BEV-based NMS is an in-graph, class-agnostic suppression mechanism that removes duplicate anchors by operating on rotated 2D bounding boxes in the BEV plane.
  • It computes intersection-over-union (IoU) on projected 2D boxes using CUDA acceleration, ensuring efficient and differentiable suppression in transformer decoders.
  • Empirical results on nuScenes and Waymo demonstrate notable mAP improvements and enhanced recall, particularly for small and densely clustered objects.

A Bird’s-Eye-View (BEV)-based Non-Maximum Suppression (NMS) is an in-graph, class-agnostic suppression mechanism that reduces redundancy among detection hypotheses in 3D object detection pipelines, specifically those leveraging dense BEV grid representations. Originally introduced within the DenseBEV framework, BEV-based NMS operates directly on rotated 2D bounding boxes projected onto the BEV plane, offering computational tractability and gradient backpropagation advantages compared to traditional post-processing NMS approaches for 3D detectors (Dähling et al., 18 Dec 2025).

1. Motivation and Rationale

BEV-based NMS addresses the prohibitive computational and learning inefficiencies that arise in transformer-based multi-camera 3D object detectors employing dense BEV grids. In frameworks where every cell within an $N \times M$ BEV grid (for example, $200 \times 200$, yielding $40\,000$ anchors) is considered an object anchor, naively passing all anchor queries to a DETR-style decoder is infeasible due to quadratic scaling and rampant duplicate hypotheses. Traditional NMS in 3D space, which operates on rotated 3D bounding boxes, incurs high computational expense (each pairwise comparison requires a full 3D box IoU), and as a non-differentiable operation applied post-inference, it precludes the model from learning to de-duplicate proposals early in the pipeline. In contrast, suppression based on BEV (top-down) IoU exploits the sparseness of overlapping objects in the height (z) dimension for automotive scenes, enabling efficient duplicate removal and earlier focusing of gradient flow (Dähling et al., 18 Dec 2025).

2. Mathematical Formulation

BEV-based NMS requires the computation of intersection-over-union (IoU) between pairs of rotated 2D boxes projected onto the ground plane. Each box $b_i$ is defined by $(x_i, y_i, \theta_i, w_i, \ell_i)$: center coordinates, orientation, width, and length. The overlap and IoU are given as:

$$\text{overlap}(b_i, b_j) = \text{area}(b_i \cap b_j)$$

$$\text{IoU}_{\text{BEV}}(b_i, b_j) = \frac{\text{area}(b_i \cap b_j)}{\text{area}(b_i) + \text{area}(b_j) - \text{area}(b_i \cap b_j)}$$

These pairwise computations are CUDA-accelerated, employing routines from OpenPCDet (Dähling et al., 18 Dec 2025).
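As a CPU reference for the formulas above, the overlap of two rotated BEV boxes can be computed by clipping one box's polygon against the other (Sutherland-Hodgman) and applying the shoelace formula. The sketch below is an illustrative pure-Python implementation, not the CUDA-accelerated OpenPCDet routine the paper uses:

```python
import math

def box_to_corners(x, y, theta, w, l):
    """Corners of a rotated BEV box (x, y, theta, w, l), counter-clockwise."""
    c, s = math.cos(theta), math.sin(theta)
    # half-extents along the box's local length (x') and width (y') axes
    pts = [(l / 2, w / 2), (-l / 2, w / 2), (-l / 2, -w / 2), (l / 2, -w / 2)]
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in pts]

def polygon_area(poly):
    """Shoelace formula for the area of a simple polygon."""
    n = len(poly)
    return 0.5 * abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                         - poly[(i + 1) % n][0] * poly[i][1] for i in range(n)))

def clip(subject, a, b):
    """Sutherland-Hodgman step: clip polygon to the half-plane left of edge a->b."""
    def inside(p):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0
    def intersect(p, q):
        # intersection of the infinite line a-b with the line p-q
        x1, y1, x2, y2 = a[0], a[1], b[0], b[1]
        x3, y3, x4, y4 = p[0], p[1], q[0], q[1]
        den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
        px = ((x1 * y2 - y1 * x2) * (x3 - x4) - (x1 - x2) * (x3 * y4 - y3 * x4)) / den
        py = ((x1 * y2 - y1 * x2) * (y3 - y4) - (y1 - y2) * (x3 * y4 - y3 * x4)) / den
        return (px, py)
    out = []
    for i in range(len(subject)):
        p, q = subject[i], subject[(i + 1) % len(subject)]
        if inside(q):
            if not inside(p):
                out.append(intersect(p, q))
            out.append(q)
        elif inside(p):
            out.append(intersect(p, q))
    return out

def iou_bev(box_i, box_j):
    """BEV IoU of two rotated boxes given as (x, y, theta, w, l) tuples."""
    pi, pj = box_to_corners(*box_i), box_to_corners(*box_j)
    inter = pi
    for k in range(4):  # clip box_i against each edge of box_j
        if not inter:
            break
        inter = clip(inter, pj[k], pj[(k + 1) % 4])
    overlap = polygon_area(inter) if len(inter) >= 3 else 0.0
    ai, aj = polygon_area(pi), polygon_area(pj)
    return overlap / (ai + aj - overlap)
```

Production systems batch these comparisons on the GPU; the sequential clipping here only illustrates the geometry behind the IoU formula.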

3. Algorithmic Workflow

BEV-based NMS is applied both before the transformer decoder and again within each decoder layer. The process is as follows:

  1. For a grid of size $N \times M$ with $Q = N \cdot M$ candidate queries, each cell predicts a BEV box $b_k$ and a confidence score $s_k$ via an auxiliary head.
  2. Candidates are sorted by score. Iteratively, each candidate is compared to those already selected; if the IoU exceeds a threshold $\tau$, it is suppressed.
  3. After initial suppression, the top $K_{\text{in}}$ queries are retained for decoding ($K_{\text{in}} = 900$).
  4. An attention mask $A$ of shape $[Q, Q]$ is constructed:

$$A_{kl} = \begin{cases} 0, & k \in \text{keep} \text{ and } l \in \text{keep} \\ 1, & \text{otherwise} \end{cases}$$

Suppressed queries are blocked from attention, and gradient flow is zeroed for these entries.

  5. At each transformer decoder layer, new box/score predictions are computed, BEV NMS is re-run, and the mask $A$ is updated. Only "winning" queries (those not suppressed) propagate gradients and are refined.
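The mask construction in step 4 can be sketched in a few lines of NumPy. This is an illustrative stand-in (the function name is not from the paper); in a real decoder the 1-entries would typically be converted to $-\infty$ additive biases or boolean masks before the attention softmax:

```python
import numpy as np

def build_suppression_mask(num_queries, keep):
    """A[k, l] = 0 iff both queries k and l survived BEV-NMS, else 1.
    1-entries block the (k, l) pair from attending to each other."""
    keep_flag = np.zeros(num_queries, dtype=bool)
    keep_flag[list(keep)] = True
    return 1 - np.outer(keep_flag, keep_flag).astype(np.int64)

A = build_suppression_mask(5, keep=[0, 3])  # queries 1, 2, 4 were suppressed
```

Because the mask is rebuilt at every decoder layer, a query suppressed early can in principle be unblocked later if its refreshed score wins the next NMS round.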

Algorithm Pseudocode:

def bev_nms(boxes, scores, threshold):
    """Greedy BEV NMS: keep boxes in descending score order, suppressing any
    box whose BEV IoU with an already-kept box exceeds the threshold."""
    # iou_bev is the rotated-box BEV IoU (e.g., OpenPCDet's CUDA routine).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou_bev(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
(Dähling et al., 18 Dec 2025)

4. Hyper-Parameter Space

Key parameters in BEV-based NMS modulate suppression aggressiveness and decoding capacity:

Parameter         Typical Value(s)   Role
$\tau$            0.1–0.2            BEV-IoU threshold for NMS
$K_{\text{in}}$   900                Number of decoder queries
Memory queue      300                Top-K temporal queries (hybrid modeling)
  • A lower $\tau$ (e.g., 0.1) increases suppression, yielding fewer retained queries, while a higher $\tau$ (above 0.3) degrades performance.
  • $K_{\text{in}}$ sets decoder capacity; the memory-queue length applies only in scenarios involving temporal refinement.
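The threshold behavior described above can be demonstrated with a toy greedy NMS. Axis-aligned $(x_1, y_1, x_2, y_2)$ boxes stand in for the rotated BEV boxes here, purely for illustration:

```python
def iou_axis_aligned(a, b):
    """IoU of axis-aligned boxes (x1, y1, x2, y2); toy stand-in for rotated IoU."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def greedy_nms(boxes, scores, tau):
    """Keep boxes in descending score order; drop any with IoU > tau to a keeper."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou_axis_aligned(boxes[i], boxes[j]) <= tau for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 2, 2), (0.5, 0, 2.5, 2), (5, 5, 7, 7)]  # boxes 0 and 1 overlap (IoU = 0.6)
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, tau=0.1))  # [0, 2]    lower tau: more suppression
print(greedy_nms(boxes, scores, tau=0.7))  # [0, 1, 2] higher tau: duplicates survive
```

With $\tau = 0.1$ the near-duplicate box 1 is culled; with $\tau = 0.7$ it slips through, matching the sensitivity described in the bullets above.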

5. Integration into Transformer-Based 3D Detection Pipelines

Integration proceeds as follows: the BEV encoder (e.g., BEVFormer) produces a feature map $G^t \in \mathbb{R}^{N \times M \times C}$; each cell predicts a BEV box and score; BEV-NMS is executed to cull redundant hypotheses; and the top $K_{\text{in}}$ queries are decoded by the transformer. During decoding, suppression masks are recomputed at every layer, ensuring that only non-suppressed hypotheses can attend, update, and receive gradients. Model complexity is dominated by the $O(Q^2)$ pre-decoder NMS, but the CUDA implementation and aggressive early culling keep it tractable. Decoder computational load remains invariant due to the fixed $K_{\text{in}}$. The memory footprint is minimal, primarily an extra attention mask and auxiliary heads. Empirical run-time overhead is approximately +18.5% on a single 2080Ti GPU (Dähling et al., 18 Dec 2025).
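The selection path above can be sketched in NumPy. The linear auxiliary head here is an assumed stand-in for the real prediction heads, and the BEV-NMS step between scoring and top-$K$ selection is elided; only the tensor shapes follow the text:

```python
import numpy as np

# Shapes from the text: BEV feature map G^t in R^{N x M x C}, Q = N * M cells.
N, M, C, K_in = 200, 200, 256, 900
rng = np.random.default_rng(0)
G = rng.standard_normal((N, M, C)).astype(np.float32)

# Assumed auxiliary head: one linear layer emitting 5 box params + 1 objectness logit.
W = (rng.standard_normal((C, 6)) * 0.01).astype(np.float32)
pred = G.reshape(N * M, C) @ W               # [Q, 6]
boxes = pred[:, :5]                          # (x, y, theta, w, l) per cell
scores = 1.0 / (1.0 + np.exp(-pred[:, 5]))   # sigmoid objectness

# BEV-NMS would run here to cull duplicates; the surviving top K_in queries
# (with their cell features) are then handed to the transformer decoder.
top = np.argsort(-scores)[:K_in]
decoder_queries = G.reshape(N * M, C)[top]   # [K_in, C]
```

Fixing $K_{\text{in}}$ at this point is what keeps the decoder's cost constant regardless of how many of the $40\,000$ cells fire.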

6. Comparison to Standard Post-Processing NMS

BEV-based NMS is differentiated from standard post-processing NMS in several ways:

  • Differentiability: Standard 3D NMS is non-differentiable, applied after network output, and lacks influence over duplicate proposal formation within the model. BEV-NMS is incorporated "in-loop," allowing the network to learn object separation early.
  • Efficiency: BEV suppression is less computationally intensive, as it omits the height (z) overlap computation, a simplification well suited to automotive environments where vertical overlaps are infrequent.
  • Gradient Propagation: Suppressed BEV queries do not participate in attention and their gradients are zeroed, encouraging stronger objectness and removal of duplicates throughout the training process.
  • Empirical Performance: DenseBEV demonstrates improved recall for small or dense objects (e.g., traffic cones, pedestrians) due to dense yet early-filtered anchor generation, outperforming random query or post-hoc NMS approaches (Dähling et al., 18 Dec 2025).

7. Implications and Empirical Outcomes

Empirical results on the nuScenes and Waymo Open datasets demonstrate that BEV-based NMS, as integrated in DenseBEV, yields notable improvements: a 3.8% mAP increase in pedestrian detection on nuScenes and an 8% LET-mAP increase on Waymo. DenseBEV surpasses previous state-of-the-art models by 5.4% LET-mAP (60.7% total) on Waymo. Enhanced performance is particularly evident in recovering small-object recall without a prohibitive anchor count. A plausible implication is that dense BEV anchor utilization, coupled with in-loop differentiable suppression, sets a new efficiency and performance standard for transformer-based 3D detectors in automotive settings (Dähling et al., 18 Dec 2025).
