Box Merging and Scoring

Updated 7 February 2026
  • Box merging and scoring is a process of fusing overlapping detection boxes from multiple models using confidence-weighted averaging.
  • Weighted Boxes Fusion clusters predictions based on IoU thresholds to generate consensus, leading to improved localization and higher mAP scores.
  • The method balances cross-model agreement with raw confidences, outperforming traditional NMS and Soft-NMS while incurring higher computational cost.

Box merging and scoring are fundamental components of object detection post-processing, particularly in the context of ensembling or consolidating detection results from multiple models or test-time augmentations. Instead of discarding overlapping predictions as in traditional approaches such as Non-Maximum Suppression (NMS), advanced methods like Weighted Boxes Fusion (WBF) utilize confidence-aware merging strategies to produce final bounding box predictions with improved localization and scoring fidelity (Solovyev et al., 2019).

1. Formal Definition and Problem Statement

Box merging refers to the fusion of multiple, often overlapping, detection bounding boxes, each typically associated with a confidence score, into a smaller set of consensus boxes. Given $N$ detectors (or $N$ variant predictions), each detector $m$ emits a set of boxes $B^{(m)} = \{(b_i^{(m)}, c_i^{(m)})\}$, where $b_i^{(m)}$ are the box coordinates and $c_i^{(m)}$ their confidences. The goal is to merge all predictions into a list $F = \{(b_j^*, c_j^*)\}$ that reflects both tight localization and cross-model agreement (Solovyev et al., 2019).

Box scoring is the process of assigning or recalibrating the confidence for each resulting box, often according to agreement among models, the distribution of input box confidences, or both.

2. Clustering and Merging Algorithms

Weighted Boxes Fusion defines the canonical approach. All predicted boxes from all sources are processed in descending order of confidence. Each box is either assigned to an existing cluster (if it overlaps that cluster's current fused box by at least a threshold IoU, $\mathrm{THR}$) or starts a new cluster. Formally, a box $b$ is assigned to cluster $L_k$ when $\mathrm{IoU}(b, b_k^*) > \mathrm{THR}$ (Solovyev et al., 2019).
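The cluster-assignment test can be sketched as follows; this is a minimal illustration, and the helper names (`iou`, `find_cluster`) are not from the paper. Boxes are assumed to be in corner format $(x_1, y_1, x_2, y_2)$:

```python
# Illustrative sketch of the IoU-based cluster assignment; boxes are
# (x1, y1, x2, y2) tuples in corner format.
def iou(a, b):
    """Intersection-over-Union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def find_cluster(box, fused_boxes, thr=0.55):
    """Index of the first fused box with IoU above thr, or -1 for a new cluster."""
    for k, fused in enumerate(fused_boxes):
        if iou(box, fused) > thr:
            return k
    return -1
```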

Within each cluster, the fused box coordinates $b_k^* = (x_{1,k}^*, y_{1,k}^*, x_{2,k}^*, y_{2,k}^*)$ are computed as a confidence-weighted average:

$$x_{1,k}^* = \frac{\sum_{i=1}^T w_i x_{1,i}}{\sum_{i=1}^T w_i}, \qquad y_{1,k}^* = \frac{\sum_{i=1}^T w_i y_{1,i}}{\sum_{i=1}^T w_i}$$

$$x_{2,k}^* = \frac{\sum_{i=1}^T w_i x_{2,i}}{\sum_{i=1}^T w_i}, \qquad y_{2,k}^* = \frac{\sum_{i=1}^T w_i y_{2,i}}{\sum_{i=1}^T w_i}$$

where $w_i$ is typically $c_i$ or an alternative monotonic transform such as $c_i^2$ (Solovyev et al., 2019).
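The weighted average above can be written compactly; this sketch assumes a cluster is a list of `(box, confidence)` pairs with boxes in corner format, and `weight_fn` stands in for the choice of $w_i = c_i$ or $w_i = c_i^2$:

```python
# Sketch of the confidence-weighted coordinate average for one cluster.
# Each member is ((x1, y1, x2, y2), confidence).
def fuse_cluster(members, weight_fn=lambda c: c):
    """Fuse a cluster's boxes; weight_fn maps confidence to weight (c or c**2)."""
    weights = [weight_fn(c) for _, c in members]
    total = sum(weights)
    return tuple(
        sum(w * box[d] for (box, _), w in zip(members, weights)) / total
        for d in range(4)  # x1, y1, x2, y2 in turn
    )
```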

3. Box Scoring Strategies

Weighted Boxes Fusion decouples scoring from raw model confidence. The preliminary score for a merged box is the mean of all confidences in its cluster:

$$\bar{c}_k = \frac{1}{T} \sum_{i=1}^T c_i$$

To favor detections supported by all or most models, this score is multiplied by a consensus factor such as $\frac{\min(T, N)}{N}$ or $\frac{T}{N}$, where $T$ is the number of boxes merged into the cluster and $N$ the number of models (Solovyev et al., 2019).

This penalizes isolated predictions (low $T$) and rewards cross-model agreement (high $T$). The result is a final confidence $c_k^*$ that upweights consensus and downweights uncertain detections.
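The rescoring step amounts to one line of arithmetic; in this sketch, `clip=True` selects the $\min(T, N)/N$ variant and `clip=False` the plain $T/N$ variant:

```python
# Sketch of consensus rescoring: mean cluster confidence scaled by agreement.
# T = number of boxes merged into the cluster, N = number of models.
def rescore(confidences, n_models, clip=True):
    """Mean cluster confidence times the consensus factor."""
    t = len(confidences)
    mean_c = sum(confidences) / t
    factor = min(t, n_models) / n_models if clip else t / n_models
    return mean_c * factor
```

With `clip=False`, a cluster containing more boxes than there are models (possible under test-time augmentation) can push the factor above 1, which is why the clipped variant is often preferred.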

4. Algorithmic Workflows and Pseudocode

Weighted Boxes Fusion proceeds as follows:

  1. Input preparation: Aggregate all $(b_i, c_i)$ pairs from $N$ models.
  2. Sorting: Order all boxes by descending confidence.
  3. Clustering: For each box, assign it to an existing cluster if $\mathrm{IoU} > \mathrm{THR}$ with that cluster’s fused box; otherwise, start a new cluster.
  4. Fusing: For each cluster, update the fused coordinates using confidence-weighted averaging after each new assignment.
  5. Scoring: Calculate the mean confidence in each cluster and apply a consensus factor.
  6. Output: Return the set of final fused boxes with recalibrated scores.
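The six steps above can be sketched end to end as follows. This is an illustrative implementation under stated assumptions (corner-format boxes, weights $w_i = c_i$, clipped consensus factor), not the paper's reference code:

```python
# Illustrative end-to-end WBF sketch. model_boxes is a list with one entry
# per model, each a list of ((x1, y1, x2, y2), confidence) pairs.
def wbf(model_boxes, iou_thr=0.55):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def fuse(cluster):
        w = [c for _, c in cluster]
        s = sum(w)
        return tuple(sum(wi * b[d] for (b, _), wi in zip(cluster, w)) / s
                     for d in range(4))

    n_models = len(model_boxes)
    # Steps 1-2: aggregate all boxes and sort by descending confidence.
    all_boxes = sorted((p for preds in model_boxes for p in preds),
                       key=lambda p: p[1], reverse=True)
    clusters, fused = [], []
    for box, conf in all_boxes:
        # Step 3: assign to the first cluster whose fused box overlaps enough.
        for k, fb in enumerate(fused):
            if iou(box, fb) > iou_thr:
                clusters[k].append((box, conf))
                fused[k] = fuse(clusters[k])  # Step 4: refresh fused coordinates
                break
        else:
            clusters.append([(box, conf)])
            fused.append(box)
    # Steps 5-6: mean confidence scaled by the consensus factor min(T, N)/N.
    out = []
    for k, cl in enumerate(clusters):
        t = len(cl)
        score = (sum(c for _, c in cl) / t) * (min(t, n_models) / n_models)
        out.append((fused[k], score))
    return out
```

For example, two models agreeing on one box yields a single fused box with the full consensus factor, while a box seen by only one of the two models is kept but downweighted by $1/2$.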

Pseudocode for these steps is given in (Solovyev et al., 2019). A table summarizing key hyperparameters is provided below.

Hyperparameter     Typical Range              Description
IoU threshold      0.55–0.7                   Cluster assignment criterion
Weight function    $c_i$, $c_i^2$             Confidence weighting for averaging
Consensus factor   $T/N$, $\min(T,N)/N$       Re-scores by model agreement

5. Empirical Comparison to NMS and Soft-NMS

Traditional NMS suppresses all but the highest-confidence box in overlapping groups, discarding informative hypotheses. Soft-NMS decays the confidence of suppressed boxes without merging coordinates. Weighted Boxes Fusion uniquely retains all input hypotheses, merging both box coordinates and scores via weighted averaging (Solovyev et al., 2019).
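For contrast with WBF's merging behavior, greedy NMS can be sketched in a few lines: overlapping lower-confidence boxes are discarded outright, and their coordinates contribute nothing to the output. The helper names here are illustrative:

```python
# Minimal greedy NMS sketch for contrast with WBF: suppressed boxes are
# dropped, not merged. boxes: [((x1, y1, x2, y2), confidence), ...].
def nms(boxes, iou_thr=0.5):
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    kept = []
    for box, conf in sorted(boxes, key=lambda p: p[1], reverse=True):
        # Keep a box only if it overlaps no already-kept box too strongly.
        if all(iou(box, kb) <= iou_thr for kb, _ in kept):
            kept.append((box, conf))
    return kept
```

Soft-NMS differs only in the suppression rule: instead of dropping an overlapping box, its confidence is decayed as a function of the overlap.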

Empirical evaluations on benchmarks (COCO, Open Images) show that WBF consistently outperforms both NMS and Soft-NMS across metrics such as mAP@[.5:.95]. For example, in a two-model EffDet-B6/B7 ensemble, WBF achieved 0.5344/0.7244/0.5824 (mAP@[.5:.95], .5, .75), surpassing NMS (0.5269/0.7156/0.5737) and Soft-NMS (0.5239/0.7121/0.5677). On heterogeneous RetinaNet/MRCNN/HTC ensembles, WBF yielded 0.5982 (Open Images @0.5 IoU), versus 0.5642 (NMS) and 0.5616 (Soft-NMS) (Solovyev et al., 2019).

However, WBF is approximately $3\times$ slower than greedy NMS, with worst-case complexity $O(MK)$ ($M$: total input boxes; $K$: final output clusters; $K \ll M$ in practice).

6. Limitations and Open Research Directions

Weighted Boxes Fusion relies on hyperparameter tuning, with the IoU threshold, weighting function, and consensus factor usually selected by grid search. The method assumes pre-filtered, high-quality detections, and can degrade performance when applied directly to a single model’s raw, noisy predictions. Making these parameters adaptive or differentiable for end-to-end learning is an open research question. Extending box merging and scoring to structured outputs such as rotated boxes, segmentations, or keypoints is also an active direction (Solovyev et al., 2019).

A plausible implication is that while box merging and scoring are most effective in ensemble settings, their utility inside standard single-model detector pipelines remains limited by sensitivity to agreement among false positives.

7. Application Scope and Impact

Box merging and scoring operates as a post-processing ensemble step, with particular effectiveness in multi-model and multi-augmentation fusion pipelines for large-scale object detection. Empirically, WBF has secured high leaderboard positions on COCO and Open Images, demonstrating state-of-the-art aggregate performance. The approach preserves localization accuracy in dense or ambiguous scenarios, leveraging the diversity and consensus properties of modern detector ensembles, with practical tradeoffs hinging on runtime and calibration methodology (Solovyev et al., 2019).

References

Solovyev, R., Wang, W., & Gabruseva, T. (2019). Weighted boxes fusion: Ensembling boxes from different object detection models. arXiv:1910.13302.