Box Merging and Scoring
- Box merging and scoring is a process of fusing overlapping detection boxes from multiple models using confidence-weighted averaging.
- Weighted Boxes Fusion clusters predictions based on IoU thresholds to generate consensus, leading to improved localization and higher mAP scores.
- The method balances cross-model agreement with raw confidences, outperforming traditional NMS and Soft-NMS while incurring higher computational cost.
Box merging and scoring are fundamental components of object detection post-processing, particularly in the context of ensembling or consolidating detection results from multiple models or test-time augmentations. Instead of discarding overlapping predictions as in traditional approaches such as Non-Maximum Suppression (NMS), advanced methods like Weighted Boxes Fusion (WBF) utilize confidence-aware merging strategies to produce final bounding box predictions with improved localization and scoring fidelity (Solovyev et al., 2019).
1. Formal Definition and Problem Statement
Box merging refers to the fusion of multiple, often overlapping, detection bounding boxes, each typically associated with a confidence score, into a smaller set of consensus boxes. Given N detectors (or variant predictions), each detector emits a set of boxes {(b_i, c_i)}, where b_i = (x1, y1, x2, y2) are the box coordinates and c_i ∈ [0, 1] their confidences. The goal is to merge all predictions into a list of fused boxes {(F_j, C_j)} that reflects both tight localization and cross-model agreement (Solovyev et al., 2019).
Box scoring is the process of assigning or recalibrating the confidence for each resulting box, often according to agreement among models, the distribution of input box confidences, or both.
2. Clustering and Merging Algorithms
Weighted Boxes Fusion defines the canonical approach. All predicted boxes from all sources are processed in descending order of confidence. Each box is either assigned to an existing cluster (if it overlaps with that cluster's current fused box by at least a threshold IoU, THR) or starts a new cluster. Formally, for each box b, assignment to cluster j requires IoU(b, F_j) ≥ THR (Solovyev et al., 2019).
Within each cluster, each fused box coordinate is computed as a confidence-weighted average:

X = (Σ_i w_i x_i) / (Σ_i w_i),

where the weight w_i is typically the confidence c_i or an alternative monotonic transform such as c_i² (Solovyev et al., 2019).
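The weighted average for a single cluster can be sketched in a few lines of Python; the `(x1, y1, x2, y2)` box format and the function names here are illustrative assumptions, not the paper's notation:

```python
def fuse_cluster(boxes, scores, weight_fn=lambda c: c):
    """Confidence-weighted average of one cluster's coordinates.

    boxes     : list of (x1, y1, x2, y2) tuples in the same cluster
    scores    : matching confidences c_i
    weight_fn : monotonic transform of the confidence, e.g. c or c**2
    """
    weights = [weight_fn(c) for c in scores]
    total = sum(weights)
    return tuple(
        sum(w * b[k] for w, b in zip(weights, boxes)) / total
        for k in range(4)
    )

# Two overlapping hypotheses; the fused box leans toward the more confident one.
fused = fuse_cluster([(10, 10, 50, 50), (14, 14, 54, 54)], [0.9, 0.3])
```

With weights 0.9 and 0.3, the fused x1 is (0.9·10 + 0.3·14) / 1.2 = 11.0, closer to the higher-confidence box.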
3. Box Scoring Strategies
Weighted Boxes Fusion decouples scoring from raw model confidence. The preliminary score for a merged box is the mean of all confidences in its cluster:

C = (1/T) Σ_{i=1}^{T} c_i.

To favor detections supported by all or most models, this score is multiplied by a consensus factor such as min(T, N)/N or T/N, where T is the number of boxes merged into the cluster and N the number of models (Solovyev et al., 2019).
This penalizes isolated predictions (low T) and rewards cross-model agreement (high T). The result is a final confidence that upweights consensus and downweights uncertain detections.
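The rescoring rule can be illustrated with a minimal sketch, using the min(T, N)/N variant of the consensus factor (the function name is an assumption for illustration):

```python
def rescore(cluster_scores, n_models):
    """Mean confidence of a cluster, rescaled by a consensus factor.

    cluster_scores : confidences of the T boxes merged into the cluster
    n_models       : number of models N in the ensemble
    """
    t = len(cluster_scores)
    mean_conf = sum(cluster_scores) / t
    return mean_conf * min(t, n_models) / n_models

# A box found by all 3 models keeps its mean score (factor 3/3 = 1);
# a box found by only one model is downweighted (factor 1/3).
agreed = rescore([0.9, 0.8, 0.7], n_models=3)
lonely = rescore([0.9], n_models=3)
```

Here the consensus box keeps its mean confidence of 0.8, while the isolated detection drops from 0.9 to 0.3.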
4. Algorithmic Workflows and Pseudocode
Weighted Boxes Fusion proceeds as follows:
- Input preparation: Aggregate all (b_i, c_i) pairs from the N models.
- Sorting: Order all boxes by descending confidence.
- Clustering: For each box, assign it to an existing cluster if IoU ≥ THR with that cluster’s fused box; otherwise, start a new cluster.
- Fusing: For each cluster, update the fused coordinates using confidence-weighted averaging after each new assignment.
- Scoring: Calculate the mean confidence in each cluster and apply a consensus factor.
- Output: Return the set of final fused boxes with recalibrated scores.
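The workflow above can be condensed into a self-contained Python sketch. It follows the structure of the published algorithm but simplifies details (single class, identity weighting, no minimum-score filtering), so it is an illustrative sketch rather than a reference implementation:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def wbf(boxes, scores, n_models, iou_thr=0.55):
    """Single-class WBF sketch: cluster by IoU against the running fused box,
    fuse coordinates by confidence-weighted average, then rescore each
    cluster by its mean confidence times a min(T, N)/N consensus factor."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    clusters = []  # one list of (box, score) pairs per cluster
    fused = []     # running fused box per cluster
    for i in order:
        b, s = boxes[i], scores[i]
        for j, f in enumerate(fused):
            if iou(b, f) >= iou_thr:
                clusters[j].append((b, s))
                # refresh fused coordinates with a confidence-weighted average
                total = sum(w for _, w in clusters[j])
                fused[j] = tuple(
                    sum(w * bb[k] for bb, w in clusters[j]) / total
                    for k in range(4)
                )
                break
        else:
            clusters.append([(b, s)])
            fused.append(b)
    out = []
    for f, cl in zip(fused, clusters):
        t = len(cl)
        conf = sum(s for _, s in cl) / t * min(t, n_models) / n_models
        out.append((f, conf))
    return out

# Two agreeing boxes merge into one consensus box; the isolated box survives
# but its confidence is halved by the 1/2 consensus factor.
out = wbf([(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)],
          [0.9, 0.8, 0.7], n_models=2)
```

The published algorithm additionally handles class labels, per-model weights, and a minimum-score threshold for discarding weak boxes; those are omitted here for brevity.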
Pseudocode for these steps is given in (Solovyev et al., 2019). A table summarizing key hyperparameters is provided below.
| Hyperparameter | Typical Range | Description |
|---|---|---|
| IoU threshold THR | 0.55–0.7 | Cluster assignment criterion |
| Weight function w_i | c_i, c_i² | Confidence weighting for averaging |
| Consensus factor | T/N, min(T, N)/N | Re-scores by model agreement |
5. Empirical Comparison to NMS and Soft-NMS
Traditional NMS suppresses all but the highest-confidence box in overlapping groups, discarding informative hypotheses. Soft-NMS decays the confidence of suppressed boxes without merging coordinates. Weighted Boxes Fusion uniquely retains all input hypotheses, merging both box coordinates and scores via weighted averaging (Solovyev et al., 2019).
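For contrast with the fusion step, greedy NMS can be sketched as follows (a minimal single-class version with illustrative helper names):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_thr=0.5):
    """Keep the top-scoring box of each overlapping group; discard the rest.
    Unlike WBF, suppressed boxes contribute nothing to the survivors."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

# The lower-scoring overlapping box is dropped outright, not merged.
kept = greedy_nms([(10, 10, 50, 50), (12, 12, 52, 52)], [0.9, 0.8])
```

Where WBF would blend both hypotheses into one averaged box, NMS here keeps only the 0.9-confidence box and throws the other away.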
Empirical evaluations on benchmarks (COCO, Open Images) show that WBF consistently outperforms both NMS and Soft-NMS across metrics such as mAP@[.5:.95]. For example, in a two-model EffDet-B6/B7 ensemble, WBF achieved 0.5344/0.7244/0.5824 (mAP@[.5:.95], .5, .75), surpassing NMS (0.5269/0.7156/0.5737) and Soft-NMS (0.5239/0.7121/0.5677). On heterogeneous Retinanet/MRCNN/HTC ensembles, WBF yielded 0.5982 (OpenImages @0.5 IoU), versus 0.5642 (NMS) and 0.5616 (Soft-NMS) (Solovyev et al., 2019).
However, WBF is slower than greedy NMS, with worst-case complexity O(M · K) (M: total input boxes; K: final output clusters; K ≪ M in practice).
6. Limitations and Open Research Directions
Weighted Boxes Fusion relies on hyperparameter tuning, with the IoU threshold, weighting function, and consensus factor usually selected by grid search. The method assumes pre-filtered, high-quality detections and can degrade performance when applied to the raw, noisy output of a single model. Making these parameters adaptive or differentiable for end-to-end learning is an open research question. Extending box merging and scoring to structured outputs such as rotated boxes, segmentations, or keypoints is also an active direction (Solovyev et al., 2019).
A plausible implication is that while box merging and scoring are most effective in ensemble settings, their utility inside standard object detector architectures remains limited by their sensitivity to overlapping, mutually reinforcing false positives.
7. Application Scope and Impact
Box merging and scoring operates as a post-processing ensemble step, with particular effectiveness in multi-model and multi-augmentation fusion pipelines for large-scale object detection. Empirically, WBF has secured high leaderboard positions on COCO and Open Images, demonstrating state-of-the-art aggregate performance. The approach preserves localization accuracy in dense or ambiguous scenarios, leveraging the diversity and consensus properties of modern detector ensembles, with practical tradeoffs hinging on runtime and calibration methodology (Solovyev et al., 2019).